MACSio 0.9: Design and Current Capabilities

LLNL Internal Review

Mark C. Miller

June 2015

LLNL-PRES-673936. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Page 2: What is MACSio?

Multi-purpose
Application-Centric
Scalable i/o
Proxy Application

We aim to use it to MACS-imize I/O performance

Page 3: L2 Milestone Description

This milestone will fill a gap in the DOE co-design space by developing a proxy application on which various parallel I/O methods and libraries can be easily tested and studied for scalability, performance impacts of software abstractions, impact of techniques such as compression or checksumming, and the use of burst buffers. The core of the application will be a common system for generating collections of data structures (e.g. arrays [float|double|int], strings, key/value pairs) that in aggregate closely mimic the checkpoint data of real applications. Generation of the data will be user-definable through fully parameterized attributes for the types, quantity, lengths, and randomness of data. A plug-in architecture will be designed to allow users to study different I/O libraries (e.g. HDF5 vs netCDF), parallel I/O techniques (e.g. file-per-processor vs collective), and support for burst buffers (e.g. SCR).

Page 4: L2 Milestone Completion Criteria

A functional proxy app with abstractions to support at least two I/O libraries commonly used at the ASC labs is demonstrated, and initial feedback sought from the broader community.
• Silo (r/w), HDF5 (w), Exodus (w), Raw Posix (w)

Full documentation on how the proxy app can be extended to support new plug-ins.
• Doxygenated sources and examples + design docs

A report (white paper or presentation) is delivered describing the proxy app, and reporting on performance results gathered through its use on at least two platforms (e.g. TLCC2 and Sequoia with Lustre).
• Design Doc + This presentation

Page 5: Why an I/O Proxy App is Needed

Existing “Benchmarks” are limited in scope
• More on that in a later slide

Whole apps and app kernels are often inflexible
• Often ignore I/O entirely
• Able to test only one specific way of doing I/O

Measuring and diagnosing I/O performance is complicated
• I/O Stack and level of abstraction
• File systems are getting more complex
• There are a myriad of options

Aim to evaluate a variety of I/O relevant options
• HDF5 vs. netCDF (+ params within each lib)
• Overhead of Silo on HDF5 or Exodus on netCDF
• Collective vs. Independent
• Different parallel I/O paradigms
• hzip vs. szip vs. no-compression (of realistic data)
• Burst buffers

Page 6: Common Parallel I/O Paradigms

Single, Shared File (SSF), aka “N:1”, “Rich Man’s”
• Concurrent access to a single, shared file
• Requires a “true” parallel file system
• Sometimes further divided into strided and segmented
  - My take: these terms suggest the granularity at which different MPI ranks’ data co-mingle in the file
• Often a challenge to get good, scalable performance
• Best for academic, single physics, homogeneous, and/or structured, array codes

Multiple Independent File (MIF), aka “N:M”, “Poor Man’s”
• Concurrent (app-managed) access to multiple files
• File count independent of MPI comm size
• Easier than SSF to get good, scalable performance
• Best for multi-physics, heterogeneous, and/or unstructured mesh codes

File Per Processor (FPP), aka “N:N”
• Really just a special case of MIF (but also very common)
• At extreme scale, places huge demand on file system metadata
• Sometimes needs a “throttling” knob (number of simultaneous files)

Collective vs. Independent I/O Requests
• Possible within any paradigm, but collective is typical/easiest with SSF
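The three paradigms differ mainly in how ranks map to files. The following is a minimal sketch of that mapping, not MACSio code: `file_for_rank` is a hypothetical helper, and the contiguous-block grouping used for MIF is an assumption, since the slides do not say how ranks are grouped into files.

```python
def file_for_rank(rank: int, nranks: int, mode: str, nfiles: int = 1) -> int:
    """Return the index of the file a rank writes to under each paradigm."""
    if not 0 <= rank < nranks:
        raise ValueError("rank out of range")
    if mode == "SSF":   # N:1, every rank shares file 0
        return 0
    if mode == "FPP":   # N:N, one file per rank
        return rank
    if mode == "MIF":   # N:M, file count independent of nranks
        if not 1 <= nfiles <= nranks:
            raise ValueError("need 1 <= nfiles <= nranks")
        # Assumed grouping: contiguous blocks of ranks share a file.
        return rank * nfiles // nranks
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode, nfiles in (("SSF", 1), ("MIF", 2), ("FPP", 5)):
        print(mode, [file_for_rank(r, 5, mode, nfiles) for r in range(5)])
```

Setting `nfiles` equal to `nranks` in MIF mode reproduces the FPP mapping, which is the sense in which FPP is a special case of MIF.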

Page 7:

Figure: ranks P0 to P4 writing under each paradigm.
• SSF (N:1) (strided shown): Single Shared File, with Collective and Independent labels
• MIF (N:M): Multiple Indep. File 0, Multiple Indep. File 1
• FPP (N:N): Proc File 0, Proc File 1, Proc File 2, Proc File 3, Proc File 4

Page 8: Why do codes do I/O?

Restart
• Save state of code to restart it at a later time
• Protect against the cost of re-computing after various failures
• Other aspects:
  - Precision: Full, typically cannot tolerate loss
  - Extent of content: Everything needed to successfully restart the code
  - Frequency: MTBF, re-compute cost & user tolerance
  - Longevity: short, often never migrated to main file system (SCR)
  - Portability (of data/files): Only writer needs to read it (not always) and often only on same platform

Analysis (AKA: plot, post-process, presentation, movies, etc.)
• Save only key mesh(s) & var(s) from code essential to analysis
• Other aspects:
  - Precision: Single, often can tolerate loss
  - Extent of content: Only those data objects needed for the down-stream analysis
  - Frequency: varies widely, governed by down-stream presentation/analysis needs
  - Longevity: Can be years/decades, always migrated to main file system
  - Portability (of data/files): Across many tools, platforms, file systems, versions of I/O libs and years of use; this leads to the need for useful data abstractions

Page 9: Notes to Self and What They Teach Us about the HPC I/O Stack

When I write a note I never expect anyone else to read
• Often, the result is something like this...
• Sadly, if too much time passes, I can’t even read it myself ;)

Moral: Writing stuff you want others to read requires
• care
• common conventions
• mutually agreed upon terms
• formalisms
• data models and abstractions
• “others” = “self” a year from now

Page 10: Level of Abstraction (LOA) and the HPC I/O Stack

Abstraction Levels (arrow label: Increasing LOA)
• Real World Phenomena
• Continuous Mathematics
• Discrete Numerical Models
• Prog. Language Constructs
• Platform Primary Storage
• File System
• Secondary Storage (SW)
• Secondary Storage (HW)

Picture includes levels we don’t often think of as part of the HPC I/O stack

Page 11: The HPC I/O Stack is similar to the IP Protocol Stack

Figure: the same stack of Abstraction Levels as on Page 10, from Real World Phenomena to Secondary Storage (HW), with arrow label Increasing LOA.

Page 12: Level of Abstraction (LOA) and the HPC I/O Stack

Abstraction Levels | Abstraction Objects
Real World Phenomena | Physics, Chemistry, Materials...
Continuous Mathematics | PDEs, Fields, Topologies, Manifolds
Discrete Numerical Models | Meshes, Materials, Variables
Prog. Language Constructs | Arrays, Structs, Lists, Trees
Platform Primary Storage | Ints, Floats, Pointers, Offsets, Lengths
File System | Files, Dirs, Links, Permissions, Modes
Secondary Storage (SW) | Pages, Inodes, FATs, OSTs, OSDs
Secondary Storage (HW) | Bits, Volumes, Sectors, Tracks

Arrow label: Increasing LOA

Page 13: Level of Abstraction (LOA) and the HPC I/O Stack

Abstraction Levels | Abstraction Objects | Example Implementations
Real World Phenomena | Physics, Chemistry, Materials... | ALE3D, Albany, LAPS
Continuous Mathematics | PDEs, Fields, Topologies, Manifolds | MFEM, FiberBundles, Sheafs, SAF
Discrete Numerical Models | Meshes, Materials, Variables | Silo, LibMesh, Exodus, ITAPS, VTK
Prog. Language Constructs | Arrays, Structs, Lists, Trees | HDF5, netCDF, ArrayIO, PDB
Platform Primary Storage | Ints, Floats, Pointers, Offsets, Lengths | MPI-IO, XDR, stdio, aio, mmap
File System | Files, Dirs, Links, Permissions, Modes | POSIX IO
Secondary Storage (SW) | Pages, Inodes, FATs, OSTs, OSDs | PLFS, GPFS, HDFS, Lustre; ext2/3, zfs, xfs, hfs
Secondary Storage (HW) | Bits, Volumes, Sectors, Tracks | hd, sd, cd, dvd, tape, ramdisk, flash

Arrow label: Increasing LOA

Page 14: LOA of Existing Benchmarking Tools

Figure: the three-column table from Page 13 (Abstraction Levels, Abstraction Objects, Example Implementations), annotated with the labels Many Tools and Fewer Tools along the arrow labeled Increasing LOA.

Page 15: LOA of MACSio

Figure: the three-column table from Page 13 (Abstraction Levels, Abstraction Objects, Example Implementations), annotated with the labels Existing Tools and MACSio along the arrow labeled Increasing LOA.

Page 16: Evaluation of Existing Benchmarks

LOA = Level of Abstraction
PMode = Parallel Mode
• SSF = Single Shared File
• MIF = Multiple Independent File
• FPP = File Per Processor
Coll = Collective I/Os
Abs Keep = Abstraction Preserving?
DIT = Data in transit services
• precision, cksum, compress, re-order...
MPI+ = MPI+MC/OpenMP/Threads
Perf DB = Performance Database
EZ-Extend (how hard to add)
• Parallel Mode, I/O library, X in MPI+X

Page 17: Simple Questions, Hard to Answer

What I/O performance will application X see on platform Y?
What overhead does Silo cost over HDF5?
Can collective I/O to SSF achieve better performance than MIF?
How much will compression algorithm X improve performance?
Why does HDF5 with >10000 groups hit a wall?
Application A writes 100 Gb/s via stdio, why are we seeing only 1 Gb/s with library B?

Page 18: Notional I/O Performance

Figure: plot with series labels Silo, HDF5, Lustre, Raw.
• Axis labels: Request Size; I/O Bandwidth; % of dump
• Annotations: Disk hardware limit; 20%; Overheads @ this request size; Down HPC I/O Stack; request from app; request from somewhere in stack

Page 19: Useful Features of an I/O Proxy App and how MACSio addresses them

Useful feature | How MACSio addresses it
Easy scaling of data sizes | Mesh part size & # parts/rank
Physically realistic data, not all zeros/random() | Variable exprs + noise; vary size, shape across rank
Variety of I/O libraries | Plugins: HDF5, Silo, Exodus...
Various I/O Paradigms | SSF, MIF, FPP, coll./indep.
Data Validation | Use plugin reader + cksums
Able to drive DIT services | float conv., compress, cksum, etc.
Able to drive variety & mix of I/O loads | Restart/plot, movie, linker, time-histories
Easy Timing/Logging | Timing and logging classes
Integrate with other tools | SCR, VisIt, Darshan, Caliper

Page 20: Perlin Noise or some other Chaotic Process for Numerically Similar Data

Images: variations on Ken Perlin’s procedural texture algorithms

Metric: It kinda looks like stuff HPC generates

Page 21: How MACSio 0.9 Operates

Very simple mesh + variable generation
• Vars are expressions over global spatial extents box

Very simple decomp of mesh across MPI ranks
• Different ranks can have different # of mesh parts
• Currently, individual parts are all same size
  - Should add option for randomization around nominal size

Static plugins linked into MACSio @ link time
• All methods in a plugin are file-scope static
• MACSio main gets pointers to key methods during init

Uses modified JSON-C library for marshaling data between MACSio and plugins
• Opted for JSON-C to support Fortran plugins

Page 22: MACSio’s Main Object (Dictionary)

{"clargs": <clarg-obj>,
 "parallel": {"mpi_size": <int>, "mpi_rank": <int>},
 "global_mesh": {"num_parts": <int>,
                 "bounds": [<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>]},
 "local_mesh": [<part-obj>, <part-obj>, ...]
}

Page 23: MACSio’s Main Object (Dictionary)

Would like to use VisIt’s/LibSim2 Metadata

{
  "Mesh": {
    "MeshType": "rectilinear",
    "ChunkID": 0, # unique per part
    "GeomDim": 2,
    "TopoDim": 2,
    "LogDims": [50,100],
    "Bounds": [0,0,0,1,1,0],
    "Coords": {
      "CoordBasis": "X,Y,Z",
      "XAxisCoords": [...],
      "YAxisCoords": [...]
    },
    "Topology": {
      "Type": "Templated",
      "DomainDim": 2,
      "RangeDim": 0,
      "ElemType": "Quad4",
      "Template": [0,1,51,50]
    }
  },
  "Vars": [ {...}, {...},
    { "name": "spherical", "centering": "zone", "data": [...] },
    {...}, {...}, {...} ],
  "GlobalLogIndices": [...],
  "GlobalLogOrigin": [...]
}
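One way to read the "Templated" topology entry is that the template lists the node ids of the first Quad4 element, and every other element's connectivity is that template shifted by the element's first node id. The sketch below works under that assumption, plus the assumption of row-major node numbering with 50 nodes per row (suggested by the entries 50 and 51). `quad_nodes` is a hypothetical helper, not a MACSio function.

```python
def quad_nodes(i: int, j: int, nodes_per_row: int, template: list) -> list:
    """Node ids of the quad at logical position (i, j), row-major numbering."""
    if not 0 <= i < nodes_per_row - 1 or j < 0:
        raise ValueError("zone index out of range")
    shift = j * nodes_per_row + i  # node id of the zone's first corner
    return [t + shift for t in template]


TEMPLATE = [0, 1, 51, 50]  # from the slide: node ids of the first quad

if __name__ == "__main__":
    print(quad_nodes(0, 0, 50, TEMPLATE))
    print(quad_nodes(1, 0, 50, TEMPLATE))
    print(quad_nodes(0, 1, 50, TEMPLATE))
```

Storing one template instead of a full connectivity array keeps the part object small for structured meshes.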

Page 24: JSON Data object constructed on all processors

Global parts are identical across ranks
Local parts are specific to each rank
No communication involved (yet)
Once generated, call plugin’s Dump method

    /* do the dump */
    (*(iface->dumpFunc))(argi, argc, argv, main_obj, dumpNum, dumpTime);

Page 25: MACSio Plugin Operation

Traverse main JSON-C object

Query for various metadata

    rank = JsonGetInt(main_obj, "parallel/mpi_rank");
    size = JsonGetInt(main_obj, "parallel/mpi_size");
    int ndims = JsonGetInt(part, "Mesh/GeomDim");

Decide equivalent object for plugin to write
• For rect mesh & Silo, DBPutQuadmesh/DBPutQuadvar

Convert where necessary (time apart from i/o)

Output the data (time as i/o)
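The path-style queries above can be mimicked on a nested dictionary. This is a minimal sketch, assuming "/" separates member names; `json_get` is a hypothetical stand-in for `JsonGetInt`, addressing list elements by integer index is an added assumption, and the values in `main_obj` are illustrative ones taken from the slides.

```python
from typing import Any


def json_get(obj: Any, path: str) -> Any:
    """Walk a nested dict/list using a '/'-separated path."""
    node = obj
    for key in path.split("/"):
        if isinstance(node, list):
            node = node[int(key)]   # list elements addressed by index (assumed)
        else:
            node = node[key]        # dict members addressed by name
    return node


main_obj = {
    "parallel": {"mpi_size": 4, "mpi_rank": 1},
    "global_mesh": {"num_parts": 10, "bounds": [0, 0, 0, 1, 1, 0]},
    "local_mesh": [
        {"Mesh": {"MeshType": "rectilinear", "GeomDim": 2, "LogDims": [50, 100]}},
    ],
}

if __name__ == "__main__":
    print(json_get(main_obj, "parallel/mpi_rank"))
    print(json_get(main_obj, "local_mesh/0/Mesh/GeomDim"))
```

A plugin written this way needs to know only the path names, not how the main object was built.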

Page 26: MACSio Logging

For events, diagnostics, debugging, etc.

Single row- and column-formatted ASCII file
• Default #cols is 128 but settable on CL
• Default #rows per MPI rank is 64 but settable on CL

Each rank’s section acts like a circular queue
• Can lose messages, but always has the most recent N prior to a serious issue
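Because every row has the same width and every rank owns the same number of rows, a message's position in the file follows from arithmetic alone. The sketch below illustrates the idea with an in-memory buffer rather than a shared file; `RankLog` and its methods are hypothetical names, and padding each row with spaces plus a trailing newline is an assumption.

```python
class RankLog:
    """Fixed-size text log: each rank owns `rows` lines of `cols` characters."""

    def __init__(self, nranks: int, rows: int = 64, cols: int = 128) -> None:
        self.rows, self.cols = rows, cols
        self.buf = bytearray(b" " * (nranks * rows * cols))
        self.count = [0] * nranks  # messages written so far, per rank

    def offset(self, rank: int, msg_index: int) -> int:
        """Byte offset of a message; wraps within the rank's own section."""
        return (rank * self.rows + msg_index % self.rows) * self.cols

    def write(self, rank: int, text: str) -> None:
        # Truncate or pad to cols - 1 characters, then end the row with a newline.
        line = text[: self.cols - 1].ljust(self.cols - 1) + "\n"
        start = self.offset(rank, self.count[rank])
        self.buf[start : start + self.cols] = line.encode("ascii", "replace")
        self.count[rank] += 1

    def lines(self, rank: int) -> list:
        """Surviving messages for a rank, in slot order (not time order)."""
        base = rank * self.rows * self.cols
        used = min(self.count[rank], self.rows)
        return [
            self.buf[base + i * self.cols : base + (i + 1) * self.cols].decode().rstrip()
            for i in range(used)
        ]
```

With 3 rows per rank, writing five messages leaves only the last three in place; the two oldest are overwritten, which is the "can lose messages" trade-off noted on the slide.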

Page 27: Sample usage

    mpirun -np 4 macsio --interface silo
• defaults to write, 1 part per rank, 1 file per processor (MIF), 80000 byte requests, 10 dumps

    mpirun -np 4 macsio --parallel_file_mode MIF 3 --part_size 1M \
        --avg_num_parts 2.5 --interface hdf5
• each mesh part is 1Mb, 10 dumps to 3 MIF files using HDF5 interface

    mpirun -np 4 macsio --interface hdf5 \
        --parallel_file_mode SIF --plugin_args --compression gzip level=9

    mpirun -np 4 macsio --read_path foo.silo
• Have attempted to make Silo read general enough to read any code’s restart/plot

    mpirun -np 4 macsio --interface exodus --plugin_args \
        --use_large_model always

    mpirun -np 4 macsio --interface hdf5 --meta_type tabular \
        --meta_size 5M 20M

Page 28: MACSio Examples

4 processors, avg_num_parts = 2.5, tot=10 parts
Silo output
Showing spherical, sinusoid and random vars

Page 29: Initial look at Perlin noise

Page 30: Parts vs. Ranks

avg_num_parts=2.5
• Some ranks get 2
• Some ranks get 3

Minor bug:
• More realistic if middle row parts were swapped
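A fractional average such as 2.5 parts per rank implies that ranks receive unequal whole numbers of parts. The following sketch shows one simple policy that produces the "some get 2, some get 3" outcome; `parts_per_rank` is a hypothetical helper, and giving the extra parts to the lowest-numbered ranks is an assumption, since the slides do not state which ranks receive them.

```python
def parts_per_rank(nranks: int, avg_num_parts: float) -> list:
    """Split round(nranks * avg_num_parts) parts as evenly as possible."""
    total = round(nranks * avg_num_parts)
    base, extra = divmod(total, nranks)
    # Assumed policy: the first `extra` ranks each take one additional part.
    return [base + 1 if r < extra else base for r in range(nranks)]


if __name__ == "__main__":
    print(parts_per_rank(4, 2.5))
```

For 4 ranks and an average of 2.5 this yields 10 parts in total, matching the example on Page 28.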

Page 31: Preliminary runs on Vulcan

Page 32: A note about SI Prefixes

How many bytes in a “Kilo”-byte?

Does “Kilo” mean 1000 or 1024?
• 2.4% error for “Kilo”, ~10% error for “Tera”

International Electrotechnical Commission (IEC)
• “Decimal” prefixes (1000 bytes): Kb, Mb, Gb, Tb, Pb
• “Binary” prefixes (1024 bytes): Ki, Mi, Gi, Ti, Pi
  - Kibibyte, Mebibyte, Gibibyte, Tebibyte, Pebibyte

MACSio will use either (default is Binary)
• --units_prefix_system CL argument
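The quoted error figures follow from comparing powers of 1024 with powers of 1000. A short sketch of that arithmetic, with `prefix_bytes` and `binary_excess_percent` as illustrative helper names:

```python
PREFIXES = ["Kilo", "Mega", "Giga", "Tera", "Peta"]


def prefix_bytes(power: int, binary: bool = True) -> int:
    """Bytes denoted by the prefix at 1-based `power` (1 = Kilo/Kibi)."""
    return (1024 if binary else 1000) ** power


def binary_excess_percent(power: int) -> float:
    """How much larger the binary prefix is than the decimal one, in percent."""
    return (prefix_bytes(power, True) / prefix_bytes(power, False) - 1.0) * 100.0


if __name__ == "__main__":
    for power, name in enumerate(PREFIXES, start=1):
        print(f"{name}: {binary_excess_percent(power):.1f}%")
```

The gap compounds with each prefix step, which is why a bandwidth reported in "Mi/sec" and one reported in decimal megabytes per second are not directly comparable.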

Page 33: Preliminary tests on vulcan

Figure: Bandwidth as a function of request size, vulcan (1 core per node to 1K nodes in pdebug, then 2/node, 4/node, etc.), Silo, writes, MIF.
• X axis: Per task request size in 1000's of bytes (ticks 0 to 4500000)
• Y axis: Aggregate Bandwidth in Mi/sec (ticks 0 to 2500)
• Series: 256-16, 512-32, 1024-32, 2048-64, 4096-64, 8192-128

Page 34: Same data, scaling plot

Figure: Preliminary Performance Gathering with MACSio on Vulcan, writes, MIF, Silo.
• X axis: # of processors (ticks 0 to 9000)
• Y axis: Aggregate Bandwidth (Mi/sec) (ticks 0 to 2500)
• Series (request size): 400 bytes, 4000 bytes, 40000 bytes, 80000 bytes, 160000 bytes, 320000 bytes, 640000 bytes, 1000000 bytes, 4000000 bytes

Page 35: HDF5 4-way MIF (Sierra, No compression)

Figure: HDF5 4-way MIF (Sierra, No compression).
• X axis ticks: 0 to 350000
• Y axis ticks: 0 to 2000
• Series: 12, 24, 48, 96, 192

Page 36: HDF5 4-way MIF (Sierra, gzip level=9)

Figure: HDF5 4-way MIF (Sierra, gzip level=9).
• X axis: Request Size (bytes) (ticks 0 to 350000)
• Y axis: Bandwidth Mi/sec (ticks 0 to 250)
• Series: 12 procs, 24 procs, 48 procs, 96 procs

Page 37: HDF5 SIF on Sierra

Figure: HDF5 SIF (Sierra).
• X axis ticks: 0 to 350000
• Y axis ticks: 0 to 90
• Series: 12 proc, 24 proc

Page 38: Initial Performance Data Collection Missteps

srun -n and -N argument confusion

Silo BG-optimized VFD not getting used
• default setting for “bgp” and “bgl” but no “bgq”

Not going to large enough request sizes

Not getting enough resources for largest runs

What is a better measure for aggregate BW?
• Sum of each BW observed at each task?
• Total bytes / (Last Finisher’s Time - First Starter’s Time)
  - This is generally 10-20% lower
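The two candidate measures of aggregate bandwidth can be compared directly. The sketch below uses made-up timings purely for illustration; the function names and the `tasks` data are hypothetical, not measurements from these runs.

```python
def sum_of_task_bandwidths(tasks):
    """Sum of bytes / elapsed time, computed independently per task."""
    return sum(nbytes / (end - start) for nbytes, start, end in tasks)


def wall_clock_bandwidth(tasks):
    """Total bytes / (last finisher's time - first starter's time)."""
    total = sum(nbytes for nbytes, _, _ in tasks)
    first_start = min(start for _, start, _ in tasks)
    last_finish = max(end for _, _, end in tasks)
    return total / (last_finish - first_start)


# Hypothetical timings: (bytes written, start time, end time) per task.
tasks = [(80000, 0.00, 1.00), (80000, 0.05, 1.10), (80000, 0.10, 1.25)]

if __name__ == "__main__":
    print(sum_of_task_bandwidths(tasks))
    print(wall_clock_bandwidth(tasks))
```

No task's elapsed time can exceed the overall span from first start to last finish, so the wall-clock measure can never exceed the per-task sum; staggered starts and stragglers determine how far below it falls.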

Page 39: Documentation

Page 40: About MACSio

Currently all C and uses GMake

MACSio main + utils: ~4500 lines (including doc)

MACSio plugins
• Silo: ~900 lines
• HDF5: ~600 lines
• Exodus: ~600 lines
• Raw-Posix: ~350 lines
  - It’s currently called “miftmpl”, for MIF Template

Page 41: MACSio Availability & Next Steps

Currently on LLNL CZ Git/Stash
• Can give permission to anyone with LLNL CZ Token

BSD Open Source Release in progress
• Expected BSD release end of June
• Put up link on codesign.llnl.gov
• Maybe “mirror” on GitHub or (suggestions)

Volunteers for new plugins? Users?
• Contact: [email protected]

Page 42: Next Steps

Short Term:
• Mark to continue support/development but taper down

Medium Term:
• Shift responsibilities to a dedicated proxy app team

Long Term:
• Evolve to a true community open source project

Page 43: Short & Medium Term Priorities

Task | Difficulty | Estimated Completion
Remaining documentation | low | Q3-15
Handle amorphous metadata in all plugins | low | Q3-15
BSD Release and support new users | low | Q3-15
Finish Filesystem ops functionality | med | Q4-15
Fix node duplication | low | Q4-15
Checksum validation (on read) | med | Q4-15
Add “trickle” dump class | med | Q1-16
Variables on only some mesh parts | med | Q1-16
Add at least one bi-modal plugin (ITAPS) | low | Q2-16
Mesh subsets (nodesets, facesets, etc) | med | Q2-16
Integrate Caliper (replace timing/log) | med | Q2-16
Variable exprs on command-line | med | Q2-16
Additional mesh types, variables, materials | med | Q3-16

Page 44: Longer Term Thoughts

Task | Difficulty | Estimated Time
Sanity check MACSio vs. Code via Darshan | hi | Q3-16
Controllable task Mappings (files, I/O nodes) | hi | Q4-16
Per-code CL arg-sets db | hi | Q4-16
Performance db stand up and population | hi | Q1-17
Asynchronous I/O (need a lib capable) w/Compute | hi | Q3-17

Page 45: L2 Milestone Description

This milestone will fill a gap in the DOE co-design space by developing a proxy application on which various parallel I/O methods and libraries can be easily tested and studied for scalability, performance impacts of software abstractions, impact of techniques such as compression or checksumming, and the use of burst buffers. The core of the application will be a common system for generating collections of data structures (e.g. arrays [float|double|int], strings, key/value pairs) that in aggregate closely mimic the checkpoint data of real applications. Generation of the data will be user-definable through fully parameterized attributes for the types, quantity, lengths, and randomness of data. A plug-in architecture will be designed to allow users to study different I/O libraries (e.g. HDF5 vs netCDF), parallel I/O techniques (e.g. file-per-processor vs collective), and support for burst buffers (e.g. SCR).

Page 46: L2 Milestone Completion Criteria

A functional proxy app with abstractions to support at least two I/O libraries commonly used at the ASC labs is demonstrated, and initial feedback sought from the broader community.
• Silo (r/w), HDF5 (w), Exodus (w), Raw Posix (r/w)

Full documentation on how the proxy app can be extended to support new plug-ins.
• Doxygenated sources and examples + design docs

A report (white paper or presentation) is delivered describing the proxy app, and reporting on performance results gathered through its use on at least two platforms (e.g. TLCC2 and Sequoia with Lustre).
• Design Doc + This presentation

Page 47: Thanks

Rob Neely and Chris Clouse
• For funding me to work on this

Eric Brugger and Cyrus Harrison
• For allowing me time away from my other responsibilities