
Parallel and Grid I/O Infrastructure

Page 1: Parallel and Grid I/O Infrastructure

1

Parallel and Grid I/O Infrastructure

Rob Ross, Argonne National Lab

Parallel Disk Access and Grid I/O (P4)

SDM All Hands Meeting
March 26, 2002

Page 2: Parallel and Grid I/O Infrastructure

2

Participants

Argonne National Laboratory

- Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan

Northwestern University

- Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li

Collaborators

- Lawrence Livermore National Laboratory
  - Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow

- Application groups

Page 3: Parallel and Grid I/O Infrastructure

3

Focus Areas in Project

Parallel I/O on clusters
- Parallel Virtual File System (PVFS)
- MPI-IO hints
- ROMIO MPI-IO implementation

Grid I/O
- Linking PVFS and ROMIO with Grid I/O components

Application interfaces
- NetCDF and HDF5

Everything is interconnected!

Wei-keng Liao will drill down into specific tasks

Page 4: Parallel and Grid I/O Infrastructure

4

Parallel Virtual File System

Lead developer R. Ross (ANL)
- R. Latham (ANL), developer
- A. Ching, K. Coloma (NWU), collaborators

Open source, scalable parallel file system
- Project began in the mid '90s at Clemson University
- Now a collaboration between Clemson and ANL

Successes
- In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, …)
- 100+ unique downloads/month
- 160+ users on mailing list, 90+ on developers list
- Multiple gigabyte/second performance demonstrated

Page 5: Parallel and Grid I/O Infrastructure

5

Keeping PVFS Relevant: PVFS2

Scaling to thousands of clients and hundreds of servers requires some design changes
- Distributed metadata
- New storage formats
- Improved fault tolerance

New technology, new features
- High-performance networking (e.g. InfiniBand, VIA)
- Application metadata

New design and implementation warranted (PVFS2)

Page 6: Parallel and Grid I/O Infrastructure

6

PVFS1, PVFS2, and SDM

Maintaining PVFS1 as a resource to the community
- Providing support, bug fixes
- Encouraging use by application groups
- Adding functionality to improve performance (e.g. tiled display)

Implementing next-generation parallel file system
- Basic infrastructure for future PFS work
- New physical distributions (e.g. chunking)
- Application metadata storage

Ensuring that a working parallel file system will continue to be available on clusters as they scale

Page 7: Parallel and Grid I/O Infrastructure

7

Data Staging for Tiled Display

Contact: Joe Insley (ANL)

Commodity components
- Projectors, PCs
- Provide very high resolution visualization

Staging application preprocesses “frames” into a tile stream for each “visualization node”
- Uses MPI-IO to access data from PVFS file system
- Streams of tiles are merged into movie files on visualization nodes
- End goal is to display frames directly from PVFS
- Enhancing PVFS and ROMIO to improve performance

Page 8: Parallel and Grid I/O Infrastructure

8

Example Tile Layout

3x2 display, 6 readers
Frame size is 2532x1408 pixels
Tile size is 1024x768 pixels (overlapped; per-seam overlap estimated below)
Movies broken into frames, with each frame stored in its own file in PVFS
Readers pull data from PVFS and send to display
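
A quick sanity check of the overlap implied by these numbers, assuming the overlap is spread uniformly across the seams (an assumption; the actual layout may differ):

    /* Back-of-the-envelope check of the overlapped 3x2 layout described above.
     * Assumes the overlap is split evenly across the seams in each direction. */
    #include <stdio.h>

    int main(void)
    {
        const int frame_w = 2532, frame_h = 1408;   /* frame size in pixels */
        const int tile_w  = 1024, tile_h  = 768;    /* tile size in pixels  */
        const int cols = 3, rows = 2;               /* 3x2 display          */

        /* pixels covered by tiles minus frame size = total overlap,
           divided by the number of seams in each direction */
        int overlap_x = (cols * tile_w - frame_w) / (cols - 1);  /* 270 px */
        int overlap_y = (rows * tile_h - frame_h) / (rows - 1);  /* 128 px */

        printf("per-seam overlap: %d px horizontally, %d px vertically\n",
               overlap_x, overlap_y);
        return 0;
    }

Under that assumption each pair of adjacent columns overlaps by about 270 pixels and the two rows overlap by 128 pixels.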

Page 9: Parallel and Grid I/O Infrastructure

9

Tested access patterns

Subtile
- Each reader grabs a piece of a tile
- Small noncontiguous accesses
- Lots of accesses for a frame

Tile
- Each reader grabs a whole tile
- Larger noncontiguous accesses
- Six accesses for a frame

Reading individual pieces is simply too slow

Page 10: Parallel and Grid I/O Infrastructure

10

Noncontiguous Access in ROMIO

ROMIO performs “data sieving” to cut down the number of I/O operations

Uses large reads that grab multiple noncontiguous pieces

Example: to read tile 1, one large contiguous read covers all of the tile's noncontiguous pieces and the unwanted bytes are discarded (diagram not reproduced; a code sketch follows)
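
A minimal sketch (not taken from the talk) of the kind of noncontiguous request involved: an MPI-IO file view built from a subarray datatype describes one tile of a frame, and an independent read of that view is exactly the case ROMIO services with data sieving. The file name, pixel size, and tile origin are illustrative assumptions.

    /* One reader pulling a single 1024x768 tile out of a 2532x1408 frame.
     * ROMIO turns this noncontiguous request into one large contiguous read
     * plus extraction of the wanted pieces (data sieving). */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int pixel = 3;                       /* assume 3 bytes per pixel */
        int frame[2] = {1408, 2532 * pixel};       /* rows x row-bytes of a frame */
        int tile[2]  = {768, 1024 * pixel};        /* one tile */
        int start[2] = {0, 0};                     /* tile origin (illustrative) */

        MPI_Datatype filetype;
        MPI_Type_create_subarray(2, frame, tile, start, MPI_ORDER_C,
                                 MPI_BYTE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, "frame0000.rgb",   /* hypothetical file name */
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

        char *buf = malloc((size_t)tile[0] * tile[1]);
        MPI_File_read(fh, buf, tile[0] * tile[1], MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }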

Page 11: Parallel and Grid I/O Infrastructure

11

Noncontiguous Access in PVFS

ROMIO data sieving
- Works for all file systems (just uses contiguous reads)
- Reads extra data (three times the desired amount)

Noncontiguous access primitive allows requesting just the desired bytes (A. Ching, NWU)

Support in ROMIO allows transparent use of the new optimization (K. Coloma, NWU)

PVFS and ROMIO support implemented

[Chart: Normalized Read Performance (0 to 1) for Subtile and Tile access patterns, comparing List I/O and Data Sieving]

Page 12: Parallel and Grid I/O Infrastructure

12

Metadata in File Systems

Associative arrays of information related to a file

Seen in other file systems (MacOS, BeOS, ReiserFS)

Some potential uses:
- Ancillary data (from applications)
  - Derived values
  - Thumbnail images
  - Execution parameters
- I/O library metadata
  - Block layout information
  - Attributes on variables
  - Attributes of dataset as a whole
  - Headers
    - Keeps header out of data stream
    - Eliminates need for alignment in libraries

Page 13: Parallel and Grid I/O Infrastructure

13

Metadata and PVFS2 Status

Prototype metadata storage for PVFS2 implemented

- R. Ross (ANL)

- Uses Berkeley DB for storage of keyword/value pairs

- Need to investigate how to interface to MPI-IO

Other components of PVFS2 coming along
- Networking in testing (P. Carns, Clemson)
- Client-side API under development (Clemson)

PVFS2 beta early fourth quarter?
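
For illustration only (this is not the PVFS2 code), storing a keyword/value pair with Berkeley DB might look like the following. The function name store_attr is made up, and the DB->open signature shown assumes Berkeley DB 4.1 or later, where it takes a transaction argument.

    /* Store one keyword/value metadata pair in a Berkeley DB btree database. */
    #include <db.h>
    #include <string.h>

    int store_attr(const char *dbfile, const char *keyword, const char *value)
    {
        DB *dbp;
        DBT key, data;
        int ret;

        if ((ret = db_create(&dbp, NULL, 0)) != 0)
            return ret;
        /* NULL transaction handle; DB_CREATE makes the database if needed */
        if ((ret = dbp->open(dbp, NULL, dbfile, NULL, DB_BTREE, DB_CREATE, 0664)) != 0) {
            dbp->close(dbp, 0);
            return ret;
        }

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *)keyword;
        key.size  = strlen(keyword) + 1;
        data.data = (void *)value;
        data.size = strlen(value) + 1;

        ret = dbp->put(dbp, NULL, &key, &data, 0);
        dbp->close(dbp, 0);
        return ret;
    }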

Page 14: Parallel and Grid I/O Infrastructure

14

ROMIO MPI-IO Implementation

Written by R. Thakur (ANL)

- R. Ross and R. Latham (ANL), developers

- K. Coloma (NWU), collaborator

Implementation of the MPI-2 I/O specification (basic usage sketched below)

- Operates on wide variety of platforms

- Abstract Device Interface for I/O (ADIO) aids in porting to new file systems

Successes

- Adopted by industry (e.g. Compaq, HP, SGI)

- Used at ASCI sites (e.g. LANL Blue Mountain)
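
For reference (not from the slides), typical MPI-2 I/O usage of the kind ROMIO implements; the file name and data layout are illustrative.

    /* Each process writes its own block of a shared file with a collective call. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf[1024];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 1024; i++) buf[i] = rank;

        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* collective write: rank r owns the r-th 4 KB block of the file */
        MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
        MPI_File_write_at_all(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }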

Page 15: Parallel and Grid I/O Infrastructure

15

ROMIO Current Directions

Support for PVFS noncontiguous requests
- K. Coloma (NWU)

Hints - key to efficient use of HW & SW components
- Collective I/O
  - Aggregation (synergy)
- Performance portability
  - Controlling ROMIO optimizations
- Access patterns
- Grid I/O

Scalability
- Parallel I/O benchmarking

Page 16: Parallel and Grid I/O Infrastructure

16

ROMIO Aggregation Hints

Part of ASCI Software Pathforward project

- Contact: Gary Grider (LANL)

Implementation by R. Ross, R. Latham (ANL)

Hints control which processes do I/O in collectives

Examples (see the hint sketch below):
- All processes on same node as attached storage
- One process per host

Additionally limit number of processes that open the file
- Good for systems without a shared FS (e.g. O2K clusters)
- More scalable
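
A sketch of how such hints are passed through an MPI_Info object. The hint keys shown (cb_nodes, cb_config_list, romio_no_indep_rw) are ROMIO hint names, but the particular values and host names here are assumptions for illustration.

    /* Open a file with aggregation hints: "cb_config_list" names the hosts
     * that act as aggregators, "cb_nodes" limits how many buffer collective
     * I/O, and "romio_no_indep_rw" lets ROMIO have only the aggregators
     * actually open the file. */
    #include <mpi.h>

    MPI_File open_with_aggregation(MPI_Comm comm, const char *path)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_config_list", "ionode1:*,ionode2:*"); /* hypothetical hosts */
        MPI_Info_set(info, "cb_nodes", "2");
        MPI_Info_set(info, "romio_no_indep_rw", "true");

        MPI_File_open(comm, (char *)path,
                      MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }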

Page 17: Parallel and Grid I/O Infrastructure

17

Aggregation Example

Cluster of SMPs
Only one SMP box has a connection to the disks
Data is aggregated to processes on the single box
Processes on that box perform I/O on behalf of the others

Page 18: Parallel and Grid I/O Infrastructure

18

Optimization Hints

MPI-IO calls should be chosen to best describe the I/O taking place

- Use of file views

- Collective calls for inherently collective operations

Unfortunately, sometimes choosing the “right” calls can result in lower performance

Allow application programmers to tune ROMIO with hints rather than using different MPI-IO calls

Avoid the misapplication of optimizations (aggregation, data sieving)

Page 19: Parallel and Grid I/O Infrastructure

19

Optimization Problems

ROMIO checks for applicability of two-phase optimization when collective I/O is used

With tiled display application using subtile access, this optimization is never used

Checking for applicability requires communication between processes

Results in 33% drop in throughput (on test system)

A hint that tells ROMIO not to apply the optimization can avoid this without changes to the rest of the application
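
Illustratively, such a hint can be supplied at open time; romio_cb_read and romio_cb_write are ROMIO's hint names for controlling collective buffering, though whether these exact settings reproduce the case above is an assumption.

    /* Switch off two-phase (collective buffering) for a pattern, like subtile
     * reads, where the applicability check itself costs more than it saves. */
    #include <mpi.h>

    void open_tiles_no_two_phase(MPI_Comm comm, const char *path, MPI_File *fh)
    {
        MPI_Info info;
        MPI_Info_create(&info);

        MPI_Info_set(info, "romio_cb_read",  "disable");  /* no two-phase on reads  */
        MPI_Info_set(info, "romio_cb_write", "disable");  /* no two-phase on writes */

        MPI_File_open(comm, (char *)path, MPI_MODE_RDONLY, info, fh);
        MPI_Info_free(&info);
    }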

Page 20: Parallel and Grid I/O Infrastructure

20

Access Pattern Hints

Collaboration between ANL and LLNL (and growing)

Examining how access pattern information can be passed to MPI-IO interface, through to underlying file system

Used as input to optimizations in MPI-IO layer
Used as input to optimizations in FS layer as well:

- Prefetching

- Caching

- Writeback

Page 21: Parallel and Grid I/O Infrastructure

21

Status of Hints

Aggregation control finished

Optimization hints
- Collectives, data sieving read finished
- Data sieving write control in progress
- PVFS noncontiguous I/O control in progress

Access pattern hints
- Exchanging log files, formats
- Getting up to speed on respective tools

Page 22: Parallel and Grid I/O Infrastructure

22

Parallel I/O Benchmarking

No common parallel I/O benchmarks

New effort (consortium) to:

- Define some terminology

- Define test methodology

- Collect tests

Goal: provide a meaningful test suite with consistent measurement techniques

Interested parties at numerous sites (and growing)

- LLNL, Sandia, UIUC, ANL, UCAR, Clemson

In infancy…

Page 23: Parallel and Grid I/O Infrastructure

23

Grid I/O

Looking at ways to connect our I/O work with components and APIs used in the Grid

- New ways of getting data in and out of PVFS

- Using MPI-IO to access data in the Grid

- Alternative mechanisms for transporting data across the Grid (synergy)

Working towards more seamless integration of the tools used in the Grid and those used on clusters and in parallel applications (specifically MPI applications)

Facilitate moving between Grid and Cluster worlds

Page 24: Parallel and Grid I/O Infrastructure

24

Local Access to GridFTP Data

Grid I/O contact: B. Allcock (ANL)

GridFTP striped server provides a high-throughput mechanism for moving data across the Grid

Relies on a proprietary storage format on the striped servers
- Must manage metadata on stripe location
- Data stored on servers must be read back from servers
- No alternative/more direct way to access local data
- Next version assumes a shared file system underneath

Page 25: Parallel and Grid I/O Infrastructure

25

GridFTP Striped Servers

Remote applications connect to multiple striped servers to quickly transfer data over Grid

Multiple TCP streams better utilize WAN network

Local processes would need to use same mechanism to get to data on striped servers

Page 26: Parallel and Grid I/O Infrastructure

26

PVFS under GridFTP

With PVFS underneath, GridFTP servers would store data on PVFS I/O servers

Stripe information stored on PVFS metadata server

Page 27: Parallel and Grid I/O Infrastructure

27

Local Data Access

Application tasks that are part of a local parallel job could access data directly off PVFS file system

Output from application could be retrieved remotely via GridFTP

Page 28: Parallel and Grid I/O Infrastructure

28

MPI-IO Access to GridFTP

Applications such as the tiled display reader want remote access to GridFTP data

Access through MPI-IO would allow this with no code changes

ROMIO ADIO interface provides the infrastructure necessary to do this

MPI-IO hints provide means for specifying number of stripes, transfer sizes, etc.

Page 29: Parallel and Grid I/O Infrastructure

29

WAN File Transfer Mechanism

B. Gropp (ANL), P. Dickens (IIT)

Applications

- PPM and COMMAS (Paul Woodward, UMN)

Alternative mechanism for moving data across Grid using UDP

Focuses on requirements for file movement (a receiver-side sketch follows the list):

- All data must arrive at destination

- Ordering doesn’t matter

- Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer
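
A conceptual sketch (not the ANL/IIT implementation) of the receiver side implied by these requirements: blocks are accepted in any order, a bitmap tracks what has arrived, and an acknowledgement is sent after a fixed number of packets so the sender can retransmit only what is missing. All sizes and names are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    #define NBLOCKS      4096        /* illustrative transfer size in blocks */
    #define ACK_INTERVAL 500         /* packets received per acknowledgement */

    static uint8_t received[NBLOCKS / 8];      /* 1 bit per block */

    static void mark_received(uint32_t b) { received[b / 8] |= (uint8_t)(1u << (b % 8)); }
    static int  have_block(uint32_t b)    { return received[b / 8] &  (1u << (b % 8)); }

    /* Called once per arriving UDP packet carrying (block number, payload).
     * Returns 1 when it is time to send an ack listing still-missing blocks. */
    int on_packet(uint32_t block, const void *payload, size_t len,
                  void (*store_block)(uint32_t, const void *, size_t))
    {
        static unsigned since_last_ack = 0;

        if (block < NBLOCKS && !have_block(block)) {
            store_block(block, payload, len);   /* write payload at block's offset */
            mark_received(block);               /* ordering does not matter */
        }
        return ++since_last_ack % ACK_INTERVAL == 0;
    }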

Page 30: Parallel and Grid I/O Infrastructure

30

WAN File Transfer Performance

[Chart: Performance of User-Level Protocol on Short and Long Haul Networks - percentage of maximum bandwidth obtained vs. number of packets received before sending an acknowledgement packet (50 to 2500), with long haul and short haul curves]

[Chart: TCP over Short and Long Haul - percentage of maximum available bandwidth vs. chunk size (5 KB to 40 MB)]

Comparing TCP utilization to the WAN FT technique

See 10-12% utilization with a single TCP stream (8 streams needed to approach maximum utilization)

With WAN FT, obtain near 90% utilization and more uniform performance

Page 31: Parallel and Grid I/O Infrastructure

31

Grid I/O Status

Planning with Grid I/O group

- Matching up components

- Identifying useful hints

Globus FTP client library is available

2nd generation striped server being implemented

XIO interface prototyped
- Hooks for alternative local file systems
- Obvious match for PVFS under GridFTP

Page 32: Parallel and Grid I/O Infrastructure

32

NetCDF

Applications in climate and fusion

- PCM - John Drake (ORNL)
- Weather Research and Forecast Model (WRF) - John Michalakes (NCAR)
- Center for Extended Magnetohydrodynamic Modeling - Steve Jardin (PPPL)
- Plasma Microturbulence Project - Bill Nevins (LLNL)

Maintained by Unidata Program Center

API and file format for storing multidimensional datasets and associated metadata (in a single file)

Page 33: Parallel and Grid I/O Infrastructure

33

NetCDF Interface

Strong points:

- It’s a standard!

- I/O routines allow for subarray and strided access with single calls

- Access is clearly split into two modes
  - Defining the datasets (define mode)
  - Accessing and/or modifying the datasets (data mode)

Weakness: no parallel writes, limited parallel read capability

This forces applications to ship data to a single node for writing, severely limiting usability in I/O-intensive applications
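
To make the define/data split concrete, a small example of the serial netCDF C interface (not from the slides); the dimension and variable names are made up.

    /* Define mode sets up dimensions and a variable; data mode writes a
     * subarray with a single call. */
    #include <netcdf.h>

    int write_block(const char *path, const float *block)
    {
        int ncid, dims[2], varid;

        nc_create(path, NC_CLOBBER, &ncid);          /* enter define mode */
        nc_def_dim(ncid, "y", 1408, &dims[0]);
        nc_def_dim(ncid, "x", 2532, &dims[1]);
        nc_def_var(ncid, "frame", NC_FLOAT, 2, dims, &varid);
        nc_enddef(ncid);                             /* switch to data mode */

        /* write a 768x1024 subarray starting at (0,0) with one call */
        size_t start[2] = {0, 0};
        size_t count[2] = {768, 1024};
        nc_put_vara_float(ncid, varid, start, count, block);

        return nc_close(ncid);
    }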

Page 34: Parallel and Grid I/O Infrastructure

34

Parallel NetCDF

Rich I/O routines and explicit define/data modes provide a good foundation

- Existing applications are already describing noncontiguous regions

- Modes allow for a synchronization point when file layout changes

Missing:

- Semantics for parallel access

- Collective routines

- Option for using MPI datatypes

Implement in terms of MPI-IO operations

Retain file format for interoperability
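
As a sketch of where this leads (names and signatures are assumptions about the interface under design, not the shipped prototype), a collective write might look like the serial interface with an MPI communicator at create time and an _all suffix marking collective data-mode calls:

    #include <mpi.h>
    #include <pnetcdf.h>   /* assumed header for the parallel netCDF interface */

    void write_frame(MPI_Comm comm, const char *path, const float *mydata,
                     MPI_Offset start[2], MPI_Offset count[2])
    {
        int ncid, dims[2], varid;

        /* collective create: all processes open the same file */
        ncmpi_create(comm, path, NC_CLOBBER, MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "y", 1408, &dims[0]);
        ncmpi_def_dim(ncid, "x", 2532, &dims[1]);
        ncmpi_def_var(ncid, "frame", NC_FLOAT, 2, dims, &varid);
        ncmpi_enddef(ncid);

        /* each process writes its own subarray; the _all suffix marks the
           call as collective, implemented on top of MPI-IO underneath */
        ncmpi_put_vara_float_all(ncid, varid, start, count, mydata);

        ncmpi_close(ncid);
    }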

Page 35: Parallel and Grid I/O Infrastructure

35

Parallel NetCDF Status

Design document created

- B. Gropp, R. Ross, and R. Thakur (ANL)

Prototype in progress

- J. Li (NWU)

Focus is on write functions first

- Biggest bottleneck for checkpointing applications

Read functions follow

Investigate alternative file formats in the future

- Address differences in access modes between writing and reading

Page 36: Parallel and Grid I/O Infrastructure

36

FLASH Astrophysics Code

Developed at ASCI Center at University of Chicago

- Contact: Mike Zingale

Adaptive mesh (AMR) code for simulating astrophysical thermonuclear flashes

Written in Fortran90, uses MPI for communication, HDF5 for checkpointing and visualization data

Scales to thousands of processors, runs for weeks, needs to checkpoint

At the time, I/O was a bottleneck (½ of runtime on 1024 processors)

Page 37: Parallel and Grid I/O Infrastructure

37

HDF5 Overhead Analysis

Instrumented FLASH I/O to log calls to H5Dwrite

[Timeline: time spent in H5Dwrite vs. MPI_File_write_at]

Page 38: Parallel and Grid I/O Infrastructure

38

HDF5 Hyperslab Operations

White region is hyperslab “gather” (from memory)
Cyan is “scatter” (to file)
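
For context, a minimal HDF5 hyperslab write of the kind being measured; the dataset handle, sizes, and offsets are illustrative, and the caller is assumed to supply a file dataspace large enough for the selection.

    /* Select a region in the file dataspace and hand H5Dwrite a matching
     * contiguous memory dataspace; HDF5 performs the gather/scatter. */
    #include <hdf5.h>

    herr_t write_interior(hid_t dset, hid_t filespace, const double *buf)
    {
        /* memory buffer is a contiguous 16x16x16 block */
        hsize_t mdims[3] = {16, 16, 16};
        hid_t memspace = H5Screate_simple(3, mdims, NULL);

        /* scatter it into the file at offset (1,1,1) of a larger variable */
        hsize_t start[3] = {1, 1, 1};
        hsize_t count[3] = {16, 16, 16};
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

        herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                                 H5P_DEFAULT, buf);
        H5Sclose(memspace);
        return status;
    }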

Page 39: Parallel and Grid I/O Infrastructure

39

Hand-Coded Packing

Packing time is in the black regions between bars
Nearly an order of magnitude improvement
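
A sketch (not the FLASH code) of what hand-coded packing means here: copy the interior of each block, skipping guard cells, into a contiguous buffer so the subsequent write sees contiguous memory. Block size and guard-cell width are assumptions.

    #include <string.h>

    #define NG   4                 /* guard cells on each side (assumed) */
    #define NX   (16 + 2*NG)       /* block extent including guard cells */
    #define NI   16                /* interior extent */

    void pack_interior(const double block[NX][NX][NX], double *packed)
    {
        size_t n = 0;
        for (int k = NG; k < NG + NI; k++)
            for (int j = NG; j < NG + NI; j++) {
                /* innermost dimension is contiguous: copy a whole row at once */
                memcpy(&packed[n], &block[k][j][NG], NI * sizeof(double));
                n += NI;
            }
    }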

Page 40: Parallel and Grid I/O Infrastructure

40

Wrap Up

Progress being made on multiple fronts

- ANL/NWU collaboration is strong

- Collaborations with other groups maturing

Balance of immediate payoff and medium-term infrastructure improvements

- Providing expertise to application groups

- Adding functionality targeted at specific applications

- Building core infrastructure to scale, ensure availability

Synergy with other projects

On to Wei-keng!