Parallel I/O Middleware Optimizations and Future Directions

SciDAC SDM Center All Hands Meeting, October 5-7, 2005
Northwestern University PIs: Alok Choudhary, Wei-keng Liao
Graduate Students: Jianwei Li, Avery Ching, Kenin Coloma
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham
Outline

• Progress and accomplishments (Wei-keng Liao)
  – Parallel netCDF
  – Client-side file caching in MPI-IO
  – Data-type I/O for non-contiguous file access in PVFS
• Future research directions (Alok Choudhary)
  – I/O middleware
  – Autonomic and active storage systems
Parallel NetCDF
• NetCDF defines:
  – A set of APIs for file access
  – A machine-independent file format
• Parallel netCDF work
  – New APIs for parallel access
  – Maintains the same file format
• Tasks
  – Built on top of MPI for portability and high performance
  – Support C and Fortran interfaces (see the sketch below)
  – Support external data representations
[Diagram: processes P0-P3 accessing the parallel file system through serial netCDF vs. through the parallel netCDF library]
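A minimal sketch of the high-level C interface referenced above, in which each process collectively writes its own block of rows of a shared 2-D variable. The file name, dimension sizes, and variable name are illustrative; error checking is omitted.

/* Sketch: each MPI process writes its block of rows of a global float
 * array through PnetCDF's high-level C API (illustrative names/sizes). */
#include <mpi.h>
#include <pnetcdf.h>

#define NROWS 16   /* rows written per process */
#define NCOLS 8

int main(int argc, char *argv[]) {
    int rank, nprocs, ncid, dimids[2], varid, i, j;
    MPI_Offset start[2], count[2];
    float buf[NROWS][NCOLS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (i = 0; i < NROWS; i++)
        for (j = 0; j < NCOLS; j++)
            buf[i][j] = (float)rank;          /* dummy data */

    /* collective define phase: same file format as serial netCDF */
    ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "rows", (MPI_Offset)NROWS * nprocs, &dimids[0]);
    ncmpi_def_dim(ncid, "cols", NCOLS, &dimids[1]);
    ncmpi_def_var(ncid, "data", NC_FLOAT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* each rank writes a disjoint block of rows, collectively */
    start[0] = (MPI_Offset)rank * NROWS;  start[1] = 0;
    count[0] = NROWS;                     count[1] = NCOLS;
    ncmpi_put_vara_float_all(ncid, varid, start, count, &buf[0][0]);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}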
PnetCDF Current Status

• Version 1.0.0 was released on July 27, 2005
• Supported platforms
  – Linux clusters, IBM SP, SGI Origin, Cray X, NEC SX
• Two sets of parallel APIs are complete
  – High-level APIs (mimicking the serial netCDF APIs)
  – Flexible APIs (extended to utilize MPI derived datatypes; see the sketch below)
• Fully supported in both C and Fortran
• Support for large files (> 4 GB)
• Test suites
  – Self-test codes ported from the Unidata netCDF package to validate against single-process results
  – Parallel test codes for both sets of APIs
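A hedged sketch of the difference between the two API sets: the flexible API accepts an MPI derived datatype describing the memory layout, so a non-contiguous user buffer can be written without packing. It reuses the ncid, varid, start, count, and buf from the sketch above, and the column datatype is purely illustrative.

/* Sketch: write one column of the local 2-D buffer (non-contiguous in
 * memory) to an NROWS x 1 region of the variable using the flexible API. */
MPI_Datatype col_type;

MPI_Type_vector(NROWS, 1, NCOLS, MPI_FLOAT, &col_type);   /* one column */
MPI_Type_commit(&col_type);

count[1] = 1;   /* NROWS x 1 region in the file */
ncmpi_put_vara_all(ncid, varid, start, count,
                   &buf[0][0], 1, col_type);               /* buffer described by col_type */

MPI_Type_free(&col_type);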
Illustrative PnetCDF Users

• FLASH – astrophysical thermonuclear application from the ASCI/Alliances Center at the University of Chicago
• ACTM – atmospheric chemical transport model, LLNL
• WRF-ROMS – regional ocean model system I/O module from the Scientific Data Technologies group, NCSA
• ASPECT – data understanding infrastructure, ORNL
• pVTK – parallel visualization toolkit, ORNL
• PETSc – portable, extensible toolkit for scientific computation, ANL
• PRISM – PRogram for Integrated Earth System Modeling, users from C&C Research Laboratories, NEC Europe Ltd.
• ESMF – Earth System Modeling Framework, National Center for Atmospheric Research
• More …
PnetCDF Future Work
• Non-blocking I/O APIs
• Performance improvements for data type conversion
  – Type conversion while packing non-contiguous buffers
• Extending PnetCDF for newer applications, e.g., data analysis and mining
• Collaboration with application users
File Caching in MPI-IO
[Diagram: I/O software stack – applications, Parallel netCDF, MPI-IO, PVFS, storage devices – with file caching addressed at the MPI-IO layer]
File Caching for Parallel Apps

• Why file caching?
  – Improves performance for repeated file access
  – Enables a write-behind strategy (see the sketch below)
    • Accumulates multiple small writes to better utilize network bandwidth
    • May balance the workload for irregular I/O patterns
    • Useful for checkpointing
  – Enables data pre-fetching
    • Useful for read-only applications (parallel data mining, visualization)
• Why not just use traditional caching strategies?
  – Each client caches independently, leading to cache incoherence
  – I/O servers are in charge of cache coherence control, a potential source of I/O serialization
  – Inadequate for parallel environments where application clients frequently read/write shared files
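A minimal sketch of the write-behind idea above: small sequential writes accumulate in a local buffer and reach the file system as a single larger request. The buffer size, struct, and function names are illustrative, not the actual MPI-IO caching implementation, and each individual write is assumed to fit in the buffer.

#include <string.h>
#include <mpi.h>

#define WB_SIZE (4 * 1024 * 1024)          /* illustrative buffer size */

struct wb_buf { char data[WB_SIZE]; MPI_Offset base; size_t used; };

/* flush the accumulated region as one large write */
static void wb_flush(MPI_File fh, struct wb_buf *w)
{
    if (w->used > 0)
        MPI_File_write_at(fh, w->base, w->data, (int)w->used,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    w->used = 0;
}

/* accumulate a small write; flush only when it cannot be appended */
static void wb_write(MPI_File fh, struct wb_buf *w, MPI_Offset off,
                     const void *buf, size_t len)
{
    if (w->used > 0 &&
        (off != w->base + (MPI_Offset)w->used || w->used + len > WB_SIZE))
        wb_flush(fh, w);
    if (w->used == 0)
        w->base = off;
    memcpy(w->data + w->used, buf, len);
    w->used += len;
}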
Caching Sub-system in MPI-IO
• Application-aware file caching
  – A user-level implementation in the MPI-IO library
  – MPI communicators define the subsets of processes operating on a shared file
  – Processes cooperate with each other to perform caching
  – Data cached in one client can be directly accessed by another
  – Moves cache coherence control from servers to clients
  – Distributed coherence control (less overhead)
• Supports both collective and independent I/O

[Diagram: client processors contribute local cache buffers in memory to a global cache pool and reach the I/O servers over the network interconnect]
Design

• Cache metadata
  – File-block based granularity
  – Cyclically stored across all processes
• Global cache pool
  – Comprises the local memory of all processes
  – A single copy of file data, to avoid coherence issues
• Two implementations:
  – Using an I/O thread (POSIX thread)
  – Using the MPI remote-memory-access (RMA) facility (see the sketch below)

[Diagram: the file is logically partitioned into blocks; each block's status entry is stored cyclically across processes P0-P3, and each process contributes local memory pages to the global cache pool]
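A hedged sketch of the RMA variant referenced above: the status entry for block b lives on process b mod nprocs, and a reader fetches it with a passive-target get. The block_status layout, window setup, and names are illustrative assumptions, not the actual implementation.

#include <mpi.h>

/* Illustrative metadata record; the real layout is not shown in the slides. */
struct block_status {
    int cached;   /* is the block currently held in some local cache pool? */
    int owner;    /* rank whose local memory holds the cached page          */
    int page;     /* page index within the owner's cache pool               */
};

/* Each process exposes its slice (blocks b with b % nprocs == rank)
 * of the distributed metadata through an MPI window. */
static struct block_status *my_meta;
static MPI_Win meta_win;

static void lookup_block(int block, int nprocs, struct block_status *out)
{
    int owner = block % nprocs;                        /* cyclic placement */
    MPI_Aint disp = (MPI_Aint)(block / nprocs) * sizeof(struct block_status);

    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, meta_win);
    MPI_Get(out, sizeof(*out), MPI_BYTE, owner, disp,
            sizeof(*out), MPI_BYTE, meta_win);
    MPI_Win_unlock(owner, meta_win);                   /* get completes here */
}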
Example Read Operation

[Diagram: a read first performs a metadata lookup for the requested file block across the distributed status entries (P0-P3) and locks the entry; if the block is already cached, the page is fetched directly from the owner's local memory, otherwise it is brought in from the file system; the entry is then unlocked]
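The same walk-through as a hedged code sketch. lock_block, unlock_block, fetch_remote_page, cache_locally, update_block_status, and BLOCK_SIZE are hypothetical helpers standing in for the distributed lock, RMA transfer, and metadata update steps; lookup_block is the sketch from the Design slide.

/* Illustrative read path, not the actual MPI-IO caching code. */
static void cached_read(MPI_File fh, int block, int nprocs, int rank, void *dst)
{
    struct block_status st;

    lock_block(block);                          /* hypothetical distributed lock  */
    lookup_block(block, nprocs, &st);           /* metadata lookup (sketch above) */

    if (st.cached) {
        /* already cached: fetch the page directly from the owner's pool */
        fetch_remote_page(st.owner, st.page, dst);
    } else {
        /* not yet cached: read from the parallel file system ... */
        MPI_File_read_at(fh, (MPI_Offset)block * BLOCK_SIZE, dst,
                         BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);
        /* ... keep a copy in the local cache pool and publish the status */
        cache_locally(block, dst);
        st.cached = 1;
        st.owner  = rank;
        update_block_status(block, nprocs, &st);
    }
    unlock_block(block);
}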
Future Work

• Data pre-fetching
  – Instructional (through MPI info) and non-instructional (based on sequential access)
• Collective write-behind for data check-pointing
• Stand-alone distributed lock sub-system
  – Using the MPI-2 remote-memory access facility
• Design new MPI file hints for caching
• Application I/O pattern study
  – Structured/unstructured AMR
Data-type I/O in PVFS
[Diagram: I/O software stack – applications, Parallel netCDF, MPI-IO, PVFS, storage devices – with data-type I/O addressed at the PVFS layer]
Non-contiguous I/O

• Four types
  – Contiguous both in memory and file
  – Contiguous in memory, non-contiguous in file
  – Non-contiguous in memory, contiguous in file
  – Non-contiguous both in memory and file
• Each segment is an I/O request of (offset, length) (see the sketch below)
[Diagram: memory and file layouts for each of the four access types]
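A hedged sketch of describing one of the cases above (contiguous in memory, non-contiguous in file) with standard MPI-IO: an MPI derived datatype set as the file view lets a single collective call cover every (offset, length) segment. The function and parameter names are illustrative.

#include <mpi.h>

/* Write nseg segments of seglen doubles, spaced `stride` doubles apart in
 * the file, from one contiguous memory buffer, in a single collective call. */
void strided_write(MPI_File fh, double *buf, int nseg, int seglen, int stride)
{
    MPI_Datatype filetype;

    MPI_Type_vector(nseg, seglen, stride, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* the file view maps the contiguous buffer onto the strided file layout */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, nseg * seglen, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
}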
Implementations

• POSIX I/O (sketched below)
  – One call per (offset, length)
  – Generates a large number of I/O requests
• Data sieving
  – A single (offset, length) covering multiple segments
  – Accesses unused data and introduces consistency control overhead
• List I/O
  – A single call handles multiple non-contiguous accesses
  – Passes multiple (offset, length) pairs across the network
[Diagram: with POSIX I/O the application process issues one I/O request per segment to the client-side file system; with list I/O a single list request travels through the client-side file system over the network to the server-side file system]
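A minimal sketch of the POSIX-style approach referenced above, which is exactly what inflates the request count: one pread per (offset, length) segment into a contiguous destination buffer. The function name and error handling are illustrative.

#include <unistd.h>
#include <sys/types.h>

/* Read nseg (offset, length) segments, one POSIX call per segment. */
ssize_t read_segments(int fd, char *buf,
                      const off_t *offsets, const size_t *lengths, int nseg)
{
    ssize_t total = 0;
    for (int i = 0; i < nseg; i++) {
        ssize_t n = pread(fd, buf + total, lengths[i], offsets[i]);
        if (n < 0)
            return -1;             /* give up on the first failed segment */
        total += n;
    }
    return total;                  /* list I/O would ship all segments at once */
}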
Data-type I/O

• A single request travels all the way to the servers
• Abandons the offset-length pair representation
  – Borrows the MPI datatype concept to describe non-contiguous access patterns (illustrated below)
  – New file system data types
  – New file system interfaces
• An implementation in PVFS
  – Both client and server sides
[Diagram: a single datatype I/O request flows from the application process through the PVFS client, over the network, to the PVFS server]
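A purely illustrative comparison of the two representations; these structs are hypothetical and do not reflect PVFS's actual datatype I/O interfaces. The point is that a regular strided pattern collapses to a constant-size description instead of a list that grows with the number of segments.

#include <sys/types.h>

/* Hypothetical compact, datatype-style description of a strided pattern:
 * the server expands it into individual accesses locally. */
struct dtio_vector_req {
    off_t  base;       /* starting file offset                       */
    size_t blocklen;   /* bytes per contiguous segment               */
    size_t stride;     /* distance between segment starts, in bytes  */
    int    count;      /* number of segments                         */
};

/* Hypothetical list-I/O request for the same pattern: `count` explicit
 * (offset, length) pairs must cross the network. */
struct listio_req {
    off_t  *offsets;
    size_t *lengths;
    int     count;
};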
Summary of Accomplishments
• High-level I/O
  – Parallel netCDF
• Low-level I/O
  – MPI-IO file caching
• Parallel file system
  – Data-type I/O in PVFS

[Diagram: Parallel netCDF / MPI-IO / PVFS stack]
Future Research
Typical Components in I/O Systems

• Based on many current applications
• High-level
  – E.g., NetCDF, HDF, ABC
  – Applications use these
• Mid-level
  – E.g., MPI-IO
  – Performance experience
• Low-level
  – E.g., file systems
  – Critical for the performance of the layers above
• More access information is lost as more components are used
[Diagram: compute nodes running the applications, Parallel netCDF/HDF5, MPI-IO, and the client-side file system, connected over a network to I/O servers; end-to-end performance is critical]
[Annotated I/O stack diagram (applications, Parallel netCDF/HDF5, MPI-IO, client-side file system, network, I/O servers). Capabilities noted across the layers:]

• Collectives and independents; I/O hints: access style (read_once, write_mostly, sequential, random, …), collective buffering, chunking, striping (see the sketch below)
• Open modes (O_RDONLY, O_WRONLY, O_SYNC), file status, locking, flushing, cache invalidation; machine dependent: data shipping, sparse access, double buffering
• Access based on file blocks or objects, scheduling, aggregation
• Read-ahead, write-behind, metadata management, file striping, security, redundancy
• Saving attributes along with data, external data types (byte alignment), data structures (flexible dimensionality), hierarchical data model
• Access patterns: shared files, individual files, data partitioning, check-pointing, data structures, inter-data relationships

[Further capabilities noted on the same diagram:]

• Application-aware caching, pre-fetching, file grouping, "vector of bytes", flexible caching control, object-based data alignment, memory-file layout mapping, more control over hardware, shared file descriptors
• Group locks, flexible locking control, scalable metadata management, zero-copying, QoS, shared file descriptors
• Active storage: data filtering, object-based/hierarchical storage management, indexing, mining, power management
• Caching, fault tolerance, read-ahead, write-behind, I/O load balance, wide-area, heterogeneous file system support, thread safety
• Graph-based data model
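A small sketch of passing the kind of access hints listed above through MPI-IO. "access_style", "cb_buffer_size", and "striping_factor" are standard/ROMIO hint keys; whether a given implementation honors them is file-system dependent, and the values here are illustrative.

#include <mpi.h>

/* Open a file with hints describing the expected access pattern. */
void open_with_hints(MPI_File *fh)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style", "write_mostly,sequential");
    MPI_Info_set(info, "cb_buffer_size", "16777216");   /* collective buffering */
    MPI_Info_set(info, "striping_factor", "8");         /* file striping width */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}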
Goal

Decouple "What" from "How", and be proactive.

• Current state: user burdened, ineffective interfaces, non-communicating layers

[Diagram: applications (App 1-4) sit above an I/O software optimization layer that understands access characteristics (streaming, small/large, regular/irregular, local/remote, configuration, s/w layer) and applies optimizations (caching, collective I/O, reorganization, load balance, fault tolerance) across FS, DM, datasets, and HSS, to deliver speed, BW, latency, and QoS]
Component Design for I/O

• Application-aware
  – Capture the application's file access information
  – Relationships between files, objects, and users
• Environment-aware
  – Network (reliability, security), storage devices (active disks)
• Context-aware
  – Binding data attributes to files, indexing for fast search
• High-performance I/O needs support from
  – Languages + compilers
  – I/O libraries
  – File systems
  – Storage devices
Component Interface Design

• Informative
  – Should deliver access/storage information top-down and bottom-up
• Flexibility
  – Should describe arbitrary data distributions in memory buffers, files, and storage devices
• Functionality
  – Asynchronous operations, read-ahead, write-behind, replication
  – Provides room for additional innovation
• Object-based I/O
  – For hardware control (I/O co-processors, active disks, object-based file systems, etc.)
Future Work in MPI-IO

• Investigate interface extensions
• Client-side caching sub-system
  – Implementations of various I/O strategies: buffering, pre-fetching, replication, migration
  – Adaptive caching mechanisms and algorithms for optimizing different access patterns
• Distributed mutual-exclusion locking sub-system
  – Shared resources, such as files and memory
  – Pipelined locking (overlap lock waiting time with I/O)
• Work with HDF5 and parallel netCDF
  – Design I/O strategies for metadata and data
    • Metadata: small, overlapping, repeated, strong consistency requirements
    • Array data: large, less frequently updated
Future Work in Parallel File Systems

• File caching (focus on parallel apps)
• File versioning
  – Alternative to file locking
  – Reliability and availability aspects
    • Guarantees atomicity in the presence of client or I/O system failure
    • Can enable efficient RAID-type schemes in parallel file systems (because of atomicity)
    • Dynamic rebalancing of I/O
• File list locks
  – Lock multiple regions in a single request
Active Storage System (reconfigurable system)

[Diagram: an ML310 host connected through a switch to ML310 boards 1-4 and to the external network]

• Xilinx XC2VP30 (Virtex-II Pro family)
  – 30,816 logic cells (3,424 CLBs)
  – 2 embedded PPC405 cores
  – 2,448 Kb BRAM (136 blocks of 18 Kb)
  – 136 dedicated 18x18 multiplier blocks
• Software:
  – Data mining
  – Encryption
  – Functions and runtime libraries
  – Linux micro-kernel
MineBench - data mining benchmark suite