Parallel I/O Middleware Optimizations and Future Directions

SciDAC SDM Center All Hands Meeting, October 5-7, 2005
Northwestern University PIs: Alok Choudhary, Wei-keng Liao
Graduate Students: Jianwei Li, Avery Ching, Kenin Coloma
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham
Outline

• Progress and accomplishments (Wei-keng Liao)
  – Parallel netCDF
  – Client-side file caching in MPI-IO
  – Data-type I/O for non-contiguous file access in PVFS
• Future research directions (Alok Choudhary)
  – I/O middleware
  – Autonomic and active storage systems
Parallel NetCDF
• NetCDF defines:
  – A set of APIs for file access
  – A machine-independent file format
• Parallel netCDF work
  – New APIs for parallel access
  – Maintains the same file format
• Tasks
  – Built on top of MPI for portability and high performance
  – Support C and Fortran interfaces (see the sketch below)
  – Support external data representations
[Diagram: processes P0-P3 accessing the parallel file system through serial netCDF vs. through the parallel netCDF library]
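A minimal sketch of the high-level C interface referenced above, in which each process collectively writes its own block of rows of a shared 2-D variable. The file name, dimension sizes, and variable name are illustrative; error checking is omitted.

/* Sketch: each MPI process writes its block of rows of a global float
 * array through PnetCDF's high-level C API (illustrative names/sizes). */
#include <mpi.h>
#include <pnetcdf.h>

#define NROWS 16   /* rows written per process */
#define NCOLS 8

int main(int argc, char *argv[]) {
    int rank, nprocs, ncid, dimids[2], varid, i, j;
    MPI_Offset start[2], count[2];
    float buf[NROWS][NCOLS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (i = 0; i < NROWS; i++)
        for (j = 0; j < NCOLS; j++)
            buf[i][j] = (float)rank;          /* dummy data */

    /* collective define phase: same file format as serial netCDF */
    ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "rows", (MPI_Offset)NROWS * nprocs, &dimids[0]);
    ncmpi_def_dim(ncid, "cols", NCOLS, &dimids[1]);
    ncmpi_def_var(ncid, "data", NC_FLOAT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* each rank writes a disjoint block of rows, collectively */
    start[0] = (MPI_Offset)rank * NROWS;  start[1] = 0;
    count[0] = NROWS;                     count[1] = NCOLS;
    ncmpi_put_vara_float_all(ncid, varid, start, count, &buf[0][0]);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}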
PnetCDF Current Status

• Version 1.0.0 was released on July 27, 2005
• Supported platforms
  – Linux clusters, IBM SP, SGI Origin, Cray X, NEC SX
• Two sets of parallel APIs are complete
  – High-level APIs (mimicking the serial netCDF APIs)
  – Flexible APIs (extended to utilize MPI derived datatypes; see the sketch below)
• Fully supported in both C and Fortran
• Support for large files (> 4 GB)
• Test suites
  – Self-test codes ported from the Unidata netCDF package to validate against single-process results
  – Parallel test codes for both sets of APIs
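A hedged sketch of the difference between the two API sets: the flexible API accepts an MPI derived datatype describing the memory layout, so a non-contiguous user buffer can be written without packing. It reuses the ncid, varid, start, count, and buf from the sketch above, and the column datatype is purely illustrative.

/* Sketch: write one column of the local 2-D buffer (non-contiguous in
 * memory) to an NROWS x 1 region of the variable using the flexible API. */
MPI_Datatype col_type;

MPI_Type_vector(NROWS, 1, NCOLS, MPI_FLOAT, &col_type);   /* one column */
MPI_Type_commit(&col_type);

count[1] = 1;   /* NROWS x 1 region in the file */
ncmpi_put_vara_all(ncid, varid, start, count,
                   &buf[0][0], 1, col_type);               /* buffer described by col_type */

MPI_Type_free(&col_type);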
Illustrative PnetCDF Users

• FLASH – astrophysical thermonuclear application from the ASCI/Alliances Center at the University of Chicago
• ACTM – atmospheric chemical transport model, LLNL
• WRF-ROMS – regional ocean model system I/O module from the Scientific Data Technologies group, NCSA
• ASPECT – data understanding infrastructure, ORNL
• pVTK – parallel visualization toolkit, ORNL
• PETSc – portable, extensible toolkit for scientific computation, ANL
• PRISM – PRogram for Integrated Earth System Modeling, users from C&C Research Laboratories, NEC Europe Ltd.
• ESMF – Earth System Modeling Framework, National Center for Atmospheric Research
• More …
PnetCDF Future Work
• Non-blocking I/O APIs
• Performance improvements for data type conversion
  – Type conversion while packing non-contiguous buffers
• Extending PnetCDF for newer applications, e.g., data analysis and mining
• Collaboration with application users
File Caching in MPI-IO
[Diagram: I/O software stack – applications, Parallel netCDF, MPI-IO, PVFS, storage devices – with file caching addressed at the MPI-IO layer]
File Caching for Parallel Apps

• Why file caching?
  – Improves performance for repeated file access
  – Enables a write-behind strategy (see the sketch below)
    • Accumulates multiple small writes to better utilize network bandwidth
    • May balance the workload for irregular I/O patterns
    • Useful for checkpointing
  – Enables data pre-fetching
    • Useful for read-only applications (parallel data mining, visualization)
• Why not just use traditional caching strategies?
  – Each client caches independently, leading to cache incoherence
  – I/O servers are in charge of cache coherence control, a potential source of I/O serialization
  – Inadequate for parallel environments where application clients frequently read/write shared files
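A minimal sketch of the write-behind idea above: small sequential writes accumulate in a local buffer and reach the file system as a single larger request. The buffer size, struct, and function names are illustrative, not the actual MPI-IO caching implementation, and each individual write is assumed to fit in the buffer.

#include <string.h>
#include <mpi.h>

#define WB_SIZE (4 * 1024 * 1024)          /* illustrative buffer size */

struct wb_buf { char data[WB_SIZE]; MPI_Offset base; size_t used; };

/* flush the accumulated region as one large write */
static void wb_flush(MPI_File fh, struct wb_buf *w)
{
    if (w->used > 0)
        MPI_File_write_at(fh, w->base, w->data, (int)w->used,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    w->used = 0;
}

/* accumulate a small write; flush only when it cannot be appended */
static void wb_write(MPI_File fh, struct wb_buf *w, MPI_Offset off,
                     const void *buf, size_t len)
{
    if (w->used > 0 &&
        (off != w->base + (MPI_Offset)w->used || w->used + len > WB_SIZE))
        wb_flush(fh, w);
    if (w->used == 0)
        w->base = off;
    memcpy(w->data + w->used, buf, len);
    w->used += len;
}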
Caching Sub-system in MPI-IO
• Application-aware file caching
  – A user-level implementation in the MPI-IO library
  – MPI communicators define the subsets of processes operating on a shared file
  – Processes cooperate with each other to perform caching
  – Data cached in one client can be directly accessed by another
  – Moves cache coherence control from servers to clients
  – Distributed coherence control (less overhead)
• Supports both collective and independent I/O

[Diagram: client processors contribute local cache buffers in memory to a global cache pool and reach the I/O servers over the network interconnect]
Design

• Cache metadata
  – File-block based granularity
  – Cyclically stored across all processes
• Global cache pool
  – Comprises the local memory of all processes
  – A single copy of file data, to avoid coherence issues
• Two implementations:
  – Using an I/O thread (POSIX thread)
  – Using the MPI remote-memory-access (RMA) facility (see the sketch below)

[Diagram: the file is logically partitioned into blocks; each block's status entry is stored cyclically across processes P0-P3, and each process contributes local memory pages to the global cache pool]
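A hedged sketch of the RMA variant referenced above: the status entry for block b lives on process b mod nprocs, and a reader fetches it with a passive-target get. The block_status layout, window setup, and names are illustrative assumptions, not the actual implementation.

#include <mpi.h>

/* Illustrative metadata record; the real layout is not shown in the slides. */
struct block_status {
    int cached;   /* is the block currently held in some local cache pool? */
    int owner;    /* rank whose local memory holds the cached page          */
    int page;     /* page index within the owner's cache pool               */
};

/* Each process exposes its slice (blocks b with b % nprocs == rank)
 * of the distributed metadata through an MPI window. */
static struct block_status *my_meta;
static MPI_Win meta_win;

static void lookup_block(int block, int nprocs, struct block_status *out)
{
    int owner = block % nprocs;                        /* cyclic placement */
    MPI_Aint disp = (MPI_Aint)(block / nprocs) * sizeof(struct block_status);

    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, meta_win);
    MPI_Get(out, sizeof(*out), MPI_BYTE, owner, disp,
            sizeof(*out), MPI_BYTE, meta_win);
    MPI_Win_unlock(owner, meta_win);                   /* get completes here */
}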
Example Read Operation

[Diagram: a read first performs a metadata lookup for the requested file block across the distributed status entries (P0-P3) and locks the entry; if the block is already cached, the page is fetched directly from the owner's local memory, otherwise it is brought in from the file system; the entry is then unlocked]
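The same walk-through as a hedged code sketch. lock_block, unlock_block, fetch_remote_page, cache_locally, update_block_status, and BLOCK_SIZE are hypothetical helpers standing in for the distributed lock, RMA transfer, and metadata update steps; lookup_block is the sketch from the Design slide.

/* Illustrative read path, not the actual MPI-IO caching code. */
static void cached_read(MPI_File fh, int block, int nprocs, int rank, void *dst)
{
    struct block_status st;

    lock_block(block);                          /* hypothetical distributed lock  */
    lookup_block(block, nprocs, &st);           /* metadata lookup (sketch above) */

    if (st.cached) {
        /* already cached: fetch the page directly from the owner's pool */
        fetch_remote_page(st.owner, st.page, dst);
    } else {
        /* not yet cached: read from the parallel file system ... */
        MPI_File_read_at(fh, (MPI_Offset)block * BLOCK_SIZE, dst,
                         BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);
        /* ... keep a copy in the local cache pool and publish the status */
        cache_locally(block, dst);
        st.cached = 1;
        st.owner  = rank;
        update_block_status(block, nprocs, &st);
    }
    unlock_block(block);
}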
Future Work

• Data pre-fetching
  – Instructional (through MPI info) and non-instructional (based on sequential access)
• Collective write-behind for data check-pointing
• Stand-alone distributed lock sub-system
  – Using the MPI-2 remote-memory access facility
• Design new MPI file hints for caching
• Application I/O pattern study
  – Structured/unstructured AMR
Data-type I/O in PVFS
[Diagram: I/O software stack – applications, Parallel netCDF, MPI-IO, PVFS, storage devices – with data-type I/O addressed at the PVFS layer]
Non-contiguous I/O

• Four types
  – Contiguous both in memory and file
  – Contiguous in memory, non-contiguous in file
  – Non-contiguous in memory, contiguous in file
  – Non-contiguous both in memory and file
• Each segment is an I/O request of (offset, length) (see the sketch below)
[Diagram: memory and file layouts for each of the four access types]
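A hedged sketch of describing one of the cases above (contiguous in memory, non-contiguous in file) with standard MPI-IO: an MPI derived datatype set as the file view lets a single collective call cover every (offset, length) segment. The function and parameter names are illustrative.

#include <mpi.h>

/* Write nseg segments of seglen doubles, spaced `stride` doubles apart in
 * the file, from one contiguous memory buffer, in a single collective call. */
void strided_write(MPI_File fh, double *buf, int nseg, int seglen, int stride)
{
    MPI_Datatype filetype;

    MPI_Type_vector(nseg, seglen, stride, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* the file view maps the contiguous buffer onto the strided file layout */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, nseg * seglen, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
}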
Implementations

• POSIX I/O (sketched below)
  – One call per (offset, length)
  – Generates a large number of I/O requests
• Data sieving
  – A single (offset, length) covering multiple segments
  – Accesses unused data and introduces consistency control overhead
• List I/O
  – A single call handles multiple non-contiguous accesses
  – Passes multiple (offset, length) pairs across the network
[Diagram: with POSIX I/O the application process issues one I/O request per segment to the client-side file system; with list I/O a single list request travels through the client-side file system over the network to the server-side file system]
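A minimal sketch of the POSIX-style approach referenced above, which is exactly what inflates the request count: one pread per (offset, length) segment into a contiguous destination buffer. The function name and error handling are illustrative.

#include <unistd.h>
#include <sys/types.h>

/* Read nseg (offset, length) segments, one POSIX call per segment. */
ssize_t read_segments(int fd, char *buf,
                      const off_t *offsets, const size_t *lengths, int nseg)
{
    ssize_t total = 0;
    for (int i = 0; i < nseg; i++) {
        ssize_t n = pread(fd, buf + total, lengths[i], offsets[i]);
        if (n < 0)
            return -1;             /* give up on the first failed segment */
        total += n;
    }
    return total;                  /* list I/O would ship all segments at once */
}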
Data-type I/O

• A single request travels all the way to the servers
• Abandons the offset-length pair representation
  – Borrows the MPI datatype concept to describe non-contiguous access patterns (illustrated below)
  – New file system data types
  – New file system interfaces
• An implementation in PVFS
  – Both client and server sides
[Diagram: a single datatype I/O request flows from the application process through the PVFS client, over the network, to the PVFS server]
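A purely illustrative comparison of the two representations; these structs are hypothetical and do not reflect PVFS's actual datatype I/O interfaces. The point is that a regular strided pattern collapses to a constant-size description instead of a list that grows with the number of segments.

#include <sys/types.h>

/* Hypothetical compact, datatype-style description of a strided pattern:
 * the server expands it into individual accesses locally. */
struct dtio_vector_req {
    off_t  base;       /* starting file offset                       */
    size_t blocklen;   /* bytes per contiguous segment               */
    size_t stride;     /* distance between segment starts, in bytes  */
    int    count;      /* number of segments                         */
};

/* Hypothetical list-I/O request for the same pattern: `count` explicit
 * (offset, length) pairs must cross the network. */
struct listio_req {
    off_t  *offsets;
    size_t *lengths;
    int     count;
};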
Summary of Accomplishments
• High-level I/O
  – Parallel netCDF
• Low-level I/O
  – MPI-IO file caching
• Parallel file system
  – Data-type I/O in PVFS

[Diagram: Parallel netCDF / MPI-IO / PVFS stack]
Future Research
Typical Components in I/O Systems

• Based on many current applications
• High-level
  – E.g., NetCDF, HDF, ABC
  – Applications use these
• Mid-level
  – E.g., MPI-IO
  – Performance experience
• Low-level
  – E.g., file systems
  – Critical for the performance of the layers above
• More access information is lost as more components are used
[Diagram: compute nodes running the applications, Parallel netCDF/HDF5, MPI-IO, and the client-side file system, connected over a network to I/O servers; end-to-end performance is critical]
[Annotated I/O stack diagram (applications, Parallel netCDF/HDF5, MPI-IO, client-side file system, network, I/O servers). Capabilities noted across the layers:]

• Collectives and independents; I/O hints: access style (read_once, write_mostly, sequential, random, …), collective buffering, chunking, striping (see the sketch below)
• Open modes (O_RDONLY, O_WRONLY, O_SYNC), file status, locking, flushing, cache invalidation; machine dependent: data shipping, sparse access, double buffering
• Access based on file blocks or objects, scheduling, aggregation
• Read-ahead, write-behind, metadata management, file striping, security, redundancy
• Saving attributes along with data, external data types (byte alignment), data structures (flexible dimensionality), hierarchical data model
• Access patterns: shared files, individual files, data partitioning, check-pointing, data structures, inter-data relationships

[Further capabilities noted on the same diagram:]

• Application-aware caching, pre-fetching, file grouping, "vector of bytes", flexible caching control, object-based data alignment, memory-file layout mapping, more control over hardware, shared file descriptors
• Group locks, flexible locking control, scalable metadata management, zero-copying, QoS, shared file descriptors
• Active storage: data filtering, object-based/hierarchical storage management, indexing, mining, power management
• Caching, fault tolerance, read-ahead, write-behind, I/O load balance, wide-area, heterogeneous file system support, thread safety
• Graph-based data model
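A small sketch of passing the kind of access hints listed above through MPI-IO. "access_style", "cb_buffer_size", and "striping_factor" are standard/ROMIO hint keys; whether a given implementation honors them is file-system dependent, and the values here are illustrative.

#include <mpi.h>

/* Open a file with hints describing the expected access pattern. */
void open_with_hints(MPI_File *fh)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style", "write_mostly,sequential");
    MPI_Info_set(info, "cb_buffer_size", "16777216");   /* collective buffering */
    MPI_Info_set(info, "striping_factor", "8");         /* file striping width */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}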
Goal

Decouple "What" from "How", and be proactive.

• Current state: user burdened, ineffective interfaces, non-communicating layers

[Diagram: applications (App 1-4) sit above an I/O software optimization layer that understands access characteristics (streaming, small/large, regular/irregular, local/remote, configuration, s/w layer) and applies optimizations (caching, collective I/O, reorganization, load balance, fault tolerance) across FS, DM, datasets, and HSS, to deliver speed, BW, latency, and QoS]
Component Design for I/O

• Application-aware
  – Capture the application's file access information
  – Relationships between files, objects, and users
• Environment-aware
  – Network (reliability, security), storage devices (active disks)
• Context-aware
  – Binding data attributes to files, indexing for fast search
• High-performance I/O needs support from
  – Languages + compilers
  – I/O libraries
  – File systems
  – Storage devices
Component Interface Design

• Informative
  – Should deliver access/storage information top-down and bottom-up
• Flexibility
  – Should describe arbitrary data distributions in memory buffers, files, and storage devices
• Functionality
  – Asynchronous operations, read-ahead, write-behind, replication
  – Provides room for additional innovation
• Object-based I/O
  – For hardware control (I/O co-processors, active disks, object-based file systems, etc.)
Future Work in MPI-IO

• Investigate interface extensions
• Client-side caching sub-system
  – Implementations of various I/O strategies: buffering, pre-fetching, replication, migration
  – Adaptive caching mechanisms and algorithms for optimizing different access patterns
• Distributed mutual-exclusion locking sub-system
  – Shared resources, such as files and memory
  – Pipelined locking (overlap lock waiting time with I/O)
• Work with HDF5 and parallel netCDF
  – Design I/O strategies for metadata and data
    • Metadata: small, overlapping, repeated, strong consistency requirements
    • Array data: large, less frequently updated
Future Work in Parallel File Systems

• File caching (focus on parallel apps)
• File versioning
  – Alternative to file locking
  – Reliability and availability aspects
    • Guarantees atomicity in the presence of client or I/O system failure
    • Can enable efficient RAID-type schemes in parallel file systems (because of atomicity)
    • Dynamic rebalancing of I/O
• File list locks
  – Lock multiple regions in a single request
Active Storage System (reconfigurable system)

[Diagram: an ML310 host connected through a switch to ML310 boards 1-4 and to the external network]

• Xilinx XC2VP30 (Virtex-II Pro family)
  – 30,816 logic cells (3,424 CLBs)
  – 2 embedded PPC405 cores
  – 2,448 Kb BRAM (136 blocks of 18 Kb)
  – 136 dedicated 18x18 multiplier blocks
• Software:
  – Data mining
  – Encryption
  – Functions and runtime libraries
  – Linux micro-kernel
MineBench - data mining benchmark suite