
© 2009 IBM Corporation

pNFS, POSIX, and MPI-IO: A Tale of Three Semantics

Dean Hildebrand, Roger Haskin (IBM Almaden)
Arifa Nisar (Northwestern University)

Dean Hildebrand – Research Staff Member

PDSW 2009


Agenda

Motivation

pNFS

HPC consistency requirements

Protocol consistency semantics

NFSv3 ADIO driver

pNFS ADIO driver


Motivation: Commodity parallel file system clients

• Supercomputers can be connected with multiple parallel file systems
  – GPFS, PVFS2, PanFS, Lustre

• Want single file system client to access all available storage systems

[Figure: Supercomputers, I/O Nodes, and Storage; File System Clients connect to the Storage Systems]


pNFS

[Figure: pNFS architecture. Labels: Client with VFS API, pNFS Client, Layout Driver, and I/O & Policy API; pNFS Server (State Server) on a Parallel File System; Data Servers; Parallel I/O, Metadata, and Management Protocol paths]

NFSv4.1/pNFS now has:

• Integrated locking protocol

• Open/Close with per-object change attribute


pNFS with GPFS

[Figure: File-based NFSv4.1 Clients (AIX, Solaris, Linux); GPFS Data and State Servers; Storage (GPFS NSD Servers or SAN RAID controllers)]

• Fully-symmetric GPFS architecture: scalable data and metadata
  – pNFS client can mount and retrieve layout from any GPFS node
  – Metadata requests can be load balanced across cluster

• pNFS server and native GPFS clients can share the same file system
  – Backup, dedup, and other mgmt functions don’t need to be done over NFS

• Need robust interface between NFSD and GPFS


HPC consistency requirements

• Different I/O workloads have different requirements
  – Checkpoint
    • Write-only
    • No revalidation to new file
  – Ingest/Restart
    • Read-only
    • Revalidation on Open
  – sync-barrier-sync
    • Sync data to disk
    • Force revalidate/invalidate
  – Others?


MPI-IO: sync-barrier-sync

• Applications sometimes need to share computed results among processes/nodes

• Want to avoid MPI atomic mode to improve performance
  – Atomic mode requires enforcing strict consistency semantics
  – Writes must be immediately visible to other processes

• Sync-barrier-sync allows compute nodes to synchronize I/O operations between themselves
  – Sync #1 guarantees that the data written by all nodes is transferred to storage.
  – Barrier ensures that writes on all nodes complete prior to reads.
  – Sync #2 guarantees that all transferred data is visible to all processes.
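
The three steps can be sketched without an MPI installation. The example below is only an illustration under stated assumptions: threads stand in for MPI processes, `os.fsync` stands in for the first `MPI_File_sync`, `threading.Barrier` stands in for `MPI_Barrier`, and a close followed by a re-open stands in for the second `MPI_File_sync` (mirroring close-to-open revalidation). The names `worker`, `run_sync_barrier_sync`, and `BLOCK` are hypothetical, and a local temporary file replaces the shared parallel file system.

```python
import os
import tempfile
import threading

BLOCK = 16  # bytes written by each worker (arbitrary size for this sketch)


def worker(rank, nprocs, path, barrier, results):
    """Write own block, sync, wait at the barrier, revalidate, read a neighbour's block."""
    payload = bytes([ord("A") + rank]) * BLOCK

    fd = os.open(path, os.O_RDWR)
    os.pwrite(fd, payload, rank * BLOCK)

    # Sync #1: transfer this worker's written data to storage.
    os.fsync(fd)

    # Barrier: no worker reads until every worker has finished its flush.
    barrier.wait(timeout=30)

    # Sync #2 (assumed stand-in): close and re-open so the next read
    # starts from a revalidated view of the file.
    os.close(fd)
    fd = os.open(path, os.O_RDONLY)

    neighbour = (rank + 1) % nprocs
    results[rank] = os.pread(fd, BLOCK, neighbour * BLOCK)
    os.close(fd)


def run_sync_barrier_sync(nprocs=4):
    """Run the pattern with nprocs threads; return what each one read."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, nprocs * BLOCK)
    os.close(fd)

    barrier = threading.Barrier(nprocs)
    results = [None] * nprocs
    threads = [
        threading.Thread(target=worker, args=(r, nprocs, path, barrier, results))
        for r in range(nprocs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.unlink(path)
    return results
```

Each worker reads the block written by the next rank, so a correct run returns every neighbour's payload. On a real cluster, the same ordering would be expressed with the MPI-IO calls named above rather than these local substitutes.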


Sync-Barrier-Sync

[Figure: Compute Nodes 1, 2, and 3 connected through the Network to the Storage System]

Each node has data in its cache


1. SYNC: Place written data on filing servers


2. BARRIER: Clients wait for other clients to flush dirty data to the servers, ensuring that no client issues read requests until all clients have the same view of file contents.


3. SYNC: Perform file revalidation by ensuring written data is visible to all nodes.


Nodes read new data


Protocol consistency semantics

Semantics                   Data Flush                      Revalidation
POSIX (last writer wins)    Fsync, Close                    Open, Read
MPI-IO (non-atomic)         MPI_File_sync, MPI_File_close   MPI_File_sync, MPI_File_open
pNFS (close-to-open)        Close, Fcntl unlock, Fsync      Open, Fcntl lock, “____”

Notes:

• POSIX does not require a revalidation primitive

• NFS lacks a primitive to leverage the per-object change attribute
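
As an illustration of the lock/unlock pair in the pNFS row, the sketch below brackets a write and a read with POSIX byte-range locks through Python's `fcntl.lockf`. It rests on an assumption that should be checked for a given client: that an NFS client flushes dirty data before releasing a lock and revalidates its cache when acquiring one. On a local file system, as used here, the calls only take and release locks. The function names and the temporary file are hypothetical.

```python
import fcntl
import os
import tempfile


def write_then_unlock(path, offset, data):
    """Writer side: write under an exclusive lock, flush, then unlock."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)
        os.pwrite(fd, data, offset)
        os.fsync(fd)                    # explicit data flush
        fcntl.lockf(fd, fcntl.LOCK_UN)  # unlock: the flush-side primitive
    finally:
        os.close(fd)


def lock_then_read(path, offset, length):
    """Reader side: take a shared lock (revalidation-side primitive), then read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH)
        data = os.pread(fd, length, offset)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return data
    finally:
        os.close(fd)


def demo():
    """Round-trip a small payload through the two helpers."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_then_unlock(path, 0, b"hello")
        return lock_then_read(path, 0, 5)
    finally:
        os.unlink(path)
```

The empty “____” cell has no counterpart here: nothing on the read side pairs with a bare fsync, which is the gap the notes point out.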


NFSv3 ADIO driver

ROMIO/ADIO

• ROMIO is an I/O implementation of MPI-IO

• ADIO is a portable parallel I/O API that allows file systems to implement MPI-IO semantics

• NFSv3 ADIO driver
  – Forcing NFSv3 to comply with MPI-IO semantics hurts performance
  – Performs multiple close/open or lock/locku to revalidate file
    • Inconsistent NFSv3 implementations
    • Protocol and implementation problems with the NFSv3 lockd daemon
    • Disallows attribute caching

• UFS ADIO driver
  – POSIX compliant file systems


POP-IO NFSv4/Ext3 Read and Write

[Figure: I/O bandwidth (MB/sec, axis 0 to 100) versus number of processes (2, 4, 16, 32). Series: Read UFS Driver, Read NFS Driver, Write UFS Driver, Write NFS Driver]


POP-IO pNFS/GPFS Read and Write

[Figure: two panels, READ and WRITE. Aggregate I/O bandwidth (MB/sec, axis 0 to 100) versus number of processes (2, 4, 16, 32). READ series: Read UFS Driver, Read NFS Driver. WRITE series: Write UFS Driver, Write NFS Driver]


Looking forward: pNFS ADIO driver possibilities

• Some I/O workloads, e.g., checkpointing, require minimal data coherence
  – Try to optimize “all read” or “all write” workloads

• For applications that perform sync-barrier-sync:
  o HEC POSIX extensions for HPC
    o O_LAZY, lazyio_propagate(), lazyio_synchronize()
  o Direct I/O
    o No read or writeback cache
    o Possibly only on read path
  o Non-portable techniques:
    o Manually invalidate entire page cache
    o Fadvise
  o open/close and/or lock/unlock
    o Data must be written to disk
  o User-space client with customized interface
    o Support?
  o Others?
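
The fadvise option can be sketched with `os.posix_fadvise`, which Python exposes on platforms that provide the call (Linux, for example). This is only a sketch of the hint itself: `POSIX_FADV_DONTNEED` is advisory, and whether a particular NFS or pNFS client treats it as a revalidation trigger is an assumption that would need to be tested. The names `drop_cached_pages` and `demo` are hypothetical.

```python
import os
import tempfile


def drop_cached_pages(fd):
    """Flush dirty pages, then hint that cached pages for the file can be dropped."""
    # Flush first: the hint is not expected to discard pages that are still dirty.
    os.fsync(fd)
    # offset 0 with length 0 covers the file from the start to its end.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)


def demo():
    """Write, drop cached pages, and read the data back from the same descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"checkpoint data")
        drop_cached_pages(fd)
        return os.pread(fd, 15, 0)
    finally:
        os.close(fd)
        os.unlink(path)
```

The read after the hint still returns the written bytes; the intent is that they are fetched again rather than served from a stale cache, which is why the slide lists the technique as non-portable.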


Summary

Good: MPI-IO and pNFS share similar relaxed semantics

Bad: HPC apps require on-demand file sync and revalidation (Lazy I/O)

Ugly: Interface to file system through POSIX interface
  – Lacks on-demand file revalidation
  – Need to investigate possible workarounds


Thank You!