
August 15, 2000
Terry Jones

Lawrence Livermore National Laboratory


Outline

Scope
– Presentation aimed at scientific/technical app writers
– Many interesting topics skipped: fs reliability, fs maintenance, ...

Brief Overview of GPFS Architecture
– Disk Drive Basics
– GPFS -vs- NFS -vs- JFS

What’s New In GPFS
– libgpfs
– Metadata improvements
– Performance oriented changes

Application Do’s and Don’ts
– 5 Tips that will keep your codes humming

Recent Performance Measurements
– “You can’t put tailfins on data” -- H.B.


Disk Drive Basics

Disk Speed is advancing, but slower than processor speed
Disk Parallelism

Name | Year | Diameter (inches) | Capacity (MBytes) | Speed (RPM) | Mean Seek Time (ms) | Density (Mbits/in2) | Interface Speed (MBytes/sec) | CPU Speed (Intel's SPEC95)
IBM 3380 | 1981 | 14 | 645 | 3600 | 18.0 | 2.27 | 3.0 | 0.06
Wren 7 | 1987 | 5.25 | 613 | 3600 | 16.5 | 15.1 | 5.0 | 0.14
IBM 0661 | 1991 | 3.5 | 320 | 4320 | 12.6 | 19.4 | 10.0 | 0.82
IBM DMDM-10340 | 1999 | 1.0 | 340 | 4500 | 15.0 | 242.0 | 5.2 | 35.60

Disks are mechanical thingies => huge latencies
Performance is a function of buffer size
Caching


File System Architectures

[Diagram: GPFS configuration. Compute nodes run the application and mmfsd (one mmfsd acts as Stripe Group Manager / Token Manager Server, another as the Metanode); I/O nodes run the VSD layer; compute and I/O nodes are connected by the switch.]

GPFS Parallel File System
• GPFS fs are striped across multiple disks on multiple I/O nodes
• All compute nodes can share all disks
• Larger files, larger file systems, higher bandwidths

JFS (Native AIX fs)
• No file sharing - on-node only
• Apps must do own partitioning

NFS / DFS / AFS (clients reach a single storage controller through an unsecure network)
• Clients share files on server node
• Server node becomes bottleneck


New in Version 1.3

Multiple levels of indirect blocks => larger max file size, max number of files increases
SDR interaction improved
VSD KLAPI (flow control, one less copy, relieves usage of CSS send and receive pool)
Prefetch algorithms now recognize strided & reverse sequential access
Batching => much better metadata performance


New Programming Interfaces

Support for mmap functions as described in X/Open 4.2

libgpfs

– gpfs_prealloc(): preallocates space for a file that has already been opened, prior to writing data to the file; use it to ensure sufficient space is available (a sketch of the call follows below).

– gpfs_stat() and gpfs_fstat(): provide exact mtime and ctime.

– gpfs_getacl(), gpfs_putacl(), gpfs_fgetattrs(), gpfs_putfattrs(): ACLs for products like ADSM.

– gpfs_fcntl(): provides hints about access patterns. Used by MPI-IO. Can be used to set an access range, clear the cache, and invoke Data Shipping...
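A minimal sketch of the gpfs_prealloc() call: the path, size, and error handling are illustrative, and the exact header name and offset type should be checked against the GPFS Programming Guide for the release you run (link with -lgpfs).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <gpfs.h>                      /* libgpfs: gpfs_prealloc() */

int main(void)
{
    /* gpfs_prealloc() operates on a file that is already open for writing. */
    int fd = open("/gpfs/scratch/restart.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 GB starting at offset 0 so later writes cannot fail
     * mid-run for lack of space. */
    if (gpfs_prealloc(fd, 0, (offset_t)1 << 30) != 0) {
        perror("gpfs_prealloc");
        close(fd);
        return 1;
    }

    /* ... write the application data here ... */

    close(fd);
    return 0;
}
```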


GPFS Data Shipping

Not to be confused with IBM’s MPI-IO data shipping
A mechanism for coalescing small I/Os on a single node for disk efficiency (no distributed locking overhead, no false sharing)
It’s targeted at users who do not use MPI
It violates POSIX semantics and affects other users of the same file
Operation:
– Data shipping entered via collective operation at file open time
– Normal access from other programs inhibited while file is in DS mode
– Each member of the collective assigned to serve a subset of blocks
– Reads and writes go through this server regardless of where they originate
– Data blocks only buffered at server, not at other clients
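For illustration only, a sketch of how the gpfs_fcntl() data shipping hint is issued. The header name, structure names (gpfsFcntlHeader_t, gpfsDataShipStart_t), and constants follow later GPFS documentation and are assumptions here, so treat this as the shape of the collective call rather than exact 1.3 code.

```c
#include <gpfs_fcntl.h>   /* assumed header for the gpfs_fcntl() hint structures */

/* Every task that opened the file issues the same DATA_SHIP_START hint;
 * num_tasks is the agreed-upon size of the collective (an assumption of
 * this sketch, not a GPFS-mandated value). */
static int enter_data_shipping(int fd, int num_tasks)
{
    struct {
        gpfsFcntlHeader_t   hdr;
        gpfsDataShipStart_t start;
    } arg;

    arg.hdr.totalLength    = sizeof(arg);
    arg.hdr.fcntlVersion   = GPFS_FCNTL_CURRENT_VERSION;
    arg.hdr.errorOffset    = 0;
    arg.hdr.fcntlReserved  = 0;

    arg.start.structLen    = sizeof(arg.start);
    arg.start.structType   = GPFS_DATA_SHIP_START;    /* enter DS mode */
    arg.start.numInstances = num_tasks;                /* all tasks must agree */
    arg.start.reserved     = 0;

    /* The hint takes effect once every member of the collective has issued
     * it; the file stays in DS mode until a matching stop request. */
    return gpfs_fcntl(fd, &arg);
}
```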


Application Do’s and Don’ts #1: Effect of Buffer Size

[Chart: Writer Performance. Bandwidth (MB/sec, 0 to 200) vs. buffer size (1 KB to 4096 KB).]

Avoid small (less than 64K) reads and writes
Use block-aligned records
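A minimal sketch of the two tips above, assuming a hypothetical 512 KB file system blocksize and a fixed 1 KB application record: small records are staged in memory and flushed as one full, block-aligned write() instead of hundreds of tiny ones.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define GPFS_BLOCK        (512 * 1024)            /* assumed fs blocksize */
#define RECORD_SIZE       1024                    /* assumed record size  */
#define RECORDS_PER_BLOCK (GPFS_BLOCK / RECORD_SIZE)

/* Hypothetical stand-in for the application's record generator. */
static void fill_record(char *dst, long rec)
{
    memset(dst, 0, RECORD_SIZE);
    snprintf(dst, RECORD_SIZE, "record %ld", rec);
}

int main(void)
{
    int fd = open("/gpfs/scratch/records.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(GPFS_BLOCK);               /* one block of staging */
    long nrecords = 512 * RECORDS_PER_BLOCK;      /* multiple of a block  */

    for (long rec = 0; rec < nrecords; rec++) {
        fill_record(buf + (rec % RECORDS_PER_BLOCK) * RECORD_SIZE, rec);

        /* Flush once per full block: every write() is 512 KB and starts
         * on a 512 KB boundary, never a sub-64 KB request. */
        if ((rec + 1) % RECORDS_PER_BLOCK == 0) {
            if (write(fd, buf, GPFS_BLOCK) != GPFS_BLOCK) {
                perror("write");
                break;
            }
        }
    }

    free(buf);
    close(fd);
    return 0;
}
```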


Application Do’s and Don’ts #2

Avoid overlapped writes

[Diagram: an 8 GB file written by four nodes. Node 1 locks offsets 0 - 2 GB, node 2 locks offsets 2 - 4 GB, node 3 locks offsets 4 - 6.1 GB, node 4 locks offsets 6 - 8 GB. File offsets 0 - 6.0 GB and 6.1 - 8.0 GB are accessed in parallel with no conflict; the conflict over 6.0 - 6.1 GB is detected by the lock manager, and only one node may access that region at a time.]
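A sketch of one way to stay out of the conflict case above: each task derives a disjoint, block-aligned byte range from its task number and writes only inside it with pwrite(). The blocksize, file name, and task numbering are illustrative.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

#define GPFS_BLOCK (512LL * 1024)   /* assumed fs blocksize */

/* Task `rank` owns the byte range [rank * bytes_per_task,
 * (rank + 1) * bytes_per_task); bytes_per_task is a multiple of the
 * blocksize, so no two tasks ever contend for the same byte-range lock. */
static int write_my_slice(const char *path, int rank,
                          off_t bytes_per_task, const char *my_data)
{
    off_t start = (off_t)rank * bytes_per_task;

    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return -1; }

    for (off_t done = 0; done < bytes_per_task; done += GPFS_BLOCK) {
        off_t left = bytes_per_task - done;
        size_t n = (size_t)(left < GPFS_BLOCK ? left : GPFS_BLOCK);

        /* pwrite() targets only this task's own offsets, so the regions
         * written by different tasks never overlap. */
        if (pwrite(fd, my_data + done, n, start + done) < 0) {
            perror("pwrite");
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}

int main(void)
{
    /* Illustration: pretend to be task 2 of 4, each task owning 2 MB. */
    static char my_data[2 * 1024 * 1024];
    return write_my_slice("/gpfs/scratch/shared.dat", 2,
                          (off_t)sizeof(my_data), my_data) ? 1 : 0;
}
```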


Application Do’s and Don’ts #3

Partition data so as to make a given client’s accesses largely sequential (e.g. segmented data layouts are optimal; strided layouts require more token management).

[Diagram: segmented vs. strided data layouts across clients.]
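To make the two layouts above concrete, here is a small sketch of the file offset a client uses for its i-th record under each scheme; the record size and client count are illustrative.

```c
#include <stdio.h>
#include <sys/types.h>

/* Segmented layout: client `rank` owns one contiguous region of the file,
 * so its accesses are purely sequential and stay within one region's
 * byte-range tokens. */
static off_t segmented_offset(int rank, long i, long recs_per_client,
                              size_t rec_size)
{
    return ((off_t)rank * recs_per_client + i) * rec_size;
}

/* Strided layout: clients interleave record by record, so each client's
 * accesses hop across the whole file and the token manager must juggle
 * many small, interleaved ranges. */
static off_t strided_offset(int rank, long i, int nclients, size_t rec_size)
{
    return ((off_t)i * nclients + rank) * rec_size;
}

int main(void)
{
    const int nclients = 4;
    const long recs_per_client = 4;
    const size_t rec_size = 512 * 1024;   /* assume one block per record */

    for (long i = 0; i < recs_per_client; i++)
        printf("client 1, record %ld: segmented offset %lld, strided offset %lld\n",
               i,
               (long long)segmented_offset(1, i, recs_per_client, rec_size),
               (long long)strided_offset(1, i, nclients, rec_size));
    return 0;
}
```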


Application Do’s and Don’ts #4

Use access patterns that can exploit prefetching and write-behind (e.g. sequential access is good, random access is bad). Use multiple opens if necessary.

[Diagram: a compute node (app plus mmfsd, with Stripe Group Manager / Token Manager Server and Metanode roles) connected through the switch to three I/O nodes holding file blocks 1-6. GPFS stripes successive blocks of each file across successive disks; for instance, disks 1, 2, and 3 may be read concurrently if the access pattern is a match.]
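One way to act on the "multiple opens" suggestion, as a hedged POSIX sketch with made-up paths and region boundaries: a task that must walk two far-apart regions of the same file gives each region its own descriptor, so each descriptor sees a clean forward-sequential pattern that prefetch can recognize (64-bit file offsets are assumed).

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

#define CHUNK (512 * 1024)   /* read in block-sized pieces */

int main(void)
{
    /* Two opens of the same file: one descriptor per region, so each
     * descriptor advances strictly forward instead of one descriptor
     * seeking back and forth between the two regions. */
    int fd_lo = open("/gpfs/scratch/mesh.dat", O_RDONLY);
    int fd_hi = open("/gpfs/scratch/mesh.dat", O_RDONLY);
    if (fd_lo < 0 || fd_hi < 0) { perror("open"); return 1; }

    /* Region boundaries are illustrative. */
    lseek(fd_lo, (off_t)0, SEEK_SET);
    lseek(fd_hi, (off_t)6 * 1024 * 1024 * 1024, SEEK_SET);

    char *buf_lo = malloc(CHUNK);
    char *buf_hi = malloc(CHUNK);

    for (int i = 0; i < 1024; i++) {
        /* Each stream is sequential, so prefetch can run ahead of it. */
        if (read(fd_lo, buf_lo, CHUNK) <= 0) break;
        if (read(fd_hi, buf_hi, CHUNK) <= 0) break;
        /* ... process the two chunks ... */
    }

    free(buf_lo); free(buf_hi);
    close(fd_lo); close(fd_hi);
    return 0;
}
```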


Application Do’s and Don’ts #5

Use MPI-IO

– Collective operations

– Predefined Data Layouts for memory & file

– Opportunities for coalescing and improved prefetch/write-behind

– (See notes from Dick Treumann’s MPI-IO talk)
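As a companion to the list above, a minimal MPI-IO sketch (file name and sizes illustrative): every rank contributes one contiguous, non-overlapping slab, and the collective MPI_File_write_at_all() gives the library the chance to coalesce requests and drive prefetch/write-behind.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns one contiguous, non-overlapping slab of the file. */
    const int count = 65536;                       /* doubles per rank */
    double *slab = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        slab[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/field.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: the MPI-IO layer can coalesce the per-rank
     * requests instead of issuing many independent POSIX writes. */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, slab, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}
```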


Read & Write Performance (taken from the ASCI White machine)

IBM RS/6000 SP System
4 switch frames (1 in use at present)
4 Nighthawk-2 nodes per frame
16 375 MHz 64-bit IBM Power3 CPUs per node
Now: 120 nodes & 1920 CPUs; Later: 512 nodes & 8192 CPUs
Now: 8 GBytes RAM per node (0.96 TBytes total memory); Later: 266 nodes @ 16 GB, 246 @ 8 GB (6.2 TBytes total memory)
Now: peak ~2.9 TFLOPS; Later: peak ~12.3 TFLOPS
Now: 4 I/O nodes (dedicated); Later: 16 I/O nodes (dedicated)
4 RIO optical extenders per node (the machine requires too much floor space for copper SSA cabling)
4 SSA adapters per RIO
3 73-GByte RAID sets per SSA adapter
Now: 14.0 TB total disk space; Later: 150.0 TB total disk space
5 disks in each RAID set (4 + P)
– parity information is distributed among drives (i.e., not a dedicated parity drive)


Pre-upgrade White I/O diagram: 4*(16*(3*(4+1)))

[Diagram: 116 NH2 compute nodes and 4 NH2 I/O nodes (sixteen 375 MHz Power3 CPUs each) connected by the Colony switch (1 switch per frame) through Colony adapters. Each I/O node has four 3500 MB/sec RIOs; each RIO connects to 4 Santa Cruz adapters.]

Each disk is an IBM SSA disk: max 8.5 MB/sec, typical 5.5 MB/sec, total capacity 18.2 GB.
Each Santa Cruz adapter is capable of ~48 MB/sec writes and ~88 MB/sec reads.
Disks are configured into 4+P RAID sets (73 GB); each RAID set is capable of 14 MB/sec.
Colony adapter: point-to-point ~400 MB/sec unidirectional.
Switch comm is the server bottleneck: ~1600 MB/sec total.
Clients (VSDs) communicate with the LAPI protocol.
Speculation: switch comm is the client bottleneck: ~340 MB/sec/task.


Single Client Performance

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Bandwidth -vs- BufferSize, single client, 2048 GB file. Bandwidth (MB/sec, 0 to 400) vs. buffer size (1 KB to 4096 KB). Series: Create, Read.]


Performance Increase for 1x1 to 2x2

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Bandwidth -vs- BufferSize, single file, segmented layout, 512 MB/client. Bandwidth (MB/sec, 0 to 900) vs. buffer size (1 KB to 4096 KB). Series: 1x1 Create, 1x1 Read, 2x2 Create, 2x2 Read.]


Multi-Node Scalability

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: NClients -vs- Bandwidth, single file, segmented layout, 512 MB/client. Bandwidth (MB/sec, 0 to 1800) vs. number of client nodes (1 to 96). Series: 1 Create, 1 Read, 2 Create, 2 Read.]


On Node Scalability

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: NTasks -vs- Bandwidth, two NH2 nodes, single file, segmented layout, 512 MB/task. Bandwidth (MB/sec, 0 to 1600) vs. number of tasks (1 to 16). Series: Create, Read.]


Metadata Improvements #1: Batching on Removes

Segmented Block Allocation Map
– Each segment contains bits representing blocks on all disks; each segment is lockable
– Minimize contention for allocation map

Machine: snow.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 2*(3*(3*(4+1))); CPU: 222 MHz P3; Node: NH-1 (8-way); Server cache: 64*512K; Client cache: 20 MB; Metadata cache: 1000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Large File Removal Rate. MB/s (0 to 450000) vs. # clients (1, 2, 4, 8). Series: GPFS 1.3, GPFS 1.2, NFS, JFS.]


Metadata Improvements #2: Batching on Creates

Machine: snow.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 2*(3*(3*(4+1))); CPU: 222 MHz P3; Node: NH-1 (8-way); Server cache: 64*512K; Client cache: 20 MB; Metadata cache: 1000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Small File Creation Rate, 1000 29-byte files per client. Files/s (0 to 900) vs. # clients (1, 2, 4, 8). Series: GPFS 1.3, GPFS 1.2, NFS, JFS.]


Metadata Improvements #3: Batching on Dir Operations

Machine: snow.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 2*(3*(3*(4+1))); CPU: 222 MHz P3; Node: NH-1 (8-way); Server cache: 64*512K; Client cache: 20 MB; Metadata cache: 1000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Directory Creation Rate, 1 directory of depth 1000 per client. Dirs/s (0 to 400) vs. # clients (1, 2, 4, 8). Series: GPFS 1.3, GPFS 1.2, NFS, JFS.]


“…and in conclusion…”

Further Info
– Redbooks at http://www.redbooks.ibm.com (2, soon to be 3)
– User guides at http://www.rs6000.ibm.com/resource/aix_resource/sp_books/gpfs/index.html
– Europar 2000 paper on MPI-IO
– Terry Jones, Alice Koniges, Kim Yates, “Performance of the IBM General Parallel File System”, Proc. International Parallel and Distributed Processing Symposium, May 2000.
– Message Passing Interface Forum, “MPI-2: A Message Passing Interface Standard”, Standards Document 2.0, University of Tennessee, Knoxville, July 1997.

Acknowledgements
– The metadata measurements (file creation & file deletion) are due to our summer intern, Bill Loewe.
– IBMers who reviewed this document for accuracy: Roger Haskin, Lyle Gayne, Bob Curran, Dan McNabb.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.