
August 15, 2000
Terry Jones

Lawrence Livermore National Laboratory


Outline

Scope
– Presentation aimed at scientific/technical app writers
– Many interesting topics skipped: fs reliability, fs maintenance, ...

Brief Overview of GPFS Architecture
– Disk Drive Basics
– GPFS -vs- NFS -vs- JFS

What’s New In GPFS
– libgpfs
– Metadata improvements
– Performance oriented changes

Application Do’s and Don’ts
– 5 Tips that will keep your codes humming

Recent Performance Measurements
– “You can’t put tailfins on data” -- H.B.


Disk Drive Basics

Disk Speed is advancing, but slower than processor speed
Disk Parallelism

Name | Year | Diameter (inches) | Capacity (MBytes) | Speed (RPM) | Mean Seek Time (ms) | Density (Mbits/in2) | Interface Speed (MBytes/sec) | CPU Speed (Intel's SPEC95)
IBM 3380 | 1981 | 14 | 645 | 3600 | 18.0 | 2.27 | 3.0 | 0.06
Wren 7 | 1987 | 5.25 | 613 | 3600 | 16.5 | 15.1 | 5.0 | 0.14
IBM 0661 | 1991 | 3.5 | 320 | 4320 | 12.6 | 19.4 | 10.0 | 0.82
IBM DMDM-10340 | 1999 | 1.0 | 340 | 4500 | 15.0 | 242.0 | 5.2 | 35.60

Disks are mechanical thingies => huge latencies
Performance is a function of buffer size
Caching


File System Architectures

[Diagram: GPFS configuration. Compute nodes run the application and mmfsd (one mmfsd acts as Stripe Group Manager / Token Manager Server, another as the Metanode); I/O nodes run the VSD layer; compute and I/O nodes are connected by the switch.]

GPFS Parallel File System
• GPFS fs are striped across multiple disks on multiple I/O nodes
• All compute nodes can share all disks
• Larger files, larger file systems, higher bandwidths

JFS (Native AIX fs)
• No file sharing - on-node only
• Apps must do own partitioning

NFS / DFS / AFS (clients reach a single storage controller through an unsecure network)
• Clients share files on server node
• Server node becomes bottleneck


New in Version 1.3

Multiple levels of indirect blocks => larger max file size, max number of files increases
SDR interaction improved
VSD KLAPI (flow control, one less copy, relieves usage of CSS send and receive pool)
Prefetch algorithms now recognize strided & reverse sequential access
Batching => much better metadata performance


New Programming Interfaces

Support for mmap functions as described in X/Open 4.2

libgpfs

– gpfs_prealloc(): preallocates space for a file that has already been opened, prior to writing data to the file; use it to ensure sufficient space is available (a sketch of the call follows below).

– gpfs_stat() and gpfs_fstat(): provide exact mtime and ctime.

– gpfs_getacl(), gpfs_putacl(), gpfs_fgetattrs(), gpfs_putfattrs(): ACLs for products like ADSM.

– gpfs_fcntl(): provides hints about access patterns. Used by MPI-IO. Can be used to set an access range, clear the cache, and invoke Data Shipping...
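A minimal sketch of the gpfs_prealloc() call: the path, size, and error handling are illustrative, and the exact header name and offset type should be checked against the GPFS Programming Guide for the release you run (link with -lgpfs).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <gpfs.h>                      /* libgpfs: gpfs_prealloc() */

int main(void)
{
    /* gpfs_prealloc() operates on a file that is already open for writing. */
    int fd = open("/gpfs/scratch/restart.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 GB starting at offset 0 so later writes cannot fail
     * mid-run for lack of space. */
    if (gpfs_prealloc(fd, 0, (offset_t)1 << 30) != 0) {
        perror("gpfs_prealloc");
        close(fd);
        return 1;
    }

    /* ... write the application data here ... */

    close(fd);
    return 0;
}
```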


GPFS Data Shipping

Not to be confused with IBM’s MPI-IO data shipping
A mechanism for coalescing small I/Os on a single node for disk efficiency (no distributed locking overhead, no false sharing)
It’s targeted at users who do not use MPI
It violates POSIX semantics and affects other users of the same file
Operation:
– Data shipping entered via collective operation at file open time
– Normal access from other programs inhibited while file is in DS mode
– Each member of the collective assigned to serve a subset of blocks
– Reads and writes go through this server regardless of where they originate
– Data blocks only buffered at server, not at other clients
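For illustration only, a sketch of how the gpfs_fcntl() data shipping hint is issued. The header name, structure names (gpfsFcntlHeader_t, gpfsDataShipStart_t), and constants follow later GPFS documentation and are assumptions here, so treat this as the shape of the collective call rather than exact 1.3 code.

```c
#include <gpfs_fcntl.h>   /* assumed header for the gpfs_fcntl() hint structures */

/* Every task that opened the file issues the same DATA_SHIP_START hint;
 * num_tasks is the agreed-upon size of the collective (an assumption of
 * this sketch, not a GPFS-mandated value). */
static int enter_data_shipping(int fd, int num_tasks)
{
    struct {
        gpfsFcntlHeader_t   hdr;
        gpfsDataShipStart_t start;
    } arg;

    arg.hdr.totalLength    = sizeof(arg);
    arg.hdr.fcntlVersion   = GPFS_FCNTL_CURRENT_VERSION;
    arg.hdr.errorOffset    = 0;
    arg.hdr.fcntlReserved  = 0;

    arg.start.structLen    = sizeof(arg.start);
    arg.start.structType   = GPFS_DATA_SHIP_START;    /* enter DS mode */
    arg.start.numInstances = num_tasks;                /* all tasks must agree */
    arg.start.reserved     = 0;

    /* The hint takes effect once every member of the collective has issued
     * it; the file stays in DS mode until a matching stop request. */
    return gpfs_fcntl(fd, &arg);
}
```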


Application Do’s and Don’ts #1: Effect of Buffer Size

[Chart: Writer Performance. Bandwidth (MB/sec, 0 to 200) vs. buffer size (1 KB to 4096 KB).]

Avoid small (less than 64K) reads and writes
Use block-aligned records
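A minimal sketch of the two tips above, assuming a hypothetical 512 KB file system blocksize and a fixed 1 KB application record: small records are staged in memory and flushed as one full, block-aligned write() instead of hundreds of tiny ones.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define GPFS_BLOCK        (512 * 1024)            /* assumed fs blocksize */
#define RECORD_SIZE       1024                    /* assumed record size  */
#define RECORDS_PER_BLOCK (GPFS_BLOCK / RECORD_SIZE)

/* Hypothetical stand-in for the application's record generator. */
static void fill_record(char *dst, long rec)
{
    memset(dst, 0, RECORD_SIZE);
    snprintf(dst, RECORD_SIZE, "record %ld", rec);
}

int main(void)
{
    int fd = open("/gpfs/scratch/records.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(GPFS_BLOCK);               /* one block of staging */
    long nrecords = 512 * RECORDS_PER_BLOCK;      /* multiple of a block  */

    for (long rec = 0; rec < nrecords; rec++) {
        fill_record(buf + (rec % RECORDS_PER_BLOCK) * RECORD_SIZE, rec);

        /* Flush once per full block: every write() is 512 KB and starts
         * on a 512 KB boundary, never a sub-64 KB request. */
        if ((rec + 1) % RECORDS_PER_BLOCK == 0) {
            if (write(fd, buf, GPFS_BLOCK) != GPFS_BLOCK) {
                perror("write");
                break;
            }
        }
    }

    free(buf);
    close(fd);
    return 0;
}
```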


Application Do’s and Don’ts #2

Avoid overlapped writes

[Diagram: an 8 GB file written by four nodes. Node 1 locks offsets 0 - 2 GB, node 2 locks offsets 2 - 4 GB, node 3 locks offsets 4 - 6.1 GB, node 4 locks offsets 6 - 8 GB. File offsets 0 - 6.0 GB and 6.1 - 8.0 GB are accessed in parallel with no conflict; the conflict over 6.0 - 6.1 GB is detected by the lock manager, and only one node may access that region at a time.]
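A sketch of one way to stay out of the conflict case above: each task derives a disjoint, block-aligned byte range from its task number and writes only inside it with pwrite(). The blocksize, file name, and task numbering are illustrative.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

#define GPFS_BLOCK (512LL * 1024)   /* assumed fs blocksize */

/* Task `rank` owns the byte range [rank * bytes_per_task,
 * (rank + 1) * bytes_per_task); bytes_per_task is a multiple of the
 * blocksize, so no two tasks ever contend for the same byte-range lock. */
static int write_my_slice(const char *path, int rank,
                          off_t bytes_per_task, const char *my_data)
{
    off_t start = (off_t)rank * bytes_per_task;

    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return -1; }

    for (off_t done = 0; done < bytes_per_task; done += GPFS_BLOCK) {
        off_t left = bytes_per_task - done;
        size_t n = (size_t)(left < GPFS_BLOCK ? left : GPFS_BLOCK);

        /* pwrite() targets only this task's own offsets, so the regions
         * written by different tasks never overlap. */
        if (pwrite(fd, my_data + done, n, start + done) < 0) {
            perror("pwrite");
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}

int main(void)
{
    /* Illustration: pretend to be task 2 of 4, each task owning 2 MB. */
    static char my_data[2 * 1024 * 1024];
    return write_my_slice("/gpfs/scratch/shared.dat", 2,
                          (off_t)sizeof(my_data), my_data) ? 1 : 0;
}
```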


Application Do’s and Don’ts #3

Partition data so as to make a given client’s accesses largely sequential (e.g. segmented data layouts are optimal; strided layouts require more token management).

[Diagram: segmented vs. strided data layouts across clients.]
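To make the two layouts above concrete, here is a small sketch of the file offset a client uses for its i-th record under each scheme; the record size and client count are illustrative.

```c
#include <stdio.h>
#include <sys/types.h>

/* Segmented layout: client `rank` owns one contiguous region of the file,
 * so its accesses are purely sequential and stay within one region's
 * byte-range tokens. */
static off_t segmented_offset(int rank, long i, long recs_per_client,
                              size_t rec_size)
{
    return ((off_t)rank * recs_per_client + i) * rec_size;
}

/* Strided layout: clients interleave record by record, so each client's
 * accesses hop across the whole file and the token manager must juggle
 * many small, interleaved ranges. */
static off_t strided_offset(int rank, long i, int nclients, size_t rec_size)
{
    return ((off_t)i * nclients + rank) * rec_size;
}

int main(void)
{
    const int nclients = 4;
    const long recs_per_client = 4;
    const size_t rec_size = 512 * 1024;   /* assume one block per record */

    for (long i = 0; i < recs_per_client; i++)
        printf("client 1, record %ld: segmented offset %lld, strided offset %lld\n",
               i,
               (long long)segmented_offset(1, i, recs_per_client, rec_size),
               (long long)strided_offset(1, i, nclients, rec_size));
    return 0;
}
```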


Application Do’s and Don’ts #4

Use access patterns that can exploit prefetching and write-behind (e.g. sequential access is good, random access is bad). Use multiple opens if necessary.

[Diagram: a compute node (app plus mmfsd, with Stripe Group Manager / Token Manager Server and Metanode roles) connected through the switch to three I/O nodes holding file blocks 1-6. GPFS stripes successive blocks of each file across successive disks; for instance, disks 1, 2, and 3 may be read concurrently if the access pattern is a match.]
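One way to act on the "multiple opens" suggestion, as a hedged POSIX sketch with made-up paths and region boundaries: a task that must walk two far-apart regions of the same file gives each region its own descriptor, so each descriptor sees a clean forward-sequential pattern that prefetch can recognize (64-bit file offsets are assumed).

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

#define CHUNK (512 * 1024)   /* read in block-sized pieces */

int main(void)
{
    /* Two opens of the same file: one descriptor per region, so each
     * descriptor advances strictly forward instead of one descriptor
     * seeking back and forth between the two regions. */
    int fd_lo = open("/gpfs/scratch/mesh.dat", O_RDONLY);
    int fd_hi = open("/gpfs/scratch/mesh.dat", O_RDONLY);
    if (fd_lo < 0 || fd_hi < 0) { perror("open"); return 1; }

    /* Region boundaries are illustrative. */
    lseek(fd_lo, (off_t)0, SEEK_SET);
    lseek(fd_hi, (off_t)6 * 1024 * 1024 * 1024, SEEK_SET);

    char *buf_lo = malloc(CHUNK);
    char *buf_hi = malloc(CHUNK);

    for (int i = 0; i < 1024; i++) {
        /* Each stream is sequential, so prefetch can run ahead of it. */
        if (read(fd_lo, buf_lo, CHUNK) <= 0) break;
        if (read(fd_hi, buf_hi, CHUNK) <= 0) break;
        /* ... process the two chunks ... */
    }

    free(buf_lo); free(buf_hi);
    close(fd_lo); close(fd_hi);
    return 0;
}
```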


Application Do’s and Don’ts #5

Use MPI-IO

– Collective operations

– Predefined Data Layouts for memory & file

– Opportunities for coalescing and improved prefetch/write-behind

– (See notes from Dick Treumann’s MPI-IO talk)
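As a companion to the list above, a minimal MPI-IO sketch (file name and sizes illustrative): every rank contributes one contiguous, non-overlapping slab, and the collective MPI_File_write_at_all() gives the library the chance to coalesce requests and drive prefetch/write-behind.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns one contiguous, non-overlapping slab of the file. */
    const int count = 65536;                       /* doubles per rank */
    double *slab = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        slab[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/field.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: the MPI-IO layer can coalesce the per-rank
     * requests instead of issuing many independent POSIX writes. */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, slab, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}
```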


Read & Write Performance (taken from the ASCI White machine)

IBM RS/6000 SP System
4 switch frames (1 in use at present)
4 Nighthawk-2 nodes per frame
16 375 MHz 64-bit IBM Power3 CPUs per node
Now: 120 nodes & 1920 CPUs; Later: 512 nodes & 8192 CPUs
Now: 8 GBytes RAM per node (0.96 TBytes total memory); Later: 266 nodes @ 16 GB, 246 @ 8 GB (6.2 TBytes total memory)
Now: peak ~2.9 TFLOPS; Later: peak ~12.3 TFLOPS
Now: 4 I/O nodes (dedicated); Later: 16 I/O nodes (dedicated)
4 RIO optical extenders per node (the machine requires too much floor space for copper SSA cabling)
4 SSA adapters per RIO
3 73-GByte RAID sets per SSA adapter
Now: 14.0 TB total disk space; Later: 150.0 TB total disk space
5 disks in each RAID set (4 + P)
– parity information is distributed among drives (i.e., not a dedicated parity drive)


Pre-upgrade White I/O diagram: 4*(16*(3*(4+1)))

[Diagram: 116 NH2 compute nodes and 4 NH2 I/O nodes (sixteen 375 MHz Power3 CPUs each) connected by the Colony switch (1 switch per frame) through Colony adapters. Each I/O node has four 3500 MB/sec RIOs; each RIO connects to 4 Santa Cruz adapters.]

Each disk is an IBM SSA disk: max 8.5 MB/sec, typical 5.5 MB/sec, total capacity 18.2 GB.
Each Santa Cruz adapter is capable of ~48 MB/sec writes and ~88 MB/sec reads.
Disks are configured into 4+P RAID sets (73 GB); each RAID set is capable of 14 MB/sec.
Colony adapter: point-to-point ~400 MB/sec unidirectional.
Switch comm is the server bottleneck: ~1600 MB/sec total.
Clients (VSDs) communicate with the LAPI protocol.
Speculation: switch comm is the client bottleneck: ~340 MB/sec/task.


Single Client Performance

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Bandwidth -vs- BufferSize, single client, 2048 GB file. Bandwidth (MB/sec, 0 to 400) vs. buffer size (1 KB to 4096 KB). Series: Create, Read.]


Performance Increase for 1x1 to 2x2

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Bandwidth -vs- BufferSize, single file, segmented layout, 512 MB/client. Bandwidth (MB/sec, 0 to 900) vs. buffer size (1 KB to 4096 KB). Series: 1x1 Create, 1x1 Read, 2x2 Create, 2x2 Read.]


Multi-Node Scalability

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: NClients -vs- Bandwidth, single file, segmented layout, 512 MB/client. Bandwidth (MB/sec, 0 to 1800) vs. number of client nodes (1 to 96). Series: 1 Create, 1 Read, 2 Create, 2 Read.]


On Node Scalability

Machine: white.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 4*(16*(3*(4+1))); CPU: 375 MHz P3; Node: NH-2 (16-way); Server cache: 128*512K; Client cache: 100 MB; Metadata cache: 2000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: NTasks -vs- Bandwidth, two NH2 nodes, single file, segmented layout, 512 MB/task. Bandwidth (MB/sec, 0 to 1600) vs. number of tasks (1 to 16). Series: Create, Read.]


Metadata Improvements #1: Batching on Removes

Segmented Block Allocation Map
– Each segment contains bits representing blocks on all disks; each segment is lockable
– Minimize contention for allocation map

Machine: snow.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 2*(3*(3*(4+1))); CPU: 222 MHz P3; Node: NH-1 (8-way); Server cache: 64*512K; Client cache: 20 MB; Metadata cache: 1000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Large File Removal Rate. MB/s (0 to 450000) vs. # clients (1, 2, 4, 8). Series: GPFS 1.3, GPFS 1.2, NFS, JFS.]


Metadata Improvements #2: Batching on Creates

Machine: snow.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 2*(3*(3*(4+1))); CPU: 222 MHz P3; Node: NH-1 (8-way); Server cache: 64*512K; Client cache: 20 MB; Metadata cache: 1000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Small File Creation Rate, 1000 29-byte files per client. Files/s (0 to 900) vs. # clients (1, 2, 4, 8). Series: GPFS 1.3, GPFS 1.2, NFS, JFS.]


Metadata Improvements #3: Batching on Dir Operations

Machine: snow.llnl.gov; Dedicated: yes; Date: Aug 2000; PSSP: 3.2 + PTF 1; AIX: 4.3.3 + PTF 16; GPFS: 1.3; I/O config: 2*(3*(3*(4+1))); CPU: 222 MHz P3; Node: NH-1 (8-way); Server cache: 64*512K; Client cache: 20 MB; Metadata cache: 1000 files; GPFS protocol: TCP; VSD protocol: K-LAPI; GPFS blocksize: 512K

[Chart: Directory Creation Rate, 1 directory of depth 1000 per client. Dirs/s (0 to 400) vs. # clients (1, 2, 4, 8). Series: GPFS 1.3, GPFS 1.2, NFS, JFS.]


“…and in conclusion…”

Further Info
– Redbooks at http://www.redbooks.ibm.com (2, soon to be 3)
– User guides at http://www.rs6000.ibm.com/resource/aix_resource/sp_books/gpfs/index.html
– Europar 2000 paper on MPI-IO
– Terry Jones, Alice Koniges, Kim Yates, “Performance of the IBM General Parallel File System”, Proc. International Parallel and Distributed Processing Symposium, May 2000.
– Message Passing Interface Forum, “MPI-2: A Message Passing Interface Standard”, Standards Document 2.0, University of Tennessee, Knoxville, July 1997.

Acknowledgements
– The metadata measurements (file creation & file deletion) are due to our summer intern, Bill Loewe.
– IBMers who reviewed this document for accuracy: Roger Haskin, Lyle Gayne, Bob Curran, Dan McNabb.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.