August 15, 2000
Terry Jones
Lawrence Livermore National Laboratory
Outline
Scope
– Presentation aimed at scientific/technical app writers
– Many interesting topics skipped: fs reliability, fs maintenance, ...
Brief Overview of GPFS Architecture
– Disk Drive Basics
– GPFS -vs- NFS -vs- JFS
What’s New In GPFS
– libgpfs
– Metadata improvements
– Performance oriented changes
Application Do’s and Don’ts
– 5 tips that will keep your codes humming
Recent Performance Measurements
– “You can’t put tailfins on data” -- H.B.
Disk Drive Basics
Disk speed is advancing, but slower than processor speed => disk parallelism

Name           | Year | Diameter (inches) | Capacity (Mbytes) | Speed (RPM) | Mean Seek Time (ms) | Density (Mbits/in2) | Interface Speed (Mbytes/sec) | CPU Speed (Intel’s Spec95)
IBM 3380       | 1981 | 14                | 645               | 3600        | 18.0                | 2.27                | 3.0                          | 0.06
Wren 7         | 1987 | 5.25              | 613               | 3600        | 16.5                | 15.1                | 5.0                          | 0.14
IBM 0661       | 1991 | 3.5               | 320               | 4320        | 12.6                | 19.4                | 10.0                         | 0.82
IBM DMDM-10340 | 1999 | 1.0               | 340               | 4500        | 15.0                | 242.0               | 5.2                          | 35.60

Disks are mechanical thingies => huge latencies
Performance is a function of buffer size => caching
File System Architectures
GPFS Parallel File System
• GPFS fs are striped across multiple disks on multiple I/O nodes
• All compute nodes can share all disks
• Larger files, larger file systems, higher bandwidths
[Diagram: compute nodes run the application plus mmfsd; I/O nodes run vsd behind the switch; mmfsd instances also host the Stripe Group Manager, Token Manager Server, and Metanode roles]

JFS (Native AIX fs)
• No file sharing - on-node only
• Apps must do own partitioning
[Diagram: clients attached to a single storage controller]

NFS / DFS / AFS
• Clients share files on server node
• Server node becomes bottleneck
[Diagram: clients reach one server node across the switch / an unsecure network]
New in Version 1.3
Multiple levels of indirect blocks => larger max file size, max number of files increases
SDR interaction improved
VSD KLAPI (flow control, one less copy, relieves usage of CSS send and receive pool)
Prefetch algorithms now recognize strided & reverse sequential access
Batching => much better metadata performance
New Programming Interfaces
Support for mmap functions as described in X/Open 4.2
libgpfs
– gpfs_prealloc() Preallocates space for a file that has already been opened, prior to writing data to the file (use it to ensure sufficient space).
– gpfs_stat() and gpfs_fstat() Provide exact mtime and ctime.
– gpfs_getacl(), gpfs_putacl(), gpfs_fgetattrs(), gpfs_fputattrs() ACLs and attributes for products like ADSM.
– gpfs_fcntl() Provides hints about access patterns. Used by MPI-IO. Can be used to set access range, clear cache, & invoke Data Shipping...
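A minimal sketch of two of these calls (the file name is illustrative; header, type, and constant names follow the GPFS programming reference and should be checked against the gpfs.h / gpfs_fcntl.h installed with your release):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <gpfs.h>        /* gpfs_prealloc(); link with -lgpfs      */
    #include <gpfs_fcntl.h>  /* gpfs_fcntl() hint structures           */

    int main(void)
    {
        struct {                       /* one header + one hint         */
            gpfsFcntlHeader_t hdr;
            gpfsAccessRange_t range;
        } hint;
        offset_t nbytes = (offset_t)1024 * 1024 * 1024;   /* 1 GB       */
        int fd;

        fd = open("/gpfs/scratch/out.dat", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve the space up front so later writes cannot fail for
           lack of space.                                               */
        if (gpfs_prealloc(fd, 0, nbytes) < 0)
            perror("gpfs_prealloc");

        /* Tell GPFS which byte range we are about to write so it can
           acquire tokens and buffers ahead of the actual I/O.          */
        hint.hdr.totalLength   = sizeof(hint);
        hint.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
        hint.hdr.fcntlReserved = 0;
        hint.range.structLen   = sizeof(hint.range);
        hint.range.structType  = GPFS_ACCESS_RANGE;
        hint.range.start       = 0;
        hint.range.length      = nbytes;
        hint.range.isWrite     = 1;
        if (gpfs_fcntl(fd, &hint) < 0)
            perror("gpfs_fcntl");

        /* ... block-aligned writes of the preallocated region here ... */
        return 0;
    }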
GPFS Data Shipping
Not to be confused with IBM’s MPI-IO data shipping
A mechanism for coalescing small I/Os on a single node for disk efficiency (no distributed locking overhead, no false sharing)
It’s targeted at users who do not use MPI
It violates POSIX semantics and affects other users of the same file
Operation:
– Data shipping entered via collective operation at file open time
– Normal access from other programs inhibited while file is in DS mode
– Each member of the collective assigned to serve a subset of blocks
– Reads and writes go through this server regardless of where they originate
– Data blocks only buffered at server, not at other clients
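The slide does not show the calling sequence; the following is a hedged sketch based on the gpfs_fcntl() data shipping hint as documented in the GPFS programming reference (gpfsDataShipStart_t / GPFS_DATA_SHIP_START). Verify the structure and constant names against the gpfs_fcntl.h shipped with your release:

    #include <gpfs_fcntl.h>

    /* Each of the ntasks cooperating processes opens the file and issues
       this hint; the operation is collective and completes once all
       numInstances participants have called it.                        */
    int enter_data_shipping(int fd, int ntasks)
    {
        struct {
            gpfsFcntlHeader_t   hdr;
            gpfsDataShipStart_t start;
        } ds;

        ds.hdr.totalLength    = sizeof(ds);
        ds.hdr.fcntlVersion   = GPFS_FCNTL_CURRENT_VERSION;
        ds.hdr.fcntlReserved  = 0;
        ds.start.structLen    = sizeof(ds.start);
        ds.start.structType   = GPFS_DATA_SHIP_START;
        ds.start.numInstances = ntasks;   /* size of the collective     */
        ds.start.reserved     = 0;

        return gpfs_fcntl(fd, &ds);
    }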
Application Do’s and Don’ts #1
Effect of Buffer Size

[Chart: Writer Performance; Bandwidth (MB/sec), 0 to 200, versus Buffer Size (KB) from 1 to 4096]

Avoid small (less than 64K) reads and writes
Use block-aligned records
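A sketch of tip #1: pack small records into a block-sized buffer and issue one large, aligned write per block instead of one small write per record. The helper write_records() and the 512K block size are illustrative (512K matches the GPFS block size used in the measurements later in this talk):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define GPFS_BLOCK (512 * 1024)   /* match the file system block size */

    /* Write nrec copies of a small record by packing them into one
       block-sized buffer and issuing block-aligned write()s.            */
    int write_records(int fd, const char *rec, size_t reclen, size_t nrec)
    {
        char  *buf = malloc(GPFS_BLOCK);
        size_t used = 0, i, off, n;

        if (buf == NULL)
            return -1;
        for (i = 0; i < nrec; i++) {
            for (off = 0; off < reclen; off += n) {
                n = reclen - off;
                if (n > GPFS_BLOCK - used)
                    n = GPFS_BLOCK - used;
                memcpy(buf + used, rec + off, n);
                used += n;
                if (used == GPFS_BLOCK) {          /* full, aligned block */
                    if (write(fd, buf, GPFS_BLOCK) != GPFS_BLOCK) {
                        free(buf);
                        return -1;
                    }
                    used = 0;
                }
            }
        }
        if (used > 0 && write(fd, buf, used) != (ssize_t)used) {
            free(buf);                             /* final partial block */
            return -1;
        }
        free(buf);
        return 0;
    }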
Application Do’s and Don’ts #2
Avoid overlapped writes
Example: four nodes writing one 8 GB file
– Node 1 locks offsets 0 - 2 GB, node 2 locks offsets 2 - 4 GB, node 3 locks offsets 4 - 6.1 GB, node 4 locks offsets 6 - 8 GB
– File offsets 0 - 6.0 GB and 6.1 - 8.0 GB are accessed in parallel with no conflict
– The conflict for 6.0 - 6.1 GB is detected by the lock manager; only one node may access this region at a time
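A sketch of tip #2: carve the file into disjoint, block-aligned regions so no two tasks ever write the same byte range. The helper write_my_region() and the region arithmetic are illustrative; pwrite() is the standard positioned write:

    #include <unistd.h>
    #include <sys/types.h>

    #define GPFS_BLOCK ((off_t)(512 * 1024))

    /* Task `task` of `ntasks` writes only its own contiguous region of
       the shared file; regions start on block boundaries and never
       overlap, so the lock manager never has to serialize two writers. */
    int write_my_region(int fd, int task, int ntasks,
                        const char *data, off_t total_bytes)
    {
        off_t nblocks  = (total_bytes + GPFS_BLOCK - 1) / GPFS_BLOCK;
        off_t per_task = ((nblocks + ntasks - 1) / ntasks) * GPFS_BLOCK;
        off_t start    = (off_t)task * per_task;
        off_t end      = start + per_task;
        off_t pos, n;

        if (end > total_bytes)
            end = total_bytes;
        for (pos = start; pos < end; pos += GPFS_BLOCK) {
            n = end - pos;
            if (n > GPFS_BLOCK)
                n = GPFS_BLOCK;
            if (pwrite(fd, data + (pos - start), (size_t)n, pos) != (ssize_t)n)
                return -1;
        }
        return 0;
    }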
Application Do’s and Don’ts #3
Partition data so as to make a given client’s accesses largely sequential (e.g. segmented data layouts are optimal, strided requires more token management). See the offset sketch below.

[Figure: segmented layout (each client accesses one contiguous region of the file) versus strided layout (clients interleave their records throughout the file)]
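A small sketch contrasting the two layouts (the helper names are illustrative): with a segmented layout each task’s records stay inside one contiguous region, while a strided layout scatters every task across the whole file and therefore needs more token traffic:

    #include <sys/types.h>

    /* Segmented: task t owns the contiguous region [t*seg, (t+1)*seg);
       record i of task t goes at t*seg + i*reclen.                     */
    off_t segmented_offset(int task, off_t seg, off_t i, off_t reclen)
    {
        return (off_t)task * seg + i * reclen;
    }

    /* Strided: records of all tasks interleave; record i of task t goes
       at (i*ntasks + t)*reclen, so every task touches every stripe.    */
    off_t strided_offset(int task, int ntasks, off_t i, off_t reclen)
    {
        return (i * ntasks + task) * reclen;
    }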
Application Do’s and Don’ts #4
Use access patterns that can exploit prefetching and write-behind (e.g. sequential access is good, random access is bad). Use multiple opens if necessary (see the sketch after the diagram).

[Diagram: a compute node (app + mmfsd) and three I/O nodes connected by the switch; successive file blocks are striped round-robin so one I/O node holds blocks 1 and 4, the next holds 2 and 5, the next holds 3 and 6. The Stripe Group Manager, Token Manager Server, and Metanode roles run in mmfsd.]

GPFS stripes successive blocks of each file across successive disks. For instance, disks 1, 2, 3 may be read concurrently if the access pattern is a match.
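A sketch of the “multiple opens” tip, assuming a segmented layout: each process opens the file with its own descriptor and reads its region front to back in block-sized chunks, a pattern the prefetcher recognizes. The helper read_my_segment() and the sizes are illustrative:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define GPFS_BLOCK (512 * 1024)

    /* Each task calls this with its own (start, bytes) segment.        */
    int read_my_segment(const char *path, off_t start, off_t bytes, char *dest)
    {
        int     fd = open(path, O_RDONLY);   /* private fd, private offset */
        off_t   done = 0;
        ssize_t n;
        size_t  want;

        if (fd < 0)
            return -1;
        if (lseek(fd, start, SEEK_SET) < 0) {
            close(fd);
            return -1;
        }
        while (done < bytes) {
            want = (bytes - done > GPFS_BLOCK) ? GPFS_BLOCK
                                               : (size_t)(bytes - done);
            n = read(fd, dest + done, want);
            if (n <= 0) {
                close(fd);
                return -1;
            }
            done += n;
        }
        close(fd);
        return 0;
    }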
Application Do’s and Don’ts #5
Use MPI-IO (a minimal sketch follows this list)
– Collective operations
– Predefined Data Layouts for memory & file
– Opportunities for coalescing and improved prefetch/write-behind
– (See notes from Dick Treumann’s MPI-IO talk)
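A minimal MPI-IO sketch of a collective write to one shared file with a segmented layout; the file name and the helper write_shared() are illustrative, and the calls are standard MPI-2:

    #include <mpi.h>

    /* Every rank contributes `count` doubles; the collective write lets
       the MPI-IO layer coalesce requests and drive GPFS with large,
       well-formed I/O.                                                  */
    void write_shared(double *buf, int count)
    {
        MPI_File   fh;
        MPI_Offset off;
        int        rank;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/result.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Segmented layout: rank r writes the r-th contiguous chunk.    */
        off = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }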
Read & Write Performance taken from the ASCI White machine

IBM RS/6000 SP System
– 4 switch frames (1 in use at present)
– 4 Nighthawk-2 nodes per frame
– 16 375MHz 64-bit IBM Power3 CPUs per node
– Now: 120 nodes & 1920 CPUs; Later: 512 nodes & 8192 CPUs
– Now: 8 GBytes RAM per node (0.96 TBytes total memory); Later: 266 @ 16 GB, 246 @ 8 GB (6.2 TBytes total memory)
– Now: Peak ~2.9 TFLOPS; Later: Peak ~12.3 TFLOPS
– Now: 4 I/O nodes (dedicated); Later: 16 I/O nodes (dedicated)
– 4 RIO optical extenders per node (machine requires too much floor space for copper SSA cabling)
– 4 SSA adapters per RIO
– 3 73-GByte RAID sets per SSA adapter
– Now: 14.0 TB total disk space; Later: 150.0 TB total disk space
– 5 disks on each RAID set (4 + P); parity information is distributed among drives (i.e., not a dedicated parity drive)
Pre-upgrade White I/O Diagram: 4*(16*(3*(4+1)))

– 116 NH2 compute nodes (sixteen 375 Mhz Power 3 CPUs each) and 4 NH2 I/O nodes, all attached to the Colony switch (1 switch/frame) through Colony adapters
– Each I/O node has four 3500 MB/sec RIOs; each RIO connects to 4 Santa Cruz adapters
– Each Santa Cruz adapter is capable of ~48 MB/sec writes, ~88 MB/sec reads
– Each disk is an IBM SSA disk, max 8.5 MB/sec, typical 5.5 MB/sec, total capacity 18.2 GB
– Disks are configured into 4+P RAID sets (73 GB); each RAID set is capable of 14 MB/sec
– Colony adapter, point-to-point: ~400 MB/sec unidirectional
– Switch comm is the server bottleneck: ~1600 MB/sec total
– Clients (VSDs) communicate with the LAPI protocol
– Speculation: switch comm is the client bottleneck: ~340 MB/sec/task
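Reading the I/O configuration notation above: 4 I/O nodes × (4 RIOs × 4 Santa Cruz adapters = 16 adapters per node) × 3 RAID sets per adapter × (4+1) disks per RAID set gives 192 RAID sets and 960 disks; 192 × 73 GB is roughly 14 TB, matching the 14.0 TB total disk space quoted earlier.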
Single Client Performance
Machine: white.llnl.gov  Dedicated: yes  Date: Aug, 2000
PSSP: 3.2 + PTF 1  AIX: 4.3.3 + PTF 16  GPFS: 1.3
I/O Config: 4*(16*(3*(4+1)))  CPU: 375Mhz P3  Node: NH-2 (16-way)
ServerCache: 128*512K  ClientCache: 100 MB  MetadataCache: 2000 files
GPFSprotocol: tcp  VSDprotocol: K-LAPI  GPFSblocksize: 512K

[Chart: Bandwidth -vs- BufferSize, single client, 2048 GB file; Bandwidth (MB/sec), 0 to 400, versus BufferSize (KB) from 1 to 4096; series: Create, Read]
Performance Increase for 1x1 to 2x2
(Configuration: same white.llnl.gov setup as above.)

[Chart: Bandwidth -vs- BufferSize, single file, segmented layout, 512 MB/client; Bandwidth (MB/sec), 0 to 900, versus BufferSize (KB) from 1 to 4096; series: 1x1 Create, 1x1 Read, 2x2 Create, 2x2 Read]
Multi-Node Scalability
(Configuration: same white.llnl.gov setup as above.)

[Chart: NClients -vs- Bandwidth, single file, segmented layout, 512 MB/client; Bandwidth (MB/sec), 0 to 1800, versus number of client nodes from 1 to 96; series: 1 Create, 1 Read, 2 Create, 2 Read]
On Node Scalability
(Configuration: same white.llnl.gov setup as above.)

[Chart: NTasks -vs- Bandwidth, two NH2 nodes, single file, segmented layout, 512 MB/task; Bandwidth (MB/sec), 0 to 1600, versus number of tasks from 1 to 16; series: Create, Read]
Metadata Improvements #1: Batching on Removes

Segmented block allocation map
– Each segment contains bits representing blocks on all disks; each segment is lockable
– Minimizes contention for the allocation map

Machine: snow.llnl.gov  Dedicated: yes  Date: Aug, 2000
PSSP: 3.2 + PTF 1  AIX: 4.3.3 + PTF 16  GPFS: 1.3
I/O Config: 2*(3*(3*(4+1)))  CPU: 222Mhz P3  Node: NH-1 (8-way)
ServerCache: 64*512K  ClientCache: 20 MB  MetadataCache: 1000 files
GPFSprotocol: tcp  VSDprotocol: K-LAPI  GPFSblocksize: 512K

[Chart: Large File Removal Rate; MB/s (0 to 450000) versus number of clients (1, 2, 4, 8); series: GPFS 1.3, GPFS 1.2, NFS, JFS]
Metadata Improvements #2: Batching on Creates

(Configuration: same snow.llnl.gov setup as above.)

[Chart: Small File Creation Rate, 1000 29-byte files per client; Files/s (0 to 900) versus number of clients (1, 2, 4, 8); series: GPFS 1.3, GPFS 1.2, NFS, JFS]
Metadata Improvements #3: Batching on Dir Operations

(Configuration: same snow.llnl.gov setup as above.)

[Chart: Directory Creation Rate, 1 directory of depth 1000 per client; Dir/s (0 to 400) versus number of clients (1, 2, 4, 8); series: GPFS 1.3, GPFS 1.2, NFS, JFS]
“…and in conclusion…”

Further Info
– Redbooks at http://www.redbooks.ibm.com (2, soon to be 3)
– User guides at http://www.rs6000.ibm.com/resource/aix_resource/sp_books/gpfs/index.html
– Europar 2000 paper on MPI-IO
– Terry Jones, Alice Koniges, Kim Yates, “Performance of the IBM General Parallel File System”, Proc. International Parallel and Distributed Processing Symposium, May 2000.
– Message Passing Interface Forum, “MPI-2: A Message Passing Interface Standard”, Standards Document 2.0, University of Tennessee, Knoxville, July 1997.

Acknowledgements
– The metadata measurements (file creation & file deletion) are due to our summer intern, Bill Loewe
– IBMers who reviewed this document for accuracy: Roger Haskin, Lyle Gayne, Bob Curran, Dan McNabb

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.