Collective Buffering: Improving Parallel I/O Performance
By Bill Nitzberg and Virginia Lo

  • Collective Buffering: Improving Parallel I/O Performance, by Bill Nitzberg and Virginia Lo

  • Outline
    - Introduction
    - Concepts
    - Collective parallel I/O algorithms
    - Collective buffering experiments
    - Conclusion
    - Questions

  • Introduction
    - Existing parallel I/O systems evolved directly from the I/O systems of serial machines.
    - Serial I/O systems are heavily tuned for sequential, large accesses, limited file sharing between processes, and a high degree of both spatial and temporal locality.

  • Introduction (cont.)
    - This paper presents a set of algorithms known as collective buffering algorithms.
    - These algorithms seek to improve I/O performance on distributed-memory machines by exploiting global knowledge of the I/O operations.

  • Concepts
    - Global data structure: the logical view of the data from the application's point of view.
    - Scientific applications generally use global data structures consisting of arrays distributed in one, two, or three dimensions.

  • Concepts (cont.)
    - Data distribution: the global data structure is distributed among node memories by cutting it into data chunks.
    - The HPF BLOCK distribution partitions the global data structure into P equally sized contiguous pieces.
    - The HPF CYCLIC distribution divides the global data structure into small pieces (of a specified block size) and deals these pieces out to the P nodes in round-robin fashion.
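To make the two distributions concrete, here is a minimal Python sketch (not from the paper) of how BLOCK and CYCLIC(b) map a one-dimensional index space onto P nodes; the names `block_owner` and `cyclic_owner` are illustrative, not part of HPF or the paper.

```python
# Minimal sketch (assumption, not from the paper): mapping a 1-D global
# index space of N elements onto P nodes with HPF-style BLOCK and CYCLIC(b).

def block_owner(i, n, p):
    """BLOCK: cut the N elements into P equal contiguous pieces."""
    block = (n + p - 1) // p          # ceiling(N / P) elements per node
    return i // block

def cyclic_owner(i, b, p):
    """CYCLIC(b): deal out pieces of b elements round-robin over P nodes."""
    return (i // b) % p

if __name__ == "__main__":
    N, P, B = 16, 4, 2
    print([block_owner(i, N, P) for i in range(N)])
    # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
    print([cyclic_owner(i, B, P) for i in range(N)])
    # [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```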

  • Concepts (cont.)
    - File layout is another form of data distribution.
    - The file represents a linearization of the global data structure, such as the row-major ordering of a three-dimensional array; this linearization is called the canonical file.
    - The canonical file is distributed among the I/O nodes.
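As an illustration of the canonical file, the sketch below (my own, not from the paper) computes the byte offset of an element of a three-dimensional array under row-major linearization; `canonical_offset` is a hypothetical helper name.

```python
# Minimal sketch (assumption, not from the paper): the canonical file is the
# row-major linearization of the global array, so element (x, y, z) of an
# Nx x Ny x Nz array of `itemsize`-byte elements lands at this byte offset.

def canonical_offset(x, y, z, ny, nz, itemsize=8):
    """Byte offset of element (x, y, z) in the row-major canonical file."""
    return ((x * ny + y) * nz + z) * itemsize

# e.g. for a 4 x 4 x 4 array of 8-byte doubles:
assert canonical_offset(0, 0, 0, 4, 4) == 0
assert canonical_offset(0, 0, 1, 4, 4) == 8          # z varies fastest
assert canonical_offset(1, 0, 0, 4, 4) == 4 * 4 * 8  # one full x-plane later
```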

  • Collective parallel I/O algorithms: the naive algorithm
    - The naive algorithm treats parallel I/O the same as workstation I/O.
    - The order of writes depends on the data layout in node memory, which has no relation to the layout of data on the disks.
    - The unit of data transferred in each I/O operation is the data block, the smallest unit of local data that is contiguous with respect to the canonical file.

  • Collective parallel I/O algorithms (cont.): the naive algorithm (cont.)
    - The size of a data block is very small and is unrelated to the size of a file block, because of the disparity between the data distribution and the file layout parameters.
    - The overall effects are: the network is flooded with many small messages, and messages arrive at the I/O nodes in an uncoordinated fashion, resulting in highly inefficient disk writes.
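A rough sketch (not from the paper) of why the naive algorithm generates so many small requests: for a column-BLOCK distribution of a row-major two-dimensional array, each node's data blocks are single row segments. The function name `naive_requests` and the example sizes are assumptions for illustration.

```python
# Minimal sketch (assumption, not from the paper): under the naive algorithm
# each node writes every data block -- the unit of its local data that is
# contiguous in the canonical file -- as a separate I/O request.  For a
# column-BLOCK distribution of an Nx x Ny row-major array, that unit is a
# single row segment of Ny/P elements, so the request count explodes.

def naive_requests(nx, ny, p, itemsize=8):
    """(request_count, request_size_bytes) per node for a (*, BLOCK) layout."""
    cols_per_node = ny // p                   # width of each node's slab
    data_block = cols_per_node * itemsize     # contiguous bytes in the file
    return nx, data_block                     # one request per global row

# 1024 x 1024 doubles on 64 nodes: 1024 requests of only 128 B each per node,
# arriving at the I/O nodes in an uncoordinated order.
print(naive_requests(1024, 1024, 64))         # (1024, 128)
```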

  • Collective parallel I/O algorithms (cont.): collective buffering
    - Collective buffering rearranges the data on the compute nodes, prior to issuing any I/O operations, to minimize the number of disk operations.
    - The permutation can be performed in place, with the compute nodes transposing data among themselves, or on auxiliary nodes, with the compute nodes sending the data to a set of auxiliary buffering nodes.
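The following is a minimal sketch of the two-phase idea behind collective buffering, written as plain Python that simulates the permutation in memory rather than with message passing; `collective_buffer_write` and the row-BLOCK intermediate distribution are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch (assumption, not from the paper): collective buffering as a
# two-phase write for an Nx x Ny row-major array.  Phase 1 permutes the data
# so each node buffers one contiguous row-BLOCK slab of the canonical file;
# phase 2 issues a single large write per node instead of many tiny ones.

def collective_buffer_write(global_array, p, write):
    nx = len(global_array)
    rows_per_node = nx // p                       # assume P divides Nx

    # Phase 1 (permute): regroup rows into each node's target slab.  On a
    # real machine this is message passing among compute nodes; here the
    # "distributed" array is simulated as a single in-memory list of rows.
    slabs = [global_array[n * rows_per_node:(n + 1) * rows_per_node]
             for n in range(p)]

    # Phase 2 (I/O): one contiguous write per node instead of Nx tiny ones.
    for node, slab in enumerate(slabs):
        offset = node * rows_per_node * len(global_array[0])
        write(offset, [x for row in slab for x in row])

if __name__ == "__main__":
    data = [[r * 8 + c for c in range(8)] for r in range(8)]
    collective_buffer_write(data, p=4,
                            write=lambda off, buf: print(off, len(buf)))
    # prints: 0 16 / 16 16 / 32 16 / 48 16  (four large, ordered writes)
```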

  • Collective parallel I/O algorithms (cont.)
    - Four collective buffering techniques are developed and evaluated:
      1. All compute nodes are used to permute the data to a simple HPF BLOCK intermediate distribution in a single step.
      2. The first technique is refined by realistically limiting the amount of buffer space and using an intermediate distribution that matches the file layout.

  • Collective parallel I/O algorithms (cont.)
    - Four techniques (cont.):
      3. The third technique uses an HPF CYCLIC intermediate distribution.
      4. The fourth technique uses scatter/gather hardware to eliminate the latency-dominated overhead of the permutation phase.
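As a sketch of how a limited per-node buffer (as in technique 2) turns the permutation and write into coordinated rounds, the generator below and its parameters are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (assumption, not from the paper): with buffer space capped
# at `buffer_bytes` per buffering node, the canonical file is written in
# rounds; each round fills the buffers with the next contiguous span of the
# file and flushes them before the next round begins.

def buffered_rounds(file_bytes, n_buffer_nodes, buffer_bytes):
    """Yield (node, offset, length) write requests, round by round."""
    round_span = n_buffer_nodes * buffer_bytes
    offset = 0
    while offset < file_bytes:
        for node in range(n_buffer_nodes):
            start = offset + node * buffer_bytes
            if start >= file_bytes:
                break
            yield node, start, min(buffer_bytes, file_bytes - start)
        offset += round_span

# e.g. a 16 MB canonical file, 4 buffering nodes, 1 MB of buffer each:
MB = 1 << 20
reqs = list(buffered_rounds(16 * MB, 4, MB))
print(len(reqs), reqs[0])   # 16 (0, 0, 1048576): 16 x 1 MB writes in 4 rounds
```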

  • Collective buffering experiments
    - Experimental systems: the Paragon consists of 224 processing nodes connected in a 16x32 mesh. Applications space-share 208 compute nodes with 32 MB of memory each. There are nine I/O nodes, each with one SCSI-1 RAID-3 disk array consisting of five disks of 2 gigabytes each.
    - The parallel file system, PFS, is configured to use six of the nine I/O nodes.

  • Collective buffering experiments (cont.)
    - Experimental systems: the SP2 consists of 160 nodes; each node is an IBM RS6000/590 with 128 MB of memory and a SCSI-1-attached 2 GB disk.
    - The parallel file system, the IBM AIX Parallel I/O File System (PIOFS), is configured with 8 I/O nodes (semi-dedicated servers) and 150 compute nodes.

  • Collective buffering experiments (cont.): performance results (presented as figures in the original slides)

  • Conclusion
    - Collective buffering significantly improves naive parallel I/O performance, by two orders of magnitude for small data block sizes.
    - Peak performance can be obtained with minimal buffer space (approximately 1 megabyte per I/O node).
    - Performance is dependent on the intermediate distribution (by up to a factor of 2).

  • Conclusion (cont.)
    - There is no single intermediate distribution that provides the best performance in all cases, but a few come close.
    - Collective buffering with scatter/gather can potentially deliver peak performance for all data block sizes.

  • Questions
    - What are the advantages and disadvantages of the naive algorithm?
    - What is collective buffering, and how can this technique improve parallel I/O performance?