January 19, 2005 Page 2| |
Module Objectives
After you complete this module, you will be able to:
– Profile I/O from an application
– Use the lsof command to see what files are open
– Use strace to determine I/O characteristics
– List common I/O system calls
– Determine what types of I/O an application uses
– Explain the advantages of direct I/O or buffered I/O
– Determine default library buffer sizes
– Use FFIO to modify application I/O
– Use MPI-I/O essentials to modify application I/O
Characterize Application I/O
– Read vs write ratio
– Transfer size
– Positioning: sequential vs random
– Buffered vs direct
– sync, async
– Formatted vs unformatted
– Memory mapped mmap(2)
– read-write-write
– Bandwidth vs IOPs vs metadata
– fsync
– Parallel I/O
lsof Report
linux% /usr/sbin/lsof | fgrep mpi_IO
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
.
.
mpi_IO 18066 reiner 0u CHR 136,0 2 /dev/pts/0
mpi_IO 18066 reiner 1w FIFO 0,9 30140 pipe
mpi_IO 18066 reiner 2w FIFO 0,9 30141 pipe
mpi_IO 18066 reiner 3u CHR 136,0 2 /dev/pts/0
mpi_IO 18066 reiner 4u CHR 136,0 2 /dev/pts/0
mpi_IO 18066 reiner 5w REG 8,19 0 436941586 /tmp/.arraysvcs/errlog0a900000425d04b3
mpi_IO 18066 reiner 6u CHR 10,59 134751937 /dev/xpmem
mpi_IO 18066 reiner 7u IPv4 30107 TCP dcm27.munich.sgi.com:32794->dcm27.munich.sgi.com:32792 (ESTABLISHED)
mpi_IO 18066 reiner 8u REG 252,0 17179869184 268438051 /mnt/fcscratch/reiner/matrix_8.dat
mpi_IO 18066 reiner 9u REG 252,0 17179869184 268438051 /mnt/fcscratch
.
.
.
I/O System Calls
– open    Open a file and return the file descriptor
– read    Read n bytes from a file into user memory
– write   Write n bytes to a file from user memory
– lseek   Position the file offset pointer n bytes into a file
– pread   Read n bytes from a file at a given offset, without moving the file offset
– pwrite  Write n bytes to a file at a given offset, without moving the file offset
– readv   Read n bytes from a file into a vector of user buffers
– writev  Write n bytes to a file from a vector of user buffers
– close   Close the file and release the file descriptor
– fcntl   Control file and file descriptor attributes
– ioctl   Device-specific control operations
– mmap    Map a file into the process address space
I/O Hierarchy
[Figure: I/O hierarchy diagram. Labels: a.out (Text, Stack, Library, Library buffer, Data array), biosize, User space, Kernel space, System Calls, Filesystem Buffer Cache, Disk, Disk cache]
Buffered I/O
– Default C I/O library
– Goes through file system buffer cache
  • Read ahead could be bad for small random I/O
– Slowest bandwidth I/O, good for latency of small IOPs
  • bcopy is a bandwidth-restrictive routine
  • bcopy is used for memory-to-memory transfers
– Delayed write: data is written to disk by bdflush, pagebufd, xfsbuf
  • Lazy writes: the kernel chunks data to be ready for disk
  • Kernel is charged for the I/O, program could have exited
  • Sectorizes data presented to disk
Direct I/O
– Direct I/O is better bandwidth than buffered I/O
– Bypasses the kernel filesystem buffer cache
  • Still uses filesystem control meta-data
– Direct DMA access to/from disks from/to user space
– Very low CPU utilization, high bandwidth
  • Often combined with asynchronous I/O
– XFS filesystems only
  • open (path, O_DIRECT)
  • ioctl checks instead of IRIX's way with an fcntl call
    » ioctl(fd, XFS_IOC_DIOINFO, &dioinfo)
– User alignment requirements
  • Filesystem block size (mkfs -b) whole multiple
Direct I/O Alignment
Sequential I/O
– Uses C stdio buffering, 16KB library buffer by default
  • Changeable with library call setbuf(3)
– Each write has its length stored before and after the data written
– Buffer characteristics same as those with stdio for C
– One user library buffer per logical unit
  • Unless readv or writev is used, which take a list of multiple buffers
Direct Access Random I/O
• Record referencing WRITE(rec=,unit)
• Doesn't use C stdio
• Random I/O with lseeks to a position in a file
  – Could be sequential rewinds or appends, watch the position
  – Located in memory of CPU that created it
Formatted Vs Unformatted
• Formatted
  – Human readable
  – Requires additional CPU time
  – Goes through file system buffer cache to sectorize
• Unformatted
  – Less CPU overhead
  – Post processed
C Unformatted I/O
• Default user library buffer is 16KB
  – Can be changed with setbuf(3)
• Writes store up to 16KB before system call to write
• Read system call size is 16KB
FORTRAN Unformatted I/O
• Direct access and sequential behave differently
  – open(form='unformatted',access='direct') uses Fortran buffering
  – open(form='unformatted') uses std C I/O
• Not OpenMP safe
  – Places lock on logical unit
  – Multiple threads can write to multiple files connected to different logical units in parallel
  – Multiple threads can safely use the same logical unit
  – Multiple threads cannot safely write to different logical units associated with the same file
Synchronous I/O
• read(2), write(2) are synchronous calls
  – Process goes to sleep until I/O is done
  – Good if there is nothing to do until the I/O is done
  – Writes with delayed-write buffering are in effect asynchronous
Asynchronous I/O
• POSIX 1003.1b
• Database (DBM) improves transaction performance
• FFIO also uses async I/O
• See info libc -> low level I/O -> Asynchronous
• Library support
  – aio_read    Asynchronous I/O reads
  – aio_write   Asynchronous I/O writes
  – lio_listio  Queue arbitrary list of I/O requests
• XFS kernel support with SGI 2.6 kernel
Foreign Dataset Conversion
• IRIX is big endian
• Linux/Intel is little endian
• Foreign datasets, see ifort man page
• Intel 8.0 compilers have a -convert option
  – big_endian
  – little_endian
  – cray
  – VAX: fdx, fgx, vaxd, vaxg
  – ibm
Flexible File I/O Layering
• SGI ProPack contains the FFIO package
• Allows I/O attribute modification from command line
• Carried over from Cray UNICOS and IRIX
• C programs must use ffopen, ffread, ffwrite to recognize assign attributes
• See Application Programmer I/O Guide for details
• NO INTEL FORTRAN I/O LIBRARY SUPPORT
  – However, you can use the --wrap loader option to override references to open, read, write, lseek
• NO FOREIGN DATASET CONVERSION
C FFIO Example
#include "ffio.h"
#include <fcntl.h>
#include <unistd.h>

/* Redirect the POSIX calls to their FFIO counterparts. No space may
   separate a macro name from its parameter list. */
#define open(n,o,p)  ffopen(n,o,p)
#define close(f)     ffclose(f)
#define read(f,b,l)  ffread(f,b,l)
#define write(f,b,l) ffwrite(f,b,l)
#define lseek(f,o,w) ffseek(f,o,w)

int main(void)
{
    int fd, ret;
    char *data_ptr = "abcd";

    fd  = open("file.dat", O_RDWR|O_CREAT, 0640);
    ret = lseek(fd, 3001, SEEK_SET);   /* position 3001 bytes into the file */
    ret = write(fd, data_ptr, 4);      /* write the 4 data bytes there */
    close(fd);
    return 0;
}
FF_IO Environment Variables
• FF_IO_LOGFILE            Profile statistics
• FF_IO_OPEN_DIAGS         Open diagnostics
• FF_IO_OPTS               Verbose configure FFIO
• FF_IO_DEFAULTS           Short configure FFIO
• FF_IO_AIO_THREADS        Num of IO threads
• FF_IO_AIO_LOCKS          Num of locks
• FF_IO_AIO_NUMUSERS       Num of users
• FF_IO_TRACE_FILE         Trace file
• FF_IO_RECOVER_CMD        I/O recovery
• FF_IO_FILESIZEESTIMATE   Preallocate (Deferred)
eie Layer
• direct                Layer style
• diag | nodiag         Diagnostics
• wb | nowb | hldwb     Write behind, hold WB
• save | scr            Save or remove file
• rls | norls           Release or norelease FD
• bpons | nobpons       Bypass
• pagesize              Page size
• numpages              Num of pages
• max_lead              Pages read-ahead
• share                 Private or shared
• stride                Page stride (1 is default)
• alloc                 Prealloc (Deferred)
event Layer
• trace | notrace                    Trace I/O
• rtc | cpc                          RTC or CPU clock
• diag | nodiag | brief | summary    Report verbosity
FFIO Compile Example
Old approach:
• Fortran
  ifort -Wl,-u _ffopen -Wl,--wrap open -Wl,--wrap open64 \
        -Wl,--wrap lseek64 -Wl,--wrap lseek -Wl,--wrap read \
        -Wl,--wrap write -Wl,--wrap close test.f \
        ./libeag_ffio.a ./libffio.a -lrt
• Read man page ld(1) with respect to --wrap
• C/C++
  icc -D_LITTLE_ENDIAN -g \
      -o nastbio -Wl,-u _ffopen \
      -leag_ffio -lffio -lrt nastbio.c
New approach:
• Fortran
  ifort test.f
• C/C++
  icc -D_LITTLE_ENDIAN -g nastbio.c
setenv LD_PRELOAD /usr/local/lib/libFFIO.so
./a.out
.....
FFIO Examples
limit stacksize 655360
setenv FF_IO_OPTS '*.SCR*(eie.direct.diag.mbytes:1024:256:2:1:1:0,event.summary.mbytes.notrace)'
nast2001 jid=test mem=200m ...
export FF_IO_OPTS='cachea.mem:256:32:2:1,event.summary'
nastbio test.bio
FFIO eie Layer Numerics
setenv FF_IO_OPTS '*.SCR*(eie.direct:1024:256:2:1:1:0)'
page_size: unit is 4k pages
num_page: number of cache pages
max_lead: number of pages read ahead
share: 0: cache is private,
       1: shared by a couple of files
stride: page stride (1 is default)
alloc: cache requests alloc pages from the kernel if writes from the cache extend the file (on Altix deferred)
Event Layer Report
event_close(SCRATCH16698 ) eie <-->syscall ( 39 mbytes)/( 0.12 s)= 315.29 mbytes/s
oflags=0x0000000000004242=RDWR+CREAT+TRUNC+DIRECT
sector size =4096(bytes)
cblks =0 cbits =0x0000000000000000
current file size =21 mbytes high water file size =21 mbytes

function   times  wall  all     mbytes     mbytes     min      max      avg      ill
           called time  hidden  requested  delivered  request  request  request  formed
open       1      0.00
seek       5      0.00
writea     8      0.00  0       39         39         1        15       5        0
fcntl
recall
  writea   8      0.12
other      6      0.00
flush      1      0.00
close      1      0.00
extends    4
FFIO Performance Example
• NASTRAN jobs under tight memory conditions
  – 4 serial jobs on 4 CPUs and 64 GB memory. Each job needs ~16 GB.
• FFIO
  job1:13615.13user 290.99system 5:24:32elapsed 71%CPU
  job2:13576.73user 301.75system 5:14:24elapsed 73%CPU
  job3:13576.88user 214.44system 4:59:04elapsed 76%CPU
  job4:13562.33user 215.55system 5:00:03elapsed 76%CPU
• Linux I/O Buffer Cache
  job1:10658.47user 4699.10system 7:11:44elapsed 59%CPU
  job2:10519.47user 4728.90system 7:05:25elapsed 59%CPU
  job3:10460.77user 4303.55system 6:45:08elapsed 60%CPU
  job4:10460.63user 4317.87system 6:45:38elapsed 60%CPU
FFIO: When Should I Use It?
• You know the data access patterns of your application
• You are running your application under high memory pressure
  – Memory pages of the buffer cache are recycled for the memory demands of the application
• Data sets exceed the memory available in your current working set
• Libraries like libnetcdf can cause ill-conditioned access patterns
  – Slicing and concatenating NetCDF files are performance hogs
  – NCO tools can be significantly enhanced using the cache layer of FFIO
MPI-I/O
• Implemented on top of all I/O schemes discussed so far:
  – synchronous
  – asynchronous
• Performs additional scheduling of tasks on collective I/O operations like reading/writing from/to one file.
• Do not expect MPI-I/O files to be portable between different hosts and MPI-I/O implementations.
MPI-I/O
• I/O was not part of MPI-1
  – Asynchronous I/O only by means of different MPI tasks or AIO with threads
• MPI-2 provided an I/O interface implemented on top of the methods discussed
  – Asynchronous and synchronous I/O possible
  – Additional buffer management and locking mechanism to allow for collective operation on ONE file
  – MPI hints describe
    • File access methods and file system layout like stripe size and stripe unit
    • Array layout and sizes (if not given as a derived type)
    • Configuration of MPI internal buffer caches for data sieving
• Implementation on Altix based on ROMIO
  – http://www-unix.mcs.anl.gov/romio/
  – http://www-unix.mcs.anl.gov/~thakur/papers/romio-coll.pdf
MPI-I/O Essentials
• File handle manipulation
  – MPI_FILE_OPEN(comm, filename, amode, info, fh, ierror)
• Synchronous
  – MPI_FILE_WRITE(fh, buf, count, datatype, status, ierror)
  – MPI_FILE_READ(fh, buf, count, datatype, status, ierror)
• Asynchronous
  – MPI_FILE_IWRITE(fh, buf, count, datatype, request, ierror)
  – MPI_FILE_IREAD(fh, buf, count, datatype, request, ierror)
  – MPI_WAIT(request, status(MPI_STATUS_SIZE), ierror)
• Declaring access patterns, file system specs, etc.
  – MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info, ierror)
  – MPI_Info_create(info), MPI_Info_set(info, key, value), MPI_INFO_NULL
MPI-I/O And Direct I/O
• You can turn on direct I/O globally via the environment:
  – setenv MPI_DIRECT_READ true
  – setenv MPI_DIRECT_WRITE true
• Don't mix different I/O mechanisms during a run!
MPI I/O And Data Sieving
Reserved File Hints:
• cb_block_size
• cb_buffer_size
• cb_nodes
• collective_buffering
MPI-IO Tuning: Step 0
• Many independent, contiguous requests
  – No access information available to MPI system at runtime
MPI_File_open(MPI_COMM_SELF, "filename", ..., &fh);
for (i=0; i<n_rows; i++) {
    MPI_File_seek(fh, ...);
    MPI_File_read(fh, row[i], ...);
}
MPI-IO Tuning: Step 1
• Many collective, contiguous requests
  – MPI implementation expects to see same access patterns at multiple sites
  – Can lead to good read-ahead and prefetching decisions when the implementation sees patterns repeat at different processors
MPI_Info_set(info, "collective_buffering", "true");
MPI_File_open(MPI_COMM_WORLD, "filename", ..., info, &fh);
for (i=0; i<n_rows; i++) {
    MPI_File_seek(fh, ...);
    MPI_File_read_all(fh, row[i], ...);   /* collective counterpart of MPI_File_read */
}
MPI-IO Tuning: Step 2
• Single independent, non-contiguous request
  – Data sieving can be used
  – Based on an application-defined data type
MPI_Type_create_subarray(..., &subarray, ...);
MPI_Type_commit(&subarray);
MPI_File_open(MPI_COMM_SELF, "filename", ..., &fh);
MPI_File_set_view(fh, ..., subarray, ...);
MPI_File_read(fh, local_array, ...);
MPI-IO Tuning: Step 3
• Single collective, non-contiguous request
  – Data sieving can be used
  – Collective I/O can be used
MPI_Type_create_subarray(..., &subarray, ...);
MPI_Type_commit(&subarray);
MPI_File_open(MPI_COMM_WORLD, "filename", ..., &fh);
MPI_File_set_view(fh, ..., subarray, ...);
MPI_File_read_all(fh, local_array, ...);   /* collective counterpart of MPI_File_read */
MPI-IO Performance
MPI-IO Benchmark
================
16384 MBytes written/read in 32 blocks a 64 MBytes
by 8 Processes to /fastfs/reiner
using the min. time of 5 repetitions
WRITE: Time 16.211s => 1010.675 MBytes/s
READ: Time 9.789s => 1673.723 MBytes/s
MPI-I/O Hints for Performance
void set_info(MPI_Info myinfo)
{ char key[80],value[80];
 sprintf(key,"striping_unit");
 sprintf(value,"524288");
 MPI_Info_set(myinfo, key,value);
 sprintf(key,"striping_factor");
 sprintf(value,"16");
 MPI_Info_set(myinfo, key,value);
 sprintf(key,"collective_buffering");
 sprintf(value,"true");
 MPI_Info_set(myinfo, key,value);
 sprintf(key,"cb_block_size");
 sprintf(value,"131072");
 MPI_Info_set(myinfo, key,value);
 sprintf(key,"cb_buffer_size");
 sprintf(value,"1048576");
 MPI_Info_set(myinfo, key,value);
 sprintf(key,"cb_nodes");
 sprintf(value,"8");
 MPI_Info_set(myinfo, key,value);
}
main(){
 MPI_Barrier(MPI_COMM_WORLD);
 .
 MPI_Info_create(&myinfo);
 set_info(myinfo);
 ...
}
strace Command
linux# strace cp install.log /tmp
execve("/bin/cp", ["cp", "install.log", "/tmp"], [/* 22 vars */]) = 0
uname({sys="Linux", node="pc-daw", ...}) = 0
brk(0) = 0x8054ea8
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=73450, ...}) = 0
old_mmap(NULL, 73450, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40013000
close(3) = 0
...snip.....
open("install.log", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=23685, ...}) = 0
open("/tmp/install.log", O_WRONLY|O_CREAT|O_LARGEFILE, 0100644) = 4
fstat64(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=23685, ...}) = 0
read(3, "Installing 773 packages\n\nInstall"..., 4096) = 4096
write(4, "Installing 773 packages\n\nInstall"..., 4096) = 4096
read(3, "nstalling apmd-3.0.2-12.\nInstall"..., 4096) = 4096
write(4, "nstalling apmd-3.0.2-12.\nInstall"..., 4096) = 4096
read(3, "lling hotplug-2002_04_01-13.\nIns"..., 4096) = 4096
write(4, "lling hotplug-2002_04_01-13.\nIns"..., 4096) = 4096
read(3, "talling alchemist-1.0.24-4.\nInst"..., 4096) = 4096
write(4, "talling alchemist-1.0.24-4.\nInst"..., 4096) = 4096
read(3, "-3.\nInstalling gal-devel-0.19.2-"..., 4096) = 4096
write(4, "-3.\nInstalling gal-devel-0.19.2-"..., 4096) = 4096
read(3, "ksnapshot-3.0.3-3.\nInstalling kt"..., 4096) = 3205
write(4, "ksnapshot-3.0.3-3.\nInstalling kt"..., 3205) = 3205
read(3, "", 4096) = 0
close(4) = 0
close(3) = 0
_exit(0)
Tracing I/O
[Figure: the I/O hierarchy diagram shown twice (Text, Stack, Library, Library buffer, Data array, a.out, biosize), annotated with strace]
Default Buffer Size History
1) Buffers set to the disk geometry (like sectors per track)
   C/H/S: Cylinder, Head, Sector
   Before there were system caches, slow on reuse
2) Cache pages set to the disk geometry (like sectors per track)
   When writing through buffer cache more than reading
3) Cache pages set to request size
   When cache is for reading more than writing
   ZBR and RAID devices have different geometries
4) Privately cache within user space
   Avoids someone else polluting what you have in cache
   Takes more user memory
   User time increases instead of system time
5) Negotiate with sync, stat, fcntl system calls
   Portable from one filesystem to another
Summary
• Know your I/O characteristics
  – cached vs direct
  – sync vs async
  – sequential vs random
• strace shows the read and write system calls
• FFIO in C can modify I/O characteristics without rewriting the application
  – FF_IO environment variables checked at run time
• FFIO provides user level cache management
• FFIO has a trace of library I/O calls
Additional References
• open(2), read(2), write(2), lseek(2), stat(2), fcntl(2), setbuf(3), ioctl, strace(1), lsof(1) man pages
• info libc
• info libc -> low level I/O -> Asynchronous
• Application Programmer I/O Guide 007-3695-004