

    August 2004

Improve Database Performance on File System Containers in IBM DB2 Universal Database V8.2 using Concurrent I/O on AIX

Kenton DeLathouwer and Alan Y. Lee, IBM Software Group

Punit Shah, IBM pSeries Solution Enablement Group


    Contents

1 Introduction
2 Background
  2.1 DB2 Storage Model
  2.2 I/O modes available with SMS and DMS files on File System
    2.2.1 File system buffered I/O
    2.2.2 Memory mapped I/O
    2.2.3 Direct I/O
    2.2.4 Concurrent I/O
  2.3 I/O modes available with DMS on Raw Device
    2.3.1 Raw I/O
3 Enabling Direct I/O or Concurrent I/O in DB2 UDB V8.2
  3.1 Enablement at Tablespace Level
  3.2 Enablement at File System Level
4 Recommended OS Maintenance Levels & Fixes
5 Performance Tests
  5.1 DB2 Performance Considerations using Direct I/O and Concurrent I/O
    5.1.1 Disk Throughput
    5.1.2 CPU Utilization
    5.1.3 Memory Utilization
  5.2 Test System Configuration
  5.3 Performance Results
    5.3.1 Concurrent I/O: I/O Bound Results
    5.3.2 Concurrent I/O: CPU Bound Results
    5.3.3 Direct I/O Results on JFS File Systems
6 Conclusion
7 Appendix
  7.1 Appendix A: Direct I/O Limitations
    7.1.1 Non-coalesced Vectored I/O results in poor throughput
    7.1.2 Direct I/O buffer requirements
8 References
9 Acknowledgement
10 Glossary


    1 Introduction

Database server performance, typically measured in terms of transaction throughput and application response time, depends heavily on I/O subsystem performance. To attain the best possible I/O throughput, the layout of database table data demands special attention from database and system administrators. Today this is more pressing than it has ever been, considering the ever-widening gap between processor and disk speed. The chosen data layout also has a great impact on the manageability and extensibility of the database storage.

Prior to the Version 8.2 release, DB2 UDB performed I/O operations in one of three modes:

    File systems with buffered I/O

    File systems with memory mapped I/O (AIX only)

    Raw I/O.

For years, raw I/O (raw devices) has been the preferred choice for I/O-intensive database workloads, mainly because of its superior performance. Its remarkable performance is due to the fact that it bypasses the caching and locking mechanisms normally associated with file systems and offers a direct route to the physical device. However, raw I/O presents significant challenges for storage management. Conversely, file system storage is easy to manage and extend, but this ease of management comes at the cost of lower performance, especially for database workloads.

    In recent years, file system vendors have developed an alternate I/O mechanism, called

Direct I/O (DIO), which aims at eliminating one of the performance bottlenecks for database applications. The concept behind Direct I/O is similar to that of raw I/O in the sense

    that they both bypass caching at the file system level. This reduces CPU overhead and

    makes more memory available to the database instance, which can make more efficient

    use of it for its own purposes.

While Direct I/O has narrowed the performance gap between file systems and raw devices, some fundamental features of file systems, such as ensuring data integrity, still impact database performance. In 2003, IBM introduced a new file system feature called Concurrent I/O (CIO) for the Enhanced Journaled File System (JFS2) in the AIX 5L Version 5.2.10 maintenance update. This feature retains all the advantages of Direct I/O and also relieves the serialization of write accesses. Database performance achieved using CIO with JFS2 is comparable to that obtained using raw I/O.

    Starting with the Version 8.2 release, DB2 UDB supports DIO/CIO on AIX, and DIO on

    HP, Solaris, Linux, and Windows. The focus of this article is on AIX only. The

    background section first discusses various application I/O models available on AIX. It is

    then followed by a description of how DB2 takes advantage of the CIO feature.


Subsequent sections describe our performance test environment, configuration, and results, and conclude with recommendations.

    2 Background

    2.1 DB2 Storage Model

For each database created, DB2 UDB organizes storage in a hierarchy of logical storage management units called containers and tablespaces. Database table data and metadata (indexes) are stored in specified tablespaces or in default tablespaces. A tablespace is made up of one or more containers (a container cannot be part of more than one tablespace, however).

A container is a physical disk storage unit, such as an AIX logical volume, a file system file, or a directory. If a tablespace contains more than one container, DB2 stripes data across the containers. A database can have as many tablespaces as needed, with some system-imposed limits on the upper bound.

    There are two types of tablespaces:

    System Managed Space (SMS)

    Database Managed Space (DMS)

In an SMS tablespace, the operating system's file system manager allocates and manages the space where the table is stored. This storage model typically consists of many files, representing table objects, stored in the file system space. Storage space in an SMS tablespace is allocated as needed.

For a DMS tablespace, on the other hand, the DB2 database manager controls the storage space. This storage model consists of a number of devices or operating system file system files whose space is managed by DB2 UDB. The database administrator decides which devices and files to use, and DB2 UDB manages the space on them. Storage space in a DMS tablespace is pre-allocated. The DMS tablespace is essentially an implementation of a special-purpose file system designed to best meet the needs of the DB2 database manager.

Every database has unique I/O characteristics for its tablespaces' underlying data files. It is essential that each database I/O be optimized for maximum performance. A default

    I/O solution implemented for all tablespaces may not be acceptable. The following

    sections describe various I/O modes available in DB2 UDB.


    2.2 I/O modes available with SMS and DMS files on File System

    2.2.1 File system buffered I/O

At the most basic level, a file is simply a collection of bits stored on persistent media. When a process needs to access data in a file, the operating system brings the data into main memory, where the process can examine it, alter it, and then request that the data be saved to disk. The operating system could read and write data directly to and from the disk for each request, but the response time and throughput would be poor due to slow disk access times. The operating system therefore attempts to minimize the frequency of disk accesses by buffering data in main memory, within a structure called the file system buffer cache.

    On a file read request, the file system first attempts to read the requested data from the

buffer cache. If the data is not already present in the buffer cache, it is read from disk and cached in the buffer cache. Figures 1 and 2 show the sequence of actions that take

    place when a read request is issued under this caching policy.

Figure 1 - Read data page, file system buffer cache hit
Figure 2 - Read data page, file system buffer cache miss

Similarly, writes to a file are cached so that future reads can be satisfied without a disk access, and to reduce the frequency of disk writes. The use of a file system buffer cache can be extremely effective when the cache hit rate is high. It also enables the use of sequential read-ahead and write-behind policies to reduce the frequency of physical disk I/Os. Another benefit is that it makes file writes asynchronous, since the application can continue execution without waiting for the disk write to complete. Figure 3 shows the sequence of actions for a write request under cached I/O.

Figure 1 steps: (1) application issues a read request; (2) requested data is found in the file buffer cache; (3) requested data is copied to the application buffer.

Figure 2 steps: (1) application issues a read request; (2) kernel looks for the requested data in the file buffer cache; (3) the data is not present in the file buffer cache; (4) kernel reads the data from disk; (5) the data is cached in the file buffer cache; (6) the data is copied from the file buffer cache to the application buffer.


While the file system buffer cache improves I/O performance, it also consumes a significant portion of system memory. AIX's Enhanced Journaled File System, also known as JFS2, allows the system administrator to control the maximum amount of memory that can be used by the file system for caching. JFS2 uses a certain percentage of real memory for its file system buffer cache, specified by the maxclient% parameter; the maxclient% value can be tuned with the vmo command.

    2.2.2 Memory mapped I/O

With memory mapped I/O (MMAP I/O), the data of a file being accessed is mapped into the address space of a process (that is, a DB2 agent). A file can be mapped into the address space of more than one process, enabling concurrent read or write access to the file. A process can potentially map multiple files into its address space. The application must ensure file data integrity if a file is simultaneously mapped by multiple processes.

Once a file is mapped into the address space of a process, subsequent access to the file data is carried out just like access to any other memory location. MMAP I/O is subject to JFS2 file system buffering and prefetching, just like regular file system I/O. For these reasons, JFS2 does not support MMAP I/O in combination with Direct I/O or its variant, Concurrent I/O. Concurrent I/O is discussed later in this section.

MMAP I/O has the following advantages:

Parallel write access to file data without exclusive inode write locks.
File data is accessed without expensive read() and write() system calls. [1]

By default, MMAP I/O is on for file system containers (provided the file system is not mounted with the -o cio or -o dio option) to improve performance. MMAP I/O for DB2 UDB is controlled by the DB2_MMAP_READ and DB2_MMAP_WRITE registry variables. These can be turned off manually by setting them to OFF.
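For example, assuming a standard DB2 UDB instance, the registry variables could be turned off with the db2set command; registry changes of this kind normally take effect the next time the instance is restarted:

    db2set DB2_MMAP_READ=OFF
    db2set DB2_MMAP_WRITE=OFF
    db2set -all    # list the current registry settings to verify the change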

[1] Elimination of the read() and write() system calls does not mean elimination of disk read and write operations. Those reads and writes must still be performed; with memory mapped I/O, the AIX 5L kernel performs the disk accesses without the application initiating them through read() or write() system calls.

Figure 3 - File system buffered write. Steps: (1) application issues a write request; (2) kernel copies data from the application buffer to the file buffer cache; (3) application continues execution without waiting for the disk write; (4) periodic flushing of dirty file buffer cache pages is initiated by syncd; (5) dirty pages are written to disk.


    2.2.3 Direct I/O

Certain classes of applications derive no benefit from the file system buffer cache. Some technical workloads, for instance, never reuse data due to the sequential nature of their data access, resulting in poor buffer cache hit rates. DB2 UDB manages data caching using bufferpools, so it does not need the file system caching capability. The use of a file system buffer cache results in undesirable overhead in such cases, since data is first moved from the disk to the file system buffer cache and from there to the application buffer. This double-copying of data results in additional CPU consumption. Also, the duplication of application data within the file system buffer cache increases the amount of memory used for the same data, making less memory available for the application and resulting in additional system overhead due to memory management.

    For applications that wish to bypass the buffering of memory within the file system cache,

    Direct I/O is provided as an option on AIX (in both JFS and JFS2). When Direct I/O is

    used for a file, data is transferred directly from the disk to the application buffer, without

    the use of the file system buffer cache.

To avoid consistency issues, if multiple processes open the same file and one or more of them did not specify O_DIRECT while others did, the file stays in the normal cached I/O mode. Similarly, if the file is mapped in memory through the shmat() or mmap() system calls, it stays in normal cached mode. Once the last conflicting, non-direct access is eliminated (by using the close(), munmap(), or shmdt() system calls), the file is moved into Direct I/O mode. The change from cached mode to Direct I/O mode can be expensive, because all modified pages in memory have to be flushed to disk at that point.

    2.2.4 Concurrent I/O

Both buffered I/O and Direct I/O have the disadvantage of requiring a per-file write lock, or inode lock. This file system feature helps ensure data integrity and improve fault

    tolerance, but also poses performance bottlenecks for database applications. The details

    of inode locking are described below.

    Inode Locking

    While an application views a file as a contiguous stream of data, this is not actually how a

    file is stored on disk. In reality, a file is stored as a set of (possibly non-contiguous)

    blocks of data on disk. Each file has a data structure associated with it, called an inode.

    The inode contains all the information necessary for a process to access the file, such as

    file ownership, access rights, file size, time of last access or modification, and the

location of the file's data on disk. Since a file's data is spread across disk blocks, the inode contains a table of contents to help locate this data. It is important to note the

    distinction between changing the contents of an inode and changing the contents of a file.


The contents of a file change only on a write operation. The contents of an inode change when the contents of the corresponding file change, or when its owner, permissions, or any of the other information that is maintained as part of the inode changes. Thus, changing the contents of a file automatically implies a change to the inode, whereas a change to the inode does not imply that the contents of the file have changed. Since multiple threads may attempt to change the contents of an inode simultaneously, this could result in an inconsistent state of the inode. In order to avoid such race conditions, the inode is protected by a lock, called the inode lock. This lock is used for any access that could result in a change to the contents of the inode, preventing other processes from accessing the inode while it is in a possibly inconsistent state.

    The inode lock imposes write serialization at the file level. Serializing write accesses

    ensures that data inconsistencies due to overlapping writes do not occur. Serializing

reads with respect to writes ensures that the application does not read stale data. Sophisticated database applications implement their own data serialization, usually at a finer level of granularity than the file. Such applications implement serialization mechanisms at the application level to ensure that data inconsistencies do not occur, and that stale data is not read. Consequently, they do not need the file system to implement

    this serialization for them. The inode lock actually hinders performance in such cases, by

    unnecessarily serializing non-competing data accesses.

For such applications, AIX 5.2 ML01 and later versions offer the Concurrent I/O (CIO) option. Under Concurrent I/O, multiple threads can simultaneously perform reads and writes on a shared file. This option is intended primarily for relational database applications, most of which will operate under Concurrent I/O without any modification. Applications that do not enforce serialization for accesses to shared files should not use Concurrent I/O, as this could result in data corruption due to competing accesses.

Under Concurrent I/O, the inode lock is acquired in read-shared mode for both read and write accesses. However, in situations where the contents of the inode may change for reasons other than a change to the contents of the file (writes), the inode lock is acquired in write-exclusive mode. One such situation occurs when a file is extended or truncated. Extending a file may require allocation of new disk blocks for the file, and consequently requires an update to the table of contents of the corresponding inode. In this case, the read-shared inode lock is upgraded to write-exclusive mode for the duration of the extend operation. Similarly, when a file is truncated, allocated disk blocks might be freed and the inode's table of contents needs to be updated. Upon completion of the extend or truncate operation, the inode lock reverts to read-shared mode. This is a very powerful feature, since it allows files using Concurrent I/O to grow or shrink in a manner that is transparent to the application, without having to close or reopen files after a resize.

Another situation that results in the inode lock being acquired in write-exclusive mode occurs when an I/O request on the file violates the alignment or length restrictions of Direct I/O. Alignment violations result in normal cached I/O being used for the file, and the inode lock reverts to the normal read-shared/write-exclusive mode of operation.


    2.3 I/O modes available with DMS on Raw Device

    2.3.1 Raw I/O

Raw I/O bypasses the caching and locking mechanisms that are normally associated with file systems. Raw devices are physical devices that offer a direct route to the device and more control over the timing of I/O to that device. The drawback of raw I/O is that managing raw devices is considered more difficult than managing file systems.

3 Enabling Direct I/O or Concurrent I/O in DB2 UDB V8.2

As mentioned in previous sections, some of the I/O modes are mutually exclusive. For example, a file opened in CIO mode cannot be memory mapped without first closing it. In the Version 8.2 release, the DB2 UDB engine turns off the DB2_MMAP_READ and DB2_MMAP_WRITE registry variables automatically if CIO mode is detected.

    3.1 Enablement at Tablespace Level

New keywords, NO FILE SYSTEM CACHING and FILE SYSTEM CACHING, have been introduced to the CREATE and ALTER TABLESPACE SQL statements to allow users to specify whether DIO/CIO is to be used for each tablespace.

When NO FILE SYSTEM CACHING is in effect for a tablespace, DB2 UDB always attempts to use CIO wherever possible. In cases where CIO is not supported (for example, if JFS is used), DIO will be used instead. Examples of the new SQL syntax are shown below:

Example 1: CREATE TABLESPACE ...

By default, this new tablespace uses buffered I/O; the "FILE SYSTEM CACHING" keyword is implied.

Example 2: CREATE TABLESPACE ... NO FILE SYSTEM CACHING

The new keywords "NO FILE SYSTEM CACHING" indicate that caching at the file system level will be OFF for this particular tablespace.

Example 3: ALTER TABLESPACE ... NO FILE SYSTEM CACHING


This method of enabling DIO/CIO gives users control of the I/O mode at the tablespace level. All containers within the same tablespace share the same caching mode. This is the recommended method of enabling DIO/CIO with the Version 8.2 release of DB2 UDB. It can be used for both SMS and DMS tablespaces, except for the following:

SMS large files (LF)
SMS large object files (LOB)
SMS/DMS temporary tablespaces

The NO FILE SYSTEM CACHING keywords are ignored for the above cases.
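As an illustration, a complete statement for a DMS file tablespace with file system caching disabled might look like the following sketch, issued from the DB2 command line; the tablespace name, container path, and size (in pages) are hypothetical and would be adapted to the actual environment:

    db2 "CREATE TABLESPACE TS_DATA1
           MANAGED BY DATABASE
           USING (FILE '/db2fs1/containers/ts_data1_c0' 256000)
           NO FILE SYSTEM CACHING"

An existing tablespace can be switched with ALTER TABLESPACE ... NO FILE SYSTEM CACHING; the new caching mode generally takes effect only once the containers are reopened, for example after the database is deactivated and reactivated.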

    3.2 Enablement at File System Level

    An alternate method for enabling DIO/CIO is to use special mount options when

    mounting the JFS or JFS2 file systems. Table 1 shows the syntax for it:

File System Type           Mount command
Direct I/O (JFS)           mount -o dio
Concurrent I/O (JFS2)      mount -o cio

Table 1: Direct I/O and Concurrent I/O mount commands

When a file system is mounted with the -o dio or -o cio option, all files in the file system use DIO/CIO by default. DIO/CIO can be restricted to a subset of the files in a file system by placing the files that require DIO/CIO in a separate subdirectory and using namefs to mount this subdirectory over the file system. For example, if a file system somefs contains some files that should use DIO/CIO and others that should not, we can create a subdirectory, subsomefs, in which we place all the files that require DIO/CIO. We can mount somefs without specifying -o dio, and then mount subsomefs as a namefs file system with the -o dio option using the command:

mount -v namefs -o dio /somefs/subsomefs /someotherfs

This method differs from the previous one in that it can be done outside of DB2 UDB; if all the containers reside on the same file system, it is a much simpler method that does not require changing any of the existing tablespace creation scripts.
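For instance, a JFS2 file system dedicated to DB2 containers (the mount point below is hypothetical) can be remounted with the CIO option; adding cio to the options line of the corresponding /etc/filesystems stanza typically keeps the setting across reboots:

    umount /db2fs1
    mount -o cio /db2fs1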

    4 Recommended OS Maintenance Levels & Fixes

    The following table shows the recommended AIX maintenance levels and specific fixes

    for DIO/CIO support in the DB2 UDB V8.2 release.


Platform     File System Type   Recommended Fixes                               I/O Mechanism
AIX 4.3.3+   JFS                None                                            Direct I/O
AIX 5.1+     JFS, JFS2          None                                            Direct I/O
AIX 5.2+     JFS                None                                            Direct I/O
AIX 5.2+     JFS2               Maintenance Level 3 and additional AIX APARs*   Concurrent I/O**

Table 2: Recommended levels on AIX to support DIO/CIO

* Please refer to the "Known issues for DB2 Universal Database on AIX 5.2" web page for the list of specific APARs required:

http://www-1.ibm.com/support/docview.wss?rs=71&context=SSEPGG&q1=IY49385&uid=swg21165448&loc=en_US&cs=utf-8&lang=en+en

** DB2 uses CIO whenever possible. There is no support for using DIO on levels where CIO is supported.
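As a quick check that a given system meets these levels, the standard AIX commands below can be used; IY49385 is the APAR named in the URL above, and the complete APAR list should be taken from that web page:

    oslevel -r            # report the installed maintenance level, for example 5200-03
    instfix -ik IY49385   # report whether the specified APAR is installed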

    5 Performance Tests

As previously mentioned in section 2.2.2, MMAP I/O is the default I/O mode used by DB2 UDB on AIX. MMAP I/O still caches data at the file system level but avoids the inode contention problem. It has proved to be superior to standard buffered I/O. Hence, no performance tests were done with buffered I/O alone. Any mention of buffered I/O used by DB2 UDB in this section implies MMAP I/O.

5.1 DB2 Performance Considerations using Direct I/O and Concurrent I/O

Since Concurrent I/O implicitly invokes Direct I/O, all the performance considerations for Direct I/O hold for Concurrent I/O as well. Thus, applications that benefit from file system read-ahead, or that have a high file system buffer cache hit rate, would probably see their performance deteriorate with Concurrent I/O, just as they would with Direct I/O. On AIX 5.2.10, Concurrent I/O is the preferred method, since it is not subject to the buffer size and alignment limitations that limit Direct I/O's effectiveness on JFS and pre-5.2.10 JFS2 file systems.

    The benefits of Direct I/O and Concurrent I/O with DB2 are evident in:


    Disk throughput

    System CPU utilization

    Memory usage

With Direct I/O and Concurrent I/O, overall performance improvements are largely a result of freeing up system CPU cycles for use by the application. On I/O-bound systems, the performance gain is less dramatic, since the application is waiting on data from the disks to continue processing and any extra CPU cycles are wasted.

    5.1.1 Disk Throughput

Higher sequential read throughput depends, in part, on the read transfer size and, consequently, the disk transaction rate. On AIX, with file system buffered I/O, maximum sequential read throughput on JFS/JFS2 is limited by the largest transfer size the sequential read-ahead algorithm will use. The virtual memory manager (VMM) detects when sequential read access to a file is occurring. As the sequential file access continues, the read-ahead algorithm prefetches up to maxpgahead (j2_maxPageReadAhead on JFS2) pages. If the transfer size requested by the application is larger than the JFS/JFS2 maximum read-ahead, then there will be no extra benefit from file system read-ahead.

For example, with the default maxpgahead (j2_maxPageReadAhead) value of 8 and the default page size of 4096 bytes, the largest read-ahead transfer size once the algorithm ramps up will be 32 KB. To achieve better throughput for sequential reads on buffered file systems, the maximum read-ahead page count should be set to a value larger than the default.
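On AIX 5.2, these read-ahead maximums are I/O tunables that can typically be raised with the ioo command (earlier releases use vmtune); the values below are only illustrative and mirror the setting of 128 used in the JFS2 tests described later:

    ioo -o j2_maxPageReadAhead=128   # JFS2 maximum read-ahead, in pages
    ioo -o maxpgahead=128            # JFS maximum read-ahead, in pages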

Read throughput tests comparing reads from raw logical volumes, Concurrent I/O mounted JFS2, and buffered JFS2 with j2_maxPageReadAhead settings of 8 and 128 are shown in Figure 4. The results showed that, for transfer sizes larger than 256 KB, throughput was roughly the same for raw logical volumes, Concurrent I/O, and buffered I/O with a j2_maxPageReadAhead setting of 128.


Figure 4: Sequential read throughput (MB/s) on FAStT storage at transfer sizes from 1 KB to 8 MB, comparing RAW, buffered JFS2 with j2_maxPageReadAhead of 8 and of 128, and JFS2 CIO.

Each read/write application call with Direct I/O and Concurrent I/O results in a disk access. When smaller read transfer sizes are used, a higher transaction rate results, and as the read transfer size requested by the application increases, the throughput with Concurrent I/O increases to the maximum throughput limit of the storage system. The throughput of raw LVs and Concurrent I/O mounted JFS2 is roughly equal across transfer sizes, since both bypass the file system cache. For raw I/O, requests are handled at the LVM layer, bypassing JFS/JFS2 entirely, while Concurrent I/O mounted JFS2 achieves the same throughput characteristics by bypassing the file system cache. On JFS/JFS2 using buffered I/O, maximum sequential read throughput is affected by the maximum sequential prefetch size. With file system buffered I/O, read increments are determined by the sequential read-ahead algorithm regardless of the transfer size requested by the application. Test results show that sequential read throughput using a j2_maxPageReadAhead value of 128 can achieve sustained throughput comparable to that of RAW logical volumes.

In the test environments, DB2 prefetchers were configured to perform large reads (128 KB to 256 KB) in order to achieve optimal read throughput. For the buffered JFS2 tests, a value of 128 was used for j2_maxPageReadAhead.

    5.1.2 CPU Utilization

With file system caching, the VMM performs CPU-intensive, memory-to-memory moves from the file system cache to the application I/O buffer. In addition, the sync daemon periodically checks the file system cache for dirty pages and writes them out to


disk. These services incur overhead, and additional system CPU cycles are required. However, some applications see no benefit from maintaining a file system cache, since the data is not accessed again before it is swapped out. In the case of most large database applications, file caching is handled by the application. Applications such as these, which see little or no benefit from file system caching, will see significant gains in performance as a result of the extra CPU resources made available by bypassing the file system cache entirely. This is most relevant for CPU-constricted environments.

When testing a CPU-bound query, SMS file containers with Concurrent I/O and RAW logical volume containers showed the same level of system CPU utilization. However, SMS containers using buffered I/O used an average of 6% more system CPU. The performance gain achieved as a result of the extra cycles yielded a 14% gain in query run time (see Figure 5).

    5.1.3 Memory Utilization

As with CPU utilization, file system caching requires additional memory resources, because memory frames are needed for use as file cache. On systems where memory is overcommitted, the benefit of file system caching may not be worth the memory resources needed for it. For example, in most large database applications, the database software does its own I/O caching. If the file system cache hit ratio is very low, the benefit of the file system cache is negligible. In these situations, memory can often be better utilized by the database application for sorting, prefetching, and so on.

The virtual memory manager (VMM) can be configured to use only a percentage of real memory page frames for file pages. The vmo parameters (vmtune on AIX 5.1 and earlier) maxperm% and maxclient% (for JFS2) have default values of 80. With this default setting, up to 80% of real memory frames can be used as file pages before the VMM page-stealing algorithm steals only file frames. Normally, the VMM steals only file pages, but it will steal computational pages if their repage rate is lower. For the test workload, where the majority of system memory is used by DB2 for the bufferpools and/or sortheap, this parameter was set to 20% in order to minimize the number of computational pages swapped out in favor of file pages and to avoid repaging. The advantage of using RAW logical volumes or Direct/Concurrent I/O is that no memory frames are used for file pages, since the file system cache is not being used.

Figure 5: System CPU utilization (%) over time for the CPU-constricted query, comparing RAW, JFS2 buffered, and JFS2 CIO containers.
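A sketch of the maxperm%/maxclient% tuning described above, mirroring the 20% setting used for the test workload (not a general recommendation; related tunables such as minperm% and strict_maxclient may also need attention on a given system):

    vmo -o maxclient%=20
    vmo -o maxperm%=20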

The results were obtained from a DSS-type workload in which large table scans (sequential reads from tables) were common in the tested queries. The characteristics of this workload are such that the cache (at the application or OS level) is largely used for sequential read-ahead. The test results show that this type of workload sees large benefits from using Concurrent I/O, because the JFS/JFS2 sequential read-ahead is redundant: DB2 performs its own prefetching.

    5.2 Test System Configuration

Tests were run using three different system configurations. For Concurrent I/O, the system described in Table 3 was used to gather results for I/O-constricted and CPU-constricted environments. These tests were run on the same physical system. In order to simulate a CPU-bound workload, the number of active CPUs was reduced by using AIX resource sets (rsets) to decrease the available CPU processing capacity.

In order to test Direct I/O, results were gathered from the system configuration described in Table 4. This system was configured to meet the requirements for Direct I/O. A smaller database using containers on non-large-file-enabled JFS was used. With non-large-file-enabled file systems, the JFS Direct I/O buffer size and alignment requirements do not cause a problem for DB2. Additionally, to circumvent the non-coalescing vectored read/write issue that affects Direct I/O on JFS, block-based bufferpools were utilized. (Refer to Appendix A for details on Direct I/O requirements and limitations.)

For both RAW logical volumes and file systems, three data tablespaces and one temporary tablespace were used. The DSS workload consisted of 8 tables, one of which accounted for ~70% of the data. For fair comparison between the raw logical volume and file system tests, AIX logical volumes were configured identically. For the tests on file system containers, an equal number of JFS2 file systems were created on the same physical disks as the raw logical volumes used in the raw tests.

Test Configuration #1: Concurrent I/O

System
  Machine: IBM p690
  Processor: 8 x PowerPC POWER4, 1300 MHz
  Memory: 32 GB
  Cache: 1440 KB L2 cache
  OS: AIX 5L 5.2 ML02, 64-bit
Storage
  FAStT900: 80 disks, 16 x 4+p RAID5 arrays
  ESS Shark
DB2 Configuration
  Version: DB2 UDB V8.2 EEE
  Workload: OLAP, ~100 GB, 22 queries
  Partitions: 8 logical nodes
DB2 Tablespace Configuration


Tablespace  Storage    Arrays  Page Size  Size    Nodes
DATA1       FAStT900   16      16 KB      100 GB  0-7
DATA2       ESS Shark  16      16 KB      48 GB   0-7
DATA3       ESS Shark  1       4 KB       < 1 GB  0
TEMP        ESS Shark  16      4 KB       200 GB  0-7

Table 3: System configuration for Concurrent I/O results

Test Configuration #2: Direct I/O

System
  Machine:
  Processor: 12 x PowerPC POWER4, 262 MHz
  Memory: 16 GB
  Cache: 1440 KB L2 cache
  OS: AIX 5L 5.2 ML02, 64-bit
Storage
  SSA: 24 x 18 GB SSA160 disks
DB2 Configuration
  Version: DB2 UDB V8.2 EEE
  Workload: OLAP, ~30 GB, 22 queries
  Partitions: 12 logical nodes
DB2 Tablespace Configuration

Tablespace  Storage  Arrays  Page Size  Size    Nodes
DATA1       SSA160   12      16 KB      30 GB   0-11
DATA2       SSA160   12      16 KB      15 GB   0-11
DATA3       SSA160   1       4 KB       < 1 GB  0
TEMP        SSA160   12      4 KB       60 GB   0-11

Table 4: System configuration for Direct I/O results

    5.3 Performance Results

The performance of DMS RAW tablespaces was compared with that of SMS file tablespaces with and without Concurrent I/O, and with DMS file tablespaces with and without Concurrent I/O. Results are reported in terms of the tablespace type used by each setup.

Tablespace Type           Tablespace DDL
DMS RAW                   CREATE TABLESPACE ... MANAGED BY DATABASE USING DEVICE ...
DMS file CIO              CREATE TABLESPACE ... MANAGED BY DATABASE USING FILE ... NO FILE SYSTEM CACHING
SMS file CIO              CREATE TABLESPACE ... MANAGED BY SYSTEM USING ... NO FILE SYSTEM CACHING
DMS file default (MMAP)   CREATE TABLESPACE ... MANAGED BY DATABASE USING FILE ... FILE SYSTEM CACHING
SMS file default (MMAP)   CREATE TABLESPACE ... MANAGED BY SYSTEM USING ... FILE SYSTEM CACHING

Table 5: DB2 tablespace creation statements for tested container configurations

Each test result shows the relative performance of each tablespace configuration. The DMS RAW time is used as a baseline, and the times of the other configurations are compared to it. For example, Figure 6 shows that the SMS file MMAP result is 96% of the DMS RAW time, so the overall run time using SMS file containers without Concurrent I/O was 4% slower than DMS RAW.

For each of the tests, vmstat data was collected at 30-second intervals to profile the CPU utilization of the system throughout each run. vmstat breaks down CPU utilization into four categories: %system, %user, %idle, and %iowait. The aggregate %user and %system CPU utilization of each entire run is reported for each of the tests. The difference in %system is a good indication of the extra overhead in system CPU cycles consumed by file system buffered I/O, as well as an indication of the performance gains that are possible using Concurrent I/O.
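For reference, samples of this kind can be collected with a command of the following form; the sample count and output file name are illustrative:

    vmstat 30 120 > vmstat_cio_run.out &   # one sample every 30 seconds, 120 samples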

5.3.1 Concurrent I/O: I/O Bound Results

The characteristics of the p690 system used for these tests were such that the sustained throughput from the FAStT900 and ESS storage systems plus connecting hardware was not high enough to deliver data from the disks fast enough to keep all eight processors busy. CPU usage statistics collected with vmstat indicated that 30% to 60% of CPU time (depending on container configuration) was spent idle or in I/O wait. Although this test was run on an I/O-bound system, runtime comparisons indicate that query performance of DB2 using DMS file or SMS containers on JFS2 with Concurrent I/O does provide an improvement over DMS or SMS file MMAP I/O and matches the performance seen using DMS raw containers. SMS file containers using default cached JFS2 resulted in the slowest performance, at 4% slower than DMS raw.


Figure 6: I/O-bound workload - relative performance of DMS RAW, DMS file CIO, SMS file CIO, DMS file, and SMS file containers (DMS RAW = 100% baseline).

Figure 7: I/O-bound workload - aggregate %user and %system CPU usage for the same container configurations.

Overall CPU utilization data collected throughout the 22-query run indicates that significantly less system CPU was used with raw logical volumes and JFS2 Concurrent I/O containers. Concurrent I/O on both SMS and DMS file tablespaces shows a 2-3% increase in system CPU over DMS raw tablespaces, while tablespaces using JFS2 with buffered I/O show system CPU utilization over 25% higher than raw. Clearly, the use of Concurrent I/O to bypass the file system cache frees up significant processing resources. Figure 7 indicates that while roughly the same %user CPU was used for each container setup, RAW containers and containers using Concurrent I/O used approximately 25% fewer system CPU cycles. Since the majority of the queries were I/O constricted, the extra CPU cycles freed up by using Concurrent I/O did not result in significant performance gains.

In this I/O-bound scenario, the performance of the queries was almost entirely dependent upon the throughput that the storage systems and connected adapters could provide. Any extra processing power was wasted, and therefore the surplus CPU resources provided by bypassing the file system cache did not translate into faster runtimes. Regardless, this highlights the potential gain that could be seen with a similar workload run on a CPU-constricted system.

5.3.2 Concurrent I/O: CPU Bound Results

The equivalent DSS workload was run on the same p690 system, but to simulate a CPU-constricted environment, rsets were used so that only four of the eight processors were utilized.

In this set of runs, the system ran at 50% idle throughout, since only four of the eight


processors were utilized. This simulation demonstrates the actual performance gains that are seen as a direct result of freeing up CPU resources by lowering the %system CPU requirements. The CPU usage in Figure 10 has been adjusted to reflect the fact that only half of the system's processing potential was available.

With a CPU-bound run, the processing power of the system is the major performance-determining factor. On average, I/O throughput from the storage systems is somewhat less than that of the I/O-bound runs, since the throughput is limited by the speed at which the processors are able to process data, as opposed to the throughput the storage hardware allows. Comparing the I/O throughput of DMS raw on CPU-bound and I/O-bound test runs, throughput of the I/O-bound runs averages 20% higher than that of the CPU-bound run (Figure 8).

The CPU-bound test results show the most dramatic performance improvements of CIO over buffered file systems. Here, the additional system CPU cycles freed by bypassing the file system cache allow more cycles for %user and other %system processing and hence result in faster runtimes. The most significant improvement is seen comparing SMS file without CIO to SMS file CIO, where overall runtime performance improves by approximately 24%. Runtimes of both SMS and DMS file using Concurrent I/O show performance comparable to DMS RAW (Figure 9).

Figure 8: Read throughput (KBps) over time for DMS RAW containers, comparing the I/O-bound and CPU-bound runs.


Figure 9: CPU-bound workload - relative performance of DMS RAW, DMS file CIO, SMS file CIO, DMS file, and SMS file containers (DMS RAW = 100% baseline).

Figure 10: CPU-bound workload - aggregate %user and %system CPU usage for the same container configurations.

Similar to the I/O-bound tests, the CPU-bound system utilization shown in Figure 10 indicates that the %system CPU utilization for the buffered I/O file system containers was over 25% higher than that of the CIO runs. Also, as with the I/O-bound results, the %system usage for SMS and DMS file CIO was 2-3% higher than DMS RAW. The main difference appears in the %user comparisons, where an extra 5-7% CPU utilization is seen using CIO. Compare this to the I/O-bound results, where the %user for all container configurations was roughly equal. The performance gains of the CIO runs highlight the impact of removing the processor overhead of the file system cache so that more resources can be used by DB2.


    5.3.3 Direct I/O Results on JFS File Systems

The final test results were collected on a system configured to meet the requirements for Direct I/O. For the Direct I/O results, DMS RAW performance was compared with that of SMS file containers on JFS with and without Direct I/O. As mentioned previously, in order to successfully use Direct I/O, DB2 containers on non-large-file-enabled JFS were required, as well as the use of block-based buffer pools. The test system running the Direct I/O tests was I/O constricted.

Results of the Direct I/O tests show that SMS file DIO does provide a performance gain over SMS file buffered, but the extent of the improvement is not as significant as what was seen in the Concurrent I/O test results. In these tests, SMS file using Direct I/O improved performance by approximately 5% (Figure 11).

    6 Conclusion

Both Concurrent I/O and Direct I/O show noticeable performance improvements over their buffered counterparts. For tablespace containers on JFS, Direct I/O will provide a performance improvement over buffered I/O, but because of the limitations that restrict the use of Direct I/O, Concurrent I/O on JFS2 is recommended. If Direct I/O is used on a configuration where the restrictions are not met, performance will suffer very large degradations.

Concurrent I/O provides significant performance gains due, in large part, to lower system CPU utilization. By bypassing the file system cache, processing resources that would otherwise be used to perform memory-to-memory copies and to write out dirty pages are made available to the database. Since DB2 performs file caching on its own, any benefit of maintaining a file system cache is eclipsed by the performance that can be gained when DB2 uses those resources itself. The performance of DMS RAW containers has long shown this to be true. With Concurrent I/O, the manageability benefits of SMS or DMS file tablespaces can be realized without sacrificing performance.

Figure 11: Direct I/O - relative performance of DMS RAW, SMS file DIO, and SMS file containers (DMS RAW = 100% baseline).


    7 Appendix

    7.1 Appendix A: Direct I/O Limitations

In order to effectively use DB2 with Direct I/O on the Journaled File System (JFS), two separate issues must be addressed:

Coalescing of vectored I/Os
Direct I/O buffer size and alignment

    7.1.1 Non-coalesced Vectored I/O results in poor throughput

First, there is a known AIX problem involving vectored I/O on JFS Direct I/O, where a vectored read/write call (for example, readv) is not coalesced. For example, when a readv() call of size 256 KB is dispatched, the call will result in sixteen 16 KB transfers instead of one 256 KB transfer. Clearly, this behavior has a very large impact on read performance due, in part, to the higher disk transaction rate and the smaller number of kilobytes per read transaction. This problem also applies to Concurrent I/O on JFS2 but has been addressed there in AIX APAR IY49346. By default, if an OS supports vectored access, DB2 prefetchers will use it. As of Version 8, DB2 provides block-based bufferpools as an enhancement to prefetching. With a block-based bufferpool, multiple pages are prefetched in a single I/O, reducing system CPU overhead and improving prefetch performance. On AIX, instead of using vectored I/O to prefetch sequential data, block read calls use pread(). The use of block-based bufferpools with tablespace containers using JFS and Direct I/O will limit the negative effects of the non-coalescing read/write issue.
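As a sketch, a block-based buffer pool and an SMS tablespace on JFS that uses it could be defined as follows; the names, sizes (in 4 KB pages), and block parameters are hypothetical, and the BLOCKSIZE would normally be chosen to match the tablespace EXTENTSIZE:

    db2 "CREATE BUFFERPOOL BP_BLOCK SIZE 50000 NUMBLOCKPAGES 25000 BLOCKSIZE 16"
    db2 "CREATE TABLESPACE TS_DIO MANAGED BY SYSTEM USING ('/db2jfs1/ts_dio') EXTENTSIZE 16 PREFETCHSIZE 64 BUFFERPOOL BP_BLOCK NO FILE SYSTEM CACHING"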

    7.1.2 Direct I/O buffer requirements

The second issue applies specifically to large-file-enabled JFS. Direct I/O on JFS has requirements for buffer size, offset, and alignment that differ between large-file-enabled and regular file systems. With large-file-enabled JFS, the minimum buffer size requirement is very high (128 KB). While DB2 prefetchers can be configured to ensure that prefetch read calls are at least this minimum size, any non-sequential prefetch that reads a single page will not meet this requirement, since the maximum page size available for a tablespace is 32 KB. A read or write that does not meet the Direct I/O requirements will not fail, but will fall back to file system buffered I/O; after the data is transferred to the application buffer, the cached copy is discarded. This also incurs a heavy penalty, as any cached file data must be flushed out to disk. Standalone read throughput test results shown in Figure 12 illustrate how read performance degrades when this occurs.


Figure 12: Sequential read throughput (MB/s) at various read block sizes (1 KB to 4 MB), comparing JFS Direct I/O and JFS2 Concurrent I/O.

The tests, run on FAStT900 storage, compared the read throughput of Direct I/O and Concurrent I/O mounted file systems using different transfer sizes. Data was read concurrently from sixteen large-file-enabled file systems. With Direct I/O, when reads were attempted using buffer sizes smaller than 128 KB, file access fell back to buffered I/O at 4 KB per transaction, without sequential read-ahead.

Concurrent I/O on JFS2 also has access requirements, but the minimum transfer size and file offset alignment restrictions are the same as those of Direct I/O with non-large-file-enabled JFS and are not problematic for DB2 prefetching. For all the Direct I/O and Concurrent I/O requirements, see Table 7.

File System Format                   Maximum Transfer Size   Minimum Transfer Size   File Offset Alignment
JFS Direct I/O: 4 KB block           2 MB                    4 KB                    4 KB
JFS Direct I/O: large file enabled   2 MB                    128 KB                  4 KB
JFS2 Concurrent I/O                  4 GB                    4 KB                    4 KB

Table 7: Direct I/O and Concurrent I/O access requirements


    8 References

DB2 Certification Guides series of books
AIX 5L Performance Management Guide
DB2 UDB Administration Guide (especially the Performance volume)
DB2 Information Management Library: http://www-3.ibm.com/software/data/pubs/
IBM Redbooks on DB2: http://www.redbooks.ibm.com

    9 Acknowledgement

    Sujatha Kashyap, Bret Olszewski, Richard Hendrickson. Improving Database

    Performance With AIX Concurrent I/O. IBM Corporation 2003.

    10 Glossary

CIO - Concurrent I/O. See section 2.2.4.

DIO - Direct I/O. See section 2.2.3.

DMS - Database Managed Space, a storage model that consists of a limited number of devices or files whose space is managed by DB2 UDB. The database administrator decides which devices and files to use, and DB2 UDB manages the space on those devices and files.

I-node - A data structure associated with each file that contains all the information necessary for a process to access the file, such as file ownership, access rights, file size, time of last access or modification, and the location of the file's data on disk.

MMAP I/O - Memory Mapped I/O. See section 2.2.2.

SMS - System Managed Space, a storage model that consists of many files, representing table objects, stored in the file system space. The user decides on the location of the files, DB2 UDB controls their names, and the file system is responsible for managing them.

Sync daemon - A UNIX process (/usr/sbin/syncd) that runs at a fixed interval (the default is every 60 seconds). Its task is to force a write of dirty (modified) pages in the file buffer cache out to disk.


Copyright IBM Corporation 2004
IBM Canada
8200 Warden Avenue, Markham, ON L6G 1C7, Canada

Printed in United States of America. All Rights Reserved.

IBM, the IBM logo, DB2, DB2 Universal Database, AIX, AIX 5L, and pSeries are trademarks of International Business Machines Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

All performance estimates are provided AS IS and no warranties or guarantees are expressed or implied by IBM. Users of this document should verify the applicable data for their specific environment.