

    August 2004

Improve Database Performance on File System Containers in IBM DB2 Universal Database V8.2 using Concurrent I/O on AIX

Kenton DeLathouwer and Alan Y. Lee, IBM Software Group

Punit Shah, IBM pSeries Solution Enablement Group


    Contents

1 Introduction
2 Background
  2.1 DB2 Storage Model
  2.2 I/O modes available with SMS and DMS files on File System
    2.2.1 File system buffered I/O
    2.2.2 Memory mapped I/O
    2.2.3 Direct I/O
    2.2.4 Concurrent I/O
  2.3 I/O modes available with DMS on Raw Device
    2.3.1 Raw I/O
3 Enabling Direct I/O or Concurrent I/O in DB2 UDB V8.2
  3.1 Enablement at Tablespace Level
  3.2 Enablement at File System Level
4 Recommended OS Maintenance Levels & Fixes
5 Performance Tests
  5.1 DB2 Performance Considerations using Direct I/O and Concurrent I/O
    5.1.1 Disk Throughput
    5.1.2 CPU Utilization
    5.1.3 Memory Utilization
  5.2 Test System Configuration
  5.3 Performance Results
    5.3.1 Concurrent I/O: I/O Bound Results
    5.3.2 Concurrent I/O: CPU Bound Results
    5.3.3 Direct I/O Results on JFS File Systems
6 Conclusion
7 Appendix
  7.1 Appendix A: Direct I/O Limitations
    7.1.1 Non-coalesced Vectored I/O results in poor throughput
    7.1.2 Direct I/O buffer requirements
8 References
9 Acknowledgement
10 Glossary


    1 Introduction

Database server performance, typically measured in terms of transaction throughput and application response time, depends heavily on I/O subsystem performance. To attain the best possible I/O throughput, the layout of database table data demands special attention from database and system administrators. Today this is more pressing than it has ever been, considering the ever-widening gap between processor and disk speed. The chosen data layout also has a great impact on the manageability and extensibility of the database storage.

Prior to the Version 8.2 release, DB2 UDB performed I/O operations in one of three modes:

    File systems with buffered I/O

    File systems with memory mapped I/O (AIX only)

    Raw I/O.

For years, raw I/O (raw devices) has been the preferred choice for I/O-intensive database workloads, mainly because of its superior performance. Its remarkable performance is due to the fact that it bypasses the caching and locking mechanisms normally associated with file systems and offers a direct route to the physical device. However, raw I/O presents significant challenges for storage management. Conversely, file system storage is easy to manage and extend, but this ease of management comes at the cost of lower performance, especially for database workloads.

    In recent years, file system vendors have developed an alternate I/O mechanism, called

Direct I/O (DIO), which aims at eliminating one of the performance bottlenecks for database applications. The concept behind Direct I/O is similar to that of raw I/O in the sense

    that they both bypass caching at the file system level. This reduces CPU overhead and

    makes more memory available to the database instance, which can make more efficient

    use of it for its own purposes.

While Direct I/O has narrowed the performance gap between file systems and raw devices, some fundamental features of file systems, such as ensuring data integrity, still impact database performance. In 2003, IBM introduced a new file system feature called Concurrent I/O (CIO) for the Enhanced Journaled File System (JFS2) in the AIX 5L Version 5.2.10 maintenance update. This feature retains all the advantages of Direct I/O and also relieves the serialization of write accesses. Database performance achieved using CIO with JFS2 is comparable to that obtained using raw I/O.

    Starting with the Version 8.2 release, DB2 UDB supports DIO/CIO on AIX, and DIO on

    HP, Solaris, Linux, and Windows. The focus of this article is on AIX only. The

    background section first discusses various application I/O models available on AIX. It is

    then followed by a description of how DB2 takes advantage of the CIO feature.


Subsequent sections describe our performance test environment, configuration, and results, and conclude with recommendations.

    2 Background

    2.1 DB2 Storage Model

For each database created, DB2 UDB organizes storage in a hierarchy of logical storage management units called containers and tablespaces. Database table data and metadata (indexes) are stored in specified tablespaces or in default tablespaces. A tablespace is made up of one or more containers (a container cannot be part of more than one tablespace, however).

A container is a physical disk storage unit, such as an AIX logical volume, a file system file, or a directory. If a tablespace contains more than one container, DB2 stripes data across the containers. A database can have as many tablespaces as needed, with some system-imposed limits on the upper bound.

    There are two types of tablespaces:

    System Managed Space (SMS)

    Database Managed Space (DMS)

In an SMS tablespace, the operating system's file system manager allocates and manages the space where the table is stored. This storage model typically consists of many files, representing table objects, stored in the file system space. Storage space in an SMS tablespace is allocated as needed.

For a DMS tablespace, on the other hand, the DB2 database manager controls the storage space. This storage model consists of a number of devices or operating system file system files whose space is managed by DB2 UDB. The database administrator decides which devices and files to use, and DB2 UDB manages the space on them. Storage space in a DMS tablespace is pre-allocated. The DMS tablespace is essentially an implementation of a special-purpose file system designed to best meet the needs of the DB2 database manager.

Every database has unique I/O characteristics for its tablespaces' underlying data files. It is essential that each database I/O be optimized for maximum performance. A default

    I/O solution implemented for all tablespaces may not be acceptable. The following

    sections describe various I/O modes available in DB2 UDB.


    2.2 I/O modes available with SMS and DMS files on File System

    2.2.1 File system buffered I/O

At the most basic level, a file is simply a collection of bits stored on persistent media. When a process needs to access data in a file, the operating system brings the data into main memory, where the process can examine it, alter it, and then request that the data be saved to disk. The operating system could read and write data directly to and from the disk for each request, but the response time and throughput would be poor due to slow disk access times. The operating system therefore attempts to minimize the frequency of disk accesses by buffering data in main memory, within a structure called the file system buffer cache.

    On a file read request, the file system first attempts to read the requested data from the

buffer cache. If the data is not already present in the buffer cache, it is read from disk and cached in the buffer cache. Figures 1 and 2 show the sequence of actions that take

    place when a read request is issued under this caching policy.

Figure 1 - Read data page, file system buffer cache hit
Figure 2 - Read data page, file system buffer cache miss

Similarly, writes to a file are cached so that future reads can be satisfied without a disk access, and to reduce the frequency of disk writes. The use of a file system buffer cache can be extremely effective when the cache hit rate is high. It also enables the use of sequential read-ahead and write-behind policies to reduce the frequency of physical disk I/Os. Another benefit is that it makes file writes asynchronous, since the application can continue execution without waiting for the disk write to complete. Figure 3 shows the sequence of actions for a write request under cached I/O.

Figure 1 steps: (1) application issues a read request; (2) requested data is found in the file buffer cache; (3) requested data is copied to the application buffer.

Figure 2 steps: (1) application issues a read request; (2) kernel looks for the requested data in the file buffer cache; (3) the data is not present in the file buffer cache; (4) kernel reads the data from disk; (5) the data is cached in the file buffer cache; (6) the data is copied from the file buffer cache to the application buffer.


While the file system buffer cache improves I/O performance, it also consumes a significant portion of system memory. AIX's Enhanced Journaled File System, also known as JFS2, allows the system administrator to control the maximum amount of memory that can be used by the file system for caching. JFS2 uses a certain percentage of real memory for its file system buffer cache, specified by the maxclient% parameter; the maxclient% value can be tuned with the vmo command.

    2.2.2 Memory mapped I/O

With memory mapped I/O (MMAP I/O), the data of a file being accessed is mapped into the address space of a process (that is, a DB2 agent). A file can be mapped into the address space of more than one process, enabling concurrent read or write access to the file. A process can potentially map multiple files into its address space. The application must ensure file data integrity if a file is simultaneously mapped by multiple processes.

Once a file is mapped into the address space of a process, subsequent access to the file data is carried out just like access to any other memory location. MMAP I/O is subject to JFS2 file system buffering and prefetching, just like regular file system I/O. For these reasons, JFS2 does not support MMAP I/O in combination with Direct I/O or its variant, Concurrent I/O. Concurrent I/O is discussed later in this section.

MMAP I/O has the following advantages:

Parallel write access to file data without exclusive inode write locks.
File data is accessed without expensive read() and write() system calls. [1]

By default, MMAP I/O is on for file system containers (provided the file system is not mounted with the -o cio or -o dio option) to improve performance. MMAP I/O for DB2 UDB is controlled by the DB2_MMAP_READ and DB2_MMAP_WRITE registry variables. These can be turned off manually by setting them to OFF.
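For example, assuming a standard DB2 UDB instance, the registry variables could be turned off with the db2set command; registry changes of this kind normally take effect the next time the instance is restarted:

    db2set DB2_MMAP_READ=OFF
    db2set DB2_MMAP_WRITE=OFF
    db2set -all    # list the current registry settings to verify the change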

[1] Elimination of the read() and write() system calls does not mean elimination of disk read and write operations. Those reads and writes must still be performed; with memory mapped I/O, the AIX 5L kernel performs the disk accesses without the application initiating them through read() or write() system calls.

Figure 3 - File system buffered write. Steps: (1) application issues a write request; (2) kernel copies data from the application buffer to the file buffer cache; (3) application continues execution without waiting for the disk write; (4) periodic flushing of dirty file buffer cache pages is initiated by syncd; (5) dirty pages are written to disk.


    2.2.3 Direct I/O

Certain classes of applications derive no benefit from the file system buffer cache. Some technical workloads, for instance, never reuse data due to the sequential nature of their data access, resulting in poor buffer cache hit rates. DB2 UDB manages data caching using bufferpools, so it does not need the file system caching capability. The use of a file system buffer cache results in undesirable overhead in such cases, since data is first moved from the disk to the file system buffer cache and from there to the application buffer. This double-copying of data results in additional CPU consumption. Also, the duplication of application data within the file system buffer cache increases the amount of memory used for the same data, making less memory available for the application and resulting in additional system overhead due to memory management.

    For applications that wish to bypass the buffering of memory within the file system cache,

    Direct I/O is provided as an option on AIX (in both JFS and JFS2). When Direct I/O is

    used for a file, data is transferred directly from the disk to the application buffer, without

    the use of the file system buffer cache.

To avoid consistency issues, if multiple processes open the same file and one or more of them did not specify O_DIRECT while others did, the file stays in the normal cached I/O mode. Similarly, if the file is mapped in memory through the shmat() or mmap() system calls, it stays in normal cached mode. Once the last conflicting, non-direct access is eliminated (by using the close(), munmap(), or shmdt() system calls), the file is moved into Direct I/O mode. The change from cached mode to Direct I/O mode can be expensive, because all modified pages in memory have to be flushed to disk at that point.

    2.2.4 Concurrent I/O

Both buffered I/O and Direct I/O have the disadvantage of requiring a per-file write lock, or inode lock. This file system feature helps ensure data integrity and improve fault

    tolerance, but also poses performance bottlenecks for database applications. The details

    of inode locking are described below.

    Inode Locking

    While an application views a file as a contiguous stream of data, this is not actually how a

    file is stored on disk. In reality, a file is stored as a set of (possibly non-contiguous)

    blocks of data on disk. Each file has a data structure associated with it, called an inode.

    The inode contains all the information necessary for a process to access the file, such as

    file ownership, access rights, file size, time of last access or modification, and the

location of the file's data on disk. Since a file's data is spread across disk blocks, the inode contains a table of contents to help locate this data. It is important to note the

    distinction between changing the contents of an inode and changing the contents of a file.


The contents of a file change only on a write operation. The contents of an inode change when the contents of the corresponding file change, or when its owner, permissions, or any of the other information that is maintained as part of the inode changes. Thus, changing the contents of a file automatically implies a change to the inode, whereas a change to the inode does not imply that the contents of the file have changed. Since multiple threads may attempt to change the contents of an inode simultaneously, this could result in an inconsistent state of the inode. In order to avoid such race conditions, the inode is protected by a lock, called the inode lock. This lock is used for any access that could result in a change to the contents of the inode, preventing other processes from accessing the inode while it is in a possibly inconsistent state.

    The inode lock imposes write serialization at the file level. Serializing write accesses

    ensures that data inconsistencies due to overlapping writes do not occur. Serializing

reads with respect to writes ensures that the application does not read stale data. Sophisticated database applications implement their own data serialization, usually at a finer level of granularity than the file. Such applications implement serialization mechanisms at the application level to ensure that data inconsistencies do not occur, and that stale data is not read. Consequently, they do not need the file system to implement

    this serialization for them. The inode lock actually hinders performance in such cases, by

    unnecessarily serializing non-competing data accesses.

For such applications, AIX 5.2 ML01 and later versions offer the Concurrent I/O (CIO) option. Under Concurrent I/O, multiple threads can simultaneously perform reads and writes on a shared file. This option is intended primarily for relational database applications, most of which will operate under Concurrent I/O without any modification. Applications that do not enforce serialization for accesses to shared files should not use Concurrent I/O, as this could result in data corruption due to competing accesses.

Under Concurrent I/O, the inode lock is acquired in read-shared mode for both read and write accesses. However, in situations where the contents of the inode may change for reasons other than a change to the contents of the file (writes), the inode lock is acquired in write-exclusive mode. One such situation occurs when a file is extended or truncated. Extending a file may require allocation of new disk blocks for the file, and consequently requires an update to the table of contents of the corresponding inode. In this case, the read-shared inode lock is upgraded to write-exclusive mode for the duration of the extend operation. Similarly, when a file is truncated, allocated disk blocks might be freed and the inode's table of contents needs to be updated. Upon completion of the extend or truncate operation, the inode lock reverts to read-shared mode. This is a very powerful feature, since it allows files using Concurrent I/O to grow or shrink in a manner that is transparent to the application, without having to close or reopen files after a resize.

Another situation that results in the inode lock being acquired in write-exclusive mode occurs when an I/O request on the file violates the alignment or length restrictions of Direct I/O. Alignment violations result in normal cached I/O being used for the file, and the inode lock reverts to the normal read-shared/write-exclusive mode of operation.


    2.3 I/O modes available with DMS on Raw Device

    2.3.1 Raw I/O

Raw I/O bypasses the caching and locking mechanisms that are normally associated with file systems. Raw devices are physical devices that offer a direct route to the device and more control over the timing of I/O to that device. The drawback of raw I/O is that managing raw devices is considered more difficult than managing file systems.

3 Enabling Direct I/O or Concurrent I/O in DB2 UDB V8.2

As mentioned in previous sections, some of the I/O modes are mutually exclusive. For example, a file opened in CIO mode cannot be memory mapped without first closing it. In the Version 8.2 release, the DB2 UDB engine turns off the DB2_MMAP_READ and DB2_MMAP_WRITE registry variables automatically if CIO mode is detected.

    3.1 Enablement at Tablespace Level

New keywords, NO FILE SYSTEM CACHING and FILE SYSTEM CACHING, have been introduced to the CREATE and ALTER TABLESPACE SQL statements to allow users to specify whether DIO/CIO is to be used for each tablespace.

When NO FILE SYSTEM CACHING is in effect for a tablespace, DB2 UDB always attempts to use CIO wherever possible. In cases where CIO is not supported (for example, if JFS is used), DIO will be used instead. Examples of the new SQL syntax are shown below:

Example 1: CREATE TABLESPACE ...

By default, this new tablespace uses buffered I/O; the "FILE SYSTEM CACHING" keyword is implied.

Example 2: CREATE TABLESPACE ... NO FILE SYSTEM CACHING

The new keywords "NO FILE SYSTEM CACHING" indicate that caching at the file system level will be OFF for this particular tablespace.

Example 3: ALTER TABLESPACE ... NO FILE SYSTEM CACHING


This method of enabling DIO/CIO gives users control of the I/O mode at the tablespace level. All containers within the same tablespace share the same caching mode. This is the recommended method of enabling DIO/CIO with the Version 8.2 release of DB2 UDB. It can be used for both SMS and DMS tablespaces, except for the following:

SMS large files (LF)
SMS large object files (LOB)
SMS/DMS temporary tablespaces

The NO FILE SYSTEM CACHING keywords are ignored for the above cases.
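As an illustration, a complete statement for a DMS file tablespace with file system caching disabled might look like the following sketch, issued from the DB2 command line; the tablespace name, container path, and size (in pages) are hypothetical and would be adapted to the actual environment:

    db2 "CREATE TABLESPACE TS_DATA1
           MANAGED BY DATABASE
           USING (FILE '/db2fs1/containers/ts_data1_c0' 256000)
           NO FILE SYSTEM CACHING"

An existing tablespace can be switched with ALTER TABLESPACE ... NO FILE SYSTEM CACHING; the new caching mode generally takes effect only once the containers are reopened, for example after the database is deactivated and reactivated.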

    3.2 Enablement at File System Level

    An alternate method for enabling DIO/CIO is to use special mount options when

    mounting the JFS or JFS2 file systems. Table 1 shows the syntax for it:

File System Type           Mount command
Direct I/O (JFS)           mount -o dio
Concurrent I/O (JFS2)      mount -o cio

Table 1: Direct I/O and Concurrent I/O mount commands

When a file system is mounted with the -o dio or -o cio option, all files in the file system use DIO/CIO by default. DIO/CIO can be restricted to a subset of the files in a file system by placing the files that require DIO/CIO in a separate subdirectory and using namefs to mount this subdirectory over the file system. For example, if a file system somefs contains some files that should use DIO/CIO and others that should not, we can create a subdirectory, subsomefs, in which we place all the files that require DIO/CIO. We can mount somefs without specifying -o dio, and then mount subsomefs as a namefs file system with the -o dio option using the command:

mount -v namefs -o dio /somefs/subsomefs /someotherfs

This method differs from the previous one in that it can be done outside of DB2 UDB; if all the containers reside on the same file system, it is a much simpler method that does not require changing any of the existing tablespace creation scripts.
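For instance, a JFS2 file system dedicated to DB2 containers (the mount point below is hypothetical) can be remounted with the CIO option; adding cio to the options line of the corresponding /etc/filesystems stanza typically keeps the setting across reboots:

    umount /db2fs1
    mount -o cio /db2fs1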

    4 Recommended OS Maintenance Levels & Fixes

    The following table shows the recommended AIX maintenance levels and specific fixes

    for DIO/CIO support in the DB2 UDB V8.2 release.


Platform     File System Type   Recommended Fixes                               I/O Mechanism
AIX 4.3.3+   JFS                None                                            Direct I/O
AIX 5.1+     JFS, JFS2          None                                            Direct I/O
AIX 5.2+     JFS                None                                            Direct I/O
AIX 5.2+     JFS2               Maintenance Level 3 and additional AIX APARs*   Concurrent I/O**

Table 2: Recommended levels on AIX to support DIO/CIO

* Please refer to the "Known issues for DB2 Universal Database on AIX 5.2" web page for the list of specific APARs required:

http://www-1.ibm.com/support/docview.wss?rs=71&context=SSEPGG&q1=IY49385&uid=swg21165448&loc=en_US&cs=utf-8&lang=en+en

** DB2 uses CIO whenever possible. There is no support for using DIO on levels where CIO is supported.
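As a quick check that a given system meets these levels, the standard AIX commands below can be used; IY49385 is the APAR named in the URL above, and the complete APAR list should be taken from that web page:

    oslevel -r            # report the installed maintenance level, for example 5200-03
    instfix -ik IY49385   # report whether the specified APAR is installed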

    5 Performance Tests

As previously mentioned in section 2.2.2, MMAP I/O is the default I/O mode used by DB2 UDB on AIX. MMAP I/O still caches data at the file system level but avoids the inode contention problem. It has proved to be superior to standard buffered I/O. Hence, no performance tests were done with buffered I/O alone. Any mention of buffered I/O used by DB2 UDB in this section implies MMAP I/O.

5.1 DB2 Performance Considerations using Direct I/O and Concurrent I/O

Since Concurrent I/O implicitly invokes Direct I/O, all the performance considerations for Direct I/O hold for Concurrent I/O as well. Thus, applications that benefit from file system read-ahead, or that have a high file system buffer cache hit rate, would probably see their performance deteriorate with Concurrent I/O, just as they would with Direct I/O. On AIX 5.2.10, Concurrent I/O is the preferred method, since it is not subject to the buffer size and alignment limitations that limit Direct I/O's effectiveness on JFS and pre-5.2.10 JFS2 file systems.

    The benefits of Direct I/O and Concurrent I/O with DB2 are evident in:


    Disk throughput

    System CPU utilization

    Memory usage

With Direct I/O and Concurrent I/O, overall performance improvements are largely a result of freeing up system CPU cycles for use by the application. On I/O-bound systems, the performance gain is less dramatic, since the application is waiting on data from the disks to continue processing and any extra CPU cycles are wasted.

    5.1.1 Disk Throughput

Higher sequential read throughput depends, in part, on the read transfer size and, consequently, the disk transaction rate. On AIX, with file system buffered I/O, maximum sequential read throughput on JFS/JFS2 is limited by the largest transfer size the sequential read-ahead algorithm will use. The virtual memory manager (VMM) detects when sequential read access to a file is occurring. As the sequential file access continues, the read-ahead algorithm prefetches up to maxpgahead (j2_maxPageReadAhead on JFS2) pages. If the transfer size requested by the application is larger than the JFS/JFS2 maximum read-ahead, then there will be no extra benefit from file system read-ahead.

For example, with the default maxpgahead (j2_maxPageReadAhead) value of 8 and the default page size of 4096 bytes, the largest read-ahead transfer size once the algorithm ramps up will be 32 KB. To achieve better throughput for sequential reads on buffered file systems, the maximum read-ahead page count should be set to a value larger than the default.
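On AIX 5.2, these read-ahead maximums are I/O tunables that can typically be raised with the ioo command (earlier releases use vmtune); the values below are only illustrative and mirror the setting of 128 used in the JFS2 tests described later:

    ioo -o j2_maxPageReadAhead=128   # JFS2 maximum read-ahead, in pages
    ioo -o maxpgahead=128            # JFS maximum read-ahead, in pages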

Read throughput tests comparing reads from raw logical volumes, Concurrent I/O mounted JFS2, and buffered JFS2 with j2_maxPageReadAhead settings of 8 and 128 are shown in Figure 4. The results showed that, for transfer sizes larger than 256 KB, throughput was roughly the same for raw logical volumes, Concurrent I/O, and buffered I/O with a j2_maxPageReadAhead setting of 128.


Figure 4: Sequential read throughput (MB/s) on FAStT storage at transfer sizes from 1 KB to 8 MB, comparing RAW, buffered JFS2 with j2_maxPageReadAhead of 8 and of 128, and JFS2 CIO.

Each read/write application call with Direct I/O and Concurrent I/O results in a disk access. When smaller read transfer sizes are used, a higher transaction rate results, and as the read transfer size requested by the application increases, the throughput with Concurrent I/O increases to the maximum throughput limit of the storage system. The throughput of raw LVs and Concurrent I/O mounted JFS2 is roughly equal across transfer sizes, since both bypass the file system cache. For raw I/O, requests are handled at the LVM layer, bypassing JFS/JFS2 entirely, while Concurrent I/O mounted JFS2 achieves the same throughput characteristics by bypassing the file system cache. On JFS/JFS2 using buffered I/O, maximum sequential read throughput is affected by the maximum sequential prefetch size. With file system buffered I/O, read increments are determined by the sequential read-ahead algorithm regardless of the transfer size requested by the application. Test results show that sequential read throughput using a j2_maxPageReadAhead value of 128 can achieve sustained throughput comparable to that of RAW logical volumes.

In the test environments, DB2 prefetchers were configured to perform large reads (128 KB to 256 KB) in order to achieve optimal read throughput. For the buffered JFS2 tests, a value of 128 was used for j2_maxPageReadAhead.

    5.1.2 CPU Utilization

With file system caching, the VMM performs CPU-intensive, memory-to-memory moves from the file system cache to the application I/O buffer. In addition, the sync daemon periodically checks the file system cache for dirty pages and writes them out to


disk. These services incur overhead, and additional system CPU cycles are required. However, some applications see no benefit from maintaining a file system cache, since the data is not accessed again before it is swapped out. In the case of most large database applications, file caching is handled by the application. Applications such as these, which see little or no benefit from file system caching, will see significant gains in performance as a result of the extra CPU resources made available by bypassing the file system cache entirely. This is most relevant for CPU-constricted environments.

When testing a CPU-bound query, SMS file containers with Concurrent I/O and RAW logical volume containers showed the same level of system CPU utilization. However, SMS containers using buffered I/O used an average of 6% more system CPU. The performance gain achieved as a result of the extra cycles yielded a 14% gain in query run time (see Figure 5).

    5.1.3 Memory Utilization

As with CPU utilization, file system caching requires additional memory resources, because memory frames are needed for use as file cache. On systems where memory is overcommitted, the benefit of file system caching may not be worth the memory resources needed for it. For example, in most large database applications, the database software does its own I/O caching. If the file system cache hit ratio is very low, the benefit of the file system cache is negligible. In these situations, memory can often be better utilized by the database application for sorting, prefetching, and so on.

The virtual memory manager (VMM) can be configured to use only a percentage of real memory page frames for file pages. The vmo parameters (vmtune on AIX 5.1 and earlier) maxperm% and maxclient% (for JFS2) have default values of 80. With this default setting, up to 80% of real memory frames can be used as file pages before the VMM page-stealing algorithm steals only file frames. Normally, the VMM steals only file pages, but it will steal computational pages if their repage rate is lower. For the test workload, where the majority of system memory is used by DB2 for the bufferpools and/or sortheap, this parameter was set to 20% in order to minimize the number of computational pages swapped out in favor of file pages and to avoid repaging. The advantage of using RAW logical volumes or Direct/Concurrent I/O is that no memory frames are used for file pages, since the file system cache is not being used.

Figure 5: System CPU utilization (%) over time for the CPU-constricted query, comparing RAW, JFS2 buffered, and JFS2 CIO containers.
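A sketch of the maxperm%/maxclient% tuning described above, mirroring the 20% setting used for the test workload (not a general recommendation; related tunables such as minperm% and strict_maxclient may also need attention on a given system):

    vmo -o maxclient%=20
    vmo -o maxperm%=20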

The results were obtained from a DSS-type workload in which large table scans (sequential reads from tables) were common in the tested queries. The characteristics of this workload are such that the cache (at the application or OS level) is largely used for sequential read-ahead. The test results show that this type of workload sees large benefits from using Concurrent I/O, because the JFS/JFS2 sequential read-ahead is redundant: DB2 performs its own prefetching.

    5.2 Test System Configuration

Tests were run using three different system configurations. For Concurrent I/O, the system described in Table 3 was used to gather results for I/O-constricted and CPU-constricted environments. These tests were run on the same physical system. In order to simulate a CPU-bound workload, the number of active CPUs was reduced by using AIX resource sets (rsets) to decrease the available CPU processing capacity.

In order to test Direct I/O, results were gathered from the system configuration described in Table 4. This system was configured to meet the requirements for Direct I/O. A smaller database using containers on non-large-file-enabled JFS was used. With non-large-file-enabled file systems, the JFS Direct I/O buffer size and alignment requirements do not cause a problem for DB2. Additionally, to circumvent the non-coalescing vectored read/write issue that affects Direct I/O on JFS, block-based bufferpools were utilized. (Refer to Appendix A for details on Direct I/O requirements and limitations.)

For both RAW logical volumes and file systems, three data tablespaces and one temporary tablespace were used. The DSS workload consisted of 8 tables, one of which accounted for ~70% of the data. For fair comparison between the raw logical volume and file system tests, AIX logical volumes were configured identically. For the tests on file system containers, an equal number of JFS2 file systems were created on the same physical disks as the raw logical volumes used in the raw tests.

Test Configuration #1: Concurrent I/O

System
  Machine: IBM p690
  Processor: 8 x PowerPC POWER4, 1300 MHz
  Memory: 32 GB
  Cache: 1440 KB L2 cache
  OS: AIX 5L 5.2 ML02, 64-bit
Storage
  FAStT900: 80 disks, 16 x 4+p RAID5 arrays
  ESS Shark
DB2 Configuration
  Version: DB2 UDB V8.2 EEE
  Workload: OLAP, ~100 GB, 22 queries
  Partitions: 8 logical nodes
DB2 Tablespace Configuration


Tablespace  Storage    Arrays  Page Size  Size    Nodes
DATA1       FAStT900   16      16 KB      100 GB  0-7
DATA2       ESS Shark  16      16 KB      48 GB   0-7
DATA3       ESS Shark  1       4 KB       < 1 GB  0
TEMP        ESS Shark  16      4 KB       200 GB  0-7

Table 3: System configuration for Concurrent I/O results

Test Configuration #2: Direct I/O

System
  Machine:
  Processor: 12 x PowerPC POWER4, 262 MHz
  Memory: 16 GB
  Cache: 1440 KB L2 cache
  OS: AIX 5L 5.2 ML02, 64-bit
Storage
  SSA: 24 x 18 GB SSA160 disks
DB2 Configuration
  Version: DB2 UDB V8.2 EEE
  Workload: OLAP, ~30 GB, 22 queries
  Partitions: 12 logical nodes
DB2 Tablespace Configuration

Tablespace  Storage  Arrays  Page Size  Size    Nodes
DATA1       SSA160   12      16 KB      30 GB   0-11
DATA2       SSA160   12      16 KB      15 GB   0-11
DATA3       SSA160   1       4 KB       < 1 GB  0
TEMP        SSA160   12      4 KB       60 GB   0-11

Table 4: System configuration for Direct I/O results

    5.3 Performance Results

The performance of DMS RAW tablespaces was compared with that of SMS file tablespaces with and without Concurrent I/O, and with DMS file tablespaces with and without Concurrent I/O. Results are reported in terms of the tablespace type used by each setup.

Tablespace Type           Tablespace DDL
DMS RAW                   CREATE TABLESPACE ... MANAGED BY DATABASE USING DEVICE ...
DMS file CIO              CREATE TABLESPACE ... MANAGED BY DATABASE USING FILE ... NO FILE SYSTEM CACHING
SMS file CIO              CREATE TABLESPACE ... MANAGED BY SYSTEM USING ... NO FILE SYSTEM CACHING
DMS file default (MMAP)   CREATE TABLESPACE ... MANAGED BY DATABASE USING FILE ... FILE SYSTEM CACHING
SMS file default (MMAP)   CREATE TABLESPACE ... MANAGED BY SYSTEM USING ... FILE SYSTEM CACHING

Table 5: DB2 tablespace creation statements for tested container configurations

Each test result shows the relative performance of each tablespace configuration. The DMS RAW time is used as a baseline, and the times of the other configurations are compared to it. For example, Figure 6 shows that the SMS file MMAP result is 96% of the DMS RAW time, so the overall run time using SMS file containers without Concurrent I/O was 4% slower than DMS RAW.

For each of the tests, vmstat data was collected at 30-second intervals to profile the CPU utilization of the system throughout each run. vmstat breaks down CPU utilization into four categories: %system, %user, %idle, and %iowait. The aggregate %user and %system CPU utilization of each entire run is reported for each of the tests. The difference in %system is a good indication of the extra overhead in system CPU cycles consumed by file system buffered I/O, as well as an indication of the performance gains that are possible using Concurrent I/O.
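For reference, samples of this kind can be collected with a command of the following form; the sample count and output file name are illustrative:

    vmstat 30 120 > vmstat_cio_run.out &   # one sample every 30 seconds, 120 samples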

5.3.1 Concurrent I/O: I/O Bound Results

The characteristics of the p690 system used for these tests were such that the sustained throughput from the FAStT900 and ESS storage systems plus connecting hardware was not high enough to deliver data from the disks fast enough to keep all eight processors busy. CPU usage statistics collected with vmstat indicated that 30% to 60% of CPU time (depending on container configuration) was spent idle or in I/O wait. Although this test was run on an I/O-bound system, runtime comparisons indicate that query performance of DB2 using DMS file or SMS containers on JFS2 with Concurrent I/O does provide an improvement over DMS or SMS file MMAP I/O and matches the performance seen using DMS raw containers. SMS file containers using default cached JFS2 resulted in the slowest performance, at 4% slower than DMS raw.


Figure 6: I/O-bound workload - relative performance of DMS RAW, DMS file CIO, SMS file CIO, DMS file, and SMS file containers (DMS RAW = 100% baseline).

Figure 7: I/O-bound workload - aggregate %user and %system CPU usage for the same container configurations.

Overall CPU utilization data collected throughout the 22-query run indicates that significantly less system CPU was used with raw logical volumes and JFS2 Concurrent I/O containers. Concurrent I/O on both SMS and DMS file tablespaces shows a 2-3% increase in system CPU over DMS raw tablespaces, while tablespaces using JFS2 with buffered I/O show system CPU utilization over 25% higher than raw. Clearly, the use of Concurrent I/O to bypass the file system cache frees up significant processing resources. Figure 7 indicates that while roughly the same %user CPU was used for each container setup, RAW containers and containers using Concurrent I/O used approximately 25% fewer system CPU cycles. Since the majority of the queries were I/O constricted, the extra CPU cycles freed up by using Concurrent I/O did not result in significant performance gains.

In this I/O-bound scenario, the performance of the queries was almost entirely dependent upon the throughput that the storage systems and connected adapters could provide. Any extra processing power was wasted, and therefore the surplus CPU resources provided by bypassing the file system cache did not translate into faster runtimes. Regardless, this highlights the potential gain that could be seen with a similar workload run on a CPU-constricted system.

5.3.2 Concurrent I/O: CPU Bound Results

The equivalent DSS workload was run on the same p690 system, but to simulate a CPU-constricted environment, rsets were used so that only four of the eight processors were utilized.

In this set of runs, the system ran at 50% idle throughout, since only four of the eight


processors were utilized. This simulation demonstrates the actual performance gains that are seen as a direct result of freeing up CPU resources by lowering the %system CPU requirements. The CPU usage in Figure 10 has been adjusted to reflect the fact that only half of the system's processing potential was available.

With a CPU-bound run, the processing power of the system is the major performance-determining factor. On average, I/O throughput from the storage systems is somewhat less than that of the I/O-bound runs, since the throughput is limited by the speed at which the processors are able to process data, as opposed to the throughput the storage hardware allows. Comparing the I/O throughput of DMS raw on CPU-bound and I/O-bound test runs, throughput of the I/O-bound runs averages 20% higher than that of the CPU-bound run (Figure 8).

The CPU-bound test results show the most dramatic performance improvements of CIO over buffered file systems. Here, the additional system CPU cycles freed by bypassing the file system cache allow more cycles for %user and other %system processing and hence result in faster runtimes. The most significant improvement is seen comparing SMS file without CIO to SMS file CIO, where overall runtime performance improves by approximately 24%. Runtimes of both SMS and DMS file using Concurrent I/O show performance comparable to DMS RAW (Figure 9).

Figure 8: Read throughput (KBps) over time for DMS RAW containers, comparing the I/O-bound and CPU-bound runs.


Figure 9: CPU-bound workload - relative performance of DMS RAW, DMS file CIO, SMS file CIO, DMS file, and SMS file containers (DMS RAW = 100% baseline).

Figure 10: CPU-bound workload - aggregate %user and %system CPU usage for the same container configurations.

Similar to the I/O-bound tests, the CPU-bound system utilization shown in Figure 10 indicates that the %system CPU utilization for the buffered I/O file system containers was over 25% higher than that of the CIO runs. Also, as with the I/O-bound results, the %system usage for SMS and DMS file CIO was 2-3% higher than DMS RAW. The main difference appears in the %user comparisons, where an extra 5-7% CPU utilization is seen using CIO. Compare this to the I/O-bound results, where the %user for all container configurations was roughly equal. The performance gains of the CIO runs highlight the impact of removing the processor overhead of the file system cache so that more resources can be used by DB2.


    5.3.3 Direct I/O Results on JFS File Systems

The final test results were collected on a system configured to meet the requirements for Direct I/O. For the Direct I/O results, DMS RAW performance was compared with that of SMS file containers on JFS with and without Direct I/O. As mentioned previously, in order to successfully use Direct I/O, DB2 containers on non-large-file-enabled JFS were required, as well as the use of block-based buffer pools. The test system running the Direct I/O tests was I/O constricted.

Results of the Direct I/O tests show that SMS file DIO does provide a performance gain over SMS file buffered, but the extent of the improvement is not as significant as what was seen in the Concurrent I/O test results. In these tests, SMS file using Direct I/O improved performance by approximately 5% (Figure 11).

    6 Conclusion

Both Concurrent I/O and Direct I/O show noticeable performance improvements over their buffered counterparts. For tablespace containers on JFS, Direct I/O will provide a performance improvement over buffered I/O, but because of the limitations that restrict the use of Direct I/O, Concurrent I/O on JFS2 is recommended. If Direct I/O is used on a configuration where the restrictions are not met, performance will suffer very large degradations.

Concurrent I/O provides significant performance gains due, in large part, to lower system CPU utilization. By bypassing the file system cache, processing resources that would otherwise be used to perform memory-to-memory copies and to write out dirty pages are made available to the database. Since DB2 performs file caching on its own, any benefit of maintaining a file system cache is eclipsed by the performance that can be gained when DB2 uses those resources itself. The performance of DMS RAW containers has long shown this to be true. With Concurrent I/O, the manageability benefits of SMS or DMS file tablespaces can be realized without sacrificing performance.

Figure 11: Direct I/O - relative performance of DMS RAW, SMS file DIO, and SMS file containers (DMS RAW = 100% baseline).


    7 Appendix

    7.1 Appendix A: Direct I/O Limitations

In order to effectively use DB2 with Direct I/O on the Journaled File System (JFS), two separate issues must be addressed:

Coalescing of vectored I/Os
Direct I/O buffer size and alignment

    7.1.1 Non-coalesced Vectored I/O results in poor throughput

First, there is a known AIX problem involving vectored I/O on JFS Direct I/O, where a vectored read/write call (for example, readv) is not coalesced. For example, when a readv() call of size 256 KB is dispatched, the call will result in sixteen 16 KB transfers instead of one 256 KB transfer. Clearly, this behavior has a very large impact on read performance due, in part, to the higher disk transaction rate and the smaller number of kilobytes per read transaction. This problem also applies to Concurrent I/O on JFS2 but has been addressed there in AIX APAR IY49346. By default, if an OS supports vectored access, DB2 prefetchers will use it. As of Version 8, DB2 provides block-based bufferpools as an enhancement to prefetching. With a block-based bufferpool, multiple pages are prefetched in a single I/O, reducing system CPU overhead and improving prefetch performance. On AIX, instead of using vectored I/O to prefetch sequential data, block read calls use pread(). The use of block-based bufferpools with tablespace containers using JFS and Direct I/O will limit the negative effects of the non-coalescing read/write issue.
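As a sketch, a block-based buffer pool and an SMS tablespace on JFS that uses it could be defined as follows; the names, sizes (in 4 KB pages), and block parameters are hypothetical, and the BLOCKSIZE would normally be chosen to match the tablespace EXTENTSIZE:

    db2 "CREATE BUFFERPOOL BP_BLOCK SIZE 50000 NUMBLOCKPAGES 25000 BLOCKSIZE 16"
    db2 "CREATE TABLESPACE TS_DIO MANAGED BY SYSTEM USING ('/db2jfs1/ts_dio') EXTENTSIZE 16 PREFETCHSIZE 64 BUFFERPOOL BP_BLOCK NO FILE SYSTEM CACHING"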

    7.1.2 Direct I/O buffer requirements

The second issue applies specifically to large-file-enabled JFS. Direct I/O on JFS has requirements for buffer size, offset, and alignment that differ between large-file-enabled and regular file systems. With large-file-enabled JFS, the minimum buffer size requirement is very high (128 KB). While DB2 prefetchers can be configured to ensure that prefetch read calls are at least this minimum size, any non-sequential prefetch that reads a single page will not meet this requirement, since the maximum page size available for a tablespace is 32 KB. A read or write that does not meet the Direct I/O requirements will not fail, but will fall back to file system buffered I/O; after the data is transferred to the application buffer, the cached copy is discarded. This also incurs a heavy penalty, as any cached file data must be flushed out to disk. Standalone read throughput test results shown in Figure 12 illustrate how read performance degrades when this occurs.


Figure 12: Sequential read throughput (MB/s) at various read block sizes (1 KB to 4 MB), comparing JFS Direct I/O and JFS2 Concurrent I/O.

The tests, run on FAStT900 storage, compared the read throughput of Direct I/O and Concurrent I/O mounted file systems using different transfer sizes. Data was read concurrently from sixteen large-file-enabled file systems. With Direct I/O, when reads were attempted using buffer sizes smaller than 128 KB, file access fell back to buffered I/O at 4 KB per transaction, without sequential read-ahead.

Concurrent I/O on JFS2 also has access requirements, but the minimum transfer size and file offset alignment restrictions are the same as those of Direct I/O with non-large-file-enabled JFS and are not problematic for DB2 prefetching. For all the Direct I/O and Concurrent I/O requirements, see Table 7.

File System Format                   Maximum Transfer Size   Minimum Transfer Size   File Offset Alignment
JFS Direct I/O: 4 KB block           2 MB                    4 KB                    4 KB
JFS Direct I/O: large file enabled   2 MB                    128 KB                  4 KB
JFS2 Concurrent I/O                  4 GB                    4 KB                    4 KB

Table 7: Direct I/O and Concurrent I/O access requirements


    8 References

DB2 Certification Guides series of books
AIX 5L Performance Management Guide
DB2 UDB Administration Guide (especially the Performance volume)
DB2 Information Management Library: http://www-3.ibm.com/software/data/pubs/
IBM Redbooks on DB2: http://www.redbooks.ibm.com

    9 Acknowledgement

    Sujatha Kashyap, Bret Olszewski, Richard Hendrickson. Improving Database

    Performance With AIX Concurrent I/O. IBM Corporation 2003.

    10 Glossary

CIO - Concurrent I/O. See section 2.2.4.

DIO - Direct I/O. See section 2.2.3.

DMS - Database Managed Space, a storage model that consists of a limited number of devices or files whose space is managed by DB2 UDB. The database administrator decides which devices and files to use, and DB2 UDB manages the space on those devices and files.

I-node - A data structure associated with each file that contains all the information necessary for a process to access the file, such as file ownership, access rights, file size, time of last access or modification, and the location of the file's data on disk.

MMAP I/O - Memory Mapped I/O. See section 2.2.2.

SMS - System Managed Space, a storage model that consists of many files, representing table objects, stored in the file system space. The user decides on the location of the files, DB2 UDB controls their names, and the file system is responsible for managing them.

Sync daemon - A UNIX process (/usr/sbin/syncd) that runs at a fixed interval (the default is every 60 seconds). Its task is to force a write of dirty (modified) pages in the file buffer cache out to disk.


Copyright IBM Corporation 2004
IBM Canada
8200 Warden Avenue, Markham, ON L6G 1C7, Canada

Printed in United States of America. All Rights Reserved.

IBM, the IBM logo, DB2, DB2 Universal Database, AIX, AIX 5L, and pSeries are trademarks of International Business Machines Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

All performance estimates are provided AS IS and no warranties or guarantees are expressed or implied by IBM. Users of this document should verify the applicable data for their specific environment.