
Oracle I/O: Supply and Demand

Bob Sneed ([email protected])

SMI Performance & Availability Engineering (PAE)

SUPerG @ Amsterdam, October, 2001

Rev. 1.1

Abstract

Storage performance topics can be meaningfully abstracted in terms of the economic notions of "supply" and "demand" factors. The analogy can be extended into characterizing I/O problems as "shortages", and to the notion of addressing shortages with both supply-side and demand-side strategies.

While there is often much that can be done with SQL tuning and schema design to get a given task done with the minimal amount of I/O, this paper focuses primarily on platform-level configuration and tuning factors. A wide variety of such factors are discussed, including disk layouts, filesystem options, and key Oracle instance parameters.

Practical advice is offered on configuration strategies, tuning techniques, and measurement challenges.

Introduction

The field of Economics offers a wealth of opportunities for phrasing metaphors and analogies relevant to Oracle database I/O. While it is easy to get carried away with excessive creativity along these lines, the basic concept of distinguishing supply issues from demand issues can actually be quite useful for explaining many database I/O options and tradeoffs. This frame of reference is no substitute for proven analysis and tuning methodologies, but it is hoped that it will prove useful in the various processes of architecting, configuring, and tuning Oracle to the Solaris Operating Environment (OE) and underlying storage equipment.

While Economics has created a wide variety of modeling strategies and computational techniques, the intention here is to steer away from formalism and merely borrow some economic terminology at an abstract conceptual level. When I/O is involved in a problem at all, it may be possible to improve results with strategies and tactics of either decreasing I/O demand or increasing I/O supply.

After a few introductory metaphors, we will present several key Oracle demand topics, then survey many supply factors in the I/O stack of the Solaris OE, from the bottom up.

1. Economic Metaphors

Much of the language of Economics is easily adaptable to discussion of I/O topics.

− Equilibrium

In Economics, the intersection of supply and demand curves indicates a market equilibrium point. With I/O, statistics observed at the disk driver level represent an empirical equilibrium of sorts, based on the interaction of numerous supply and demand factors. While a great deal of constructive tuning can be accomplished by studying iostat data, these numbers do not really illuminate demand characteristics much at all past the point of observing channel or disk utilization or saturation.

When business throughput is meeting requirements and expectations, equilibrium measurements are interesting for capacity planning and for spotting operational anomalies. However, when business throughput is not satisfactory, equilibrium measurements may not help clearly identify options for improvement.

− Measuring Supply & Demand

In Economics, demand is said to vary with price, and it is usually very difficult to characterize without conducting experiments. In this discussion, the notion of demand is being used absent the notion of economic cost, but the difficulty of characterization is similar to the economic case.

With I/O, the term I/O Operations Per Second (IOPS) is frequently employed to characterize both empirical and theoretical throughput levels. This statistic is often of limited utility, as theoretical and marketing IOPS numbers are often difficult to match in practice, and the circumstances under which they occur are not cited with any consistency.

Oracle STATSPACK reports (the ancestor of STATSPACK prior to Oracle 8 is BSTAT/ESTAT reporting) are most useful for determining actual I/O statistics per file, since their measurements incorporate queuing delays and throttles from the filesystem and volume manager layers, as well as possible read hits from the OS page cache. In contrast, iostat or Trace Normal Form (TNF) data (see prex(1) et al.) will only show what passes to the physical I/O layer of the OS.

Oracle's measurements are per file, while iostat measurements are per device. Untangling the mapping of files to volumes, channels, and devices can be a laborious task. It may add to the confusion that an Oracle PHYSICAL I/O is a LOGICAL I/O to the OS, and thus may reflect cache hits in the OS page cache. Indeed, to the degree that Oracle's I/O statistics do not correlate with OS I/O statistics, the OS page cache provides the obvious explanation.
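
As a practical illustration, the two views can be examined side by side. This is a sketch only; it assumes a standard STATSPACK installation under $ORACLE_HOME, and the sampling interval is arbitrary.

    # Device-level (supply-side) view: extended statistics with
    # descriptive device names, sampled every 10 seconds
    iostat -xn 10

    # Oracle's per-file (demand-side) view comes from a STATSPACK report,
    # typically generated from SQL*Plus with:
    #   @?/rdbms/admin/spreport

Where the per-file read counts reported by Oracle greatly exceed what iostat shows for the underlying devices, the OS page cache is usually absorbing the difference.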

Veritas offers statistics at the Volume Manager level for various objects, using its own means of data capture and presentation. With the Veritas Filesystem's Quick I/O (QIO) option, a means is provided to observe per-file OS page cache hits, once again using a unique mechanism.

There exists no comprehensive instrumentation which spans all layers of the I/O stack. Some interesting statistics, such as range of variance at various levels, are not generally observable. There is certainly room for progress in instrumenting the I/O stack to get a clearer picture of what's going on.

Oracle's wait event statistics are the principal tool for observing what Oracle is waiting for, and how much stands to be gained by improvements in specific areas. Guidance in interpreting these statistics is offered in several texts, including [YAPP Method, 1999] and [ORA_9i_PER]. While these tools and techniques undoubtedly provide the best holistic approach to Oracle I/O optimization, close cooperation between DBAs, System Administrators, and Storage Administrators is required. It is hoped that perspectives offered here will be useful in the furtherance of such collaborations.

− Quotas & Tariffs

The I/O stack contains a wide assortment of throttle points and fixed limits. Knowing where these are is the key to working with them or working around them, just as everyone looks forward to shopping at duty-free stores.

− I/O Surplus

In Economics, surpluses are associated with lower prices and modern phenomena like inventory write-offs. With I/O, it would appear that the main consequence of excess supply is wasted capital.

In practical terms, an excess I/O supply may have tangible advantages, such as allowing certain application optimizations, or providing the comfort of 'headroom' from a configuration management perspective. Certainly, surplus supply is better than a shortage, and properly configured systems should possess some degree of surplus for the sake of headroom alone.

− I/O Shortage

In Economics, the consequences of shortages include higher prices, longer lines, and maybe even famine or despair. With I/O, the simple analogue to higher prices is increased latency due to queuing, and famine and despair might be terms with which a user could relate when system response is especially poor.

The only effect one normally expects from an I/O shortage is degraded response and longer run times. However, worse things are possible. With Oracle, for example, when using Asynchronous I/O (AIO), any request noted to take longer than 10 minutes usually results in a fatal ORA-27062 incident (as of Oracle 8.1.7.2, or with patches to selected prior releases, the "aiowait timed out" message may occur 100 times before the condition becomes fatal). Failure to update a control file can cause a fatal error after 15 minutes.

It is rare that timeouts like these occur purely due to I/O shortages. They are normally expected only in the event of hardware failures. However, it would appear that the vast majority of Oracle timeout events actually occur with no hardware error present, and usually at load levels where one would not predict an I/O could possibly take 10 minutes.

− Supply Chain Failures

The I/O stack is like a supply chain, and failures at lower levels cause a domino effect of problems through the higher layers. Here again, there is room for improvement in instrumenting the I/O stack to simplify isolation of problem areas.

− Death and Taxes

Hardware failures and error recovery can seriously skew performance, or even lead to ORA-27062 timeout incidents. Measures taken to insure against such failures, like RAID data protection or Oracle log file multiplexing, can each take their toll, much like a tax or life insurance premium. Hardware failures and compromises to insure against them are just as certain as death and taxes.

Certain bugs have been associated with ORA-27062 incidents, but perhaps the most common cause arises from AIO to filesystem files, due to an issue recently identified with the AIO library and the timesharing (TS) scheduler class (see priocntl(1) and priocntl(2)) by Sun BugID 4468181. In the interest of minimizing deaths from ORA-27062 incidents, some subsequent points of discussion will include their bearing on ORA-27062 exposure.

2. Oracle I/O Demand

Many design and tuning techniques are aimed at reducing I/O demands. Conventional wisdom says that 80% of the gains possible for an application lie in these areas. This conventional wisdom often leads to neglect of many important factors at the Oracle and platform levels. The discussion offered here largely avoids tuning topics found in many books, such as database schema design, application coding techniques, data indexing, and SQL tuning.

Oracle Demand Characteristics

Different Oracle operations have different I/O demand characteristics, and understanding these is important to formulating useful strategies and tactics for tuning. Some vocabulary is introduced here for use in subsequent discussion of controlling Oracle I/O demand.

− Synchronous Writes

All Oracle writes to data files are data synchronous; this is due to Oracle's use of the O_DSYNC flag in its open(2) operations, and it is inherent with RAW disk access. That is, regardless of filesystem buffering, writes must be fully committed to persistent storage before Oracle considers them complete. This synchronous completion criterion is quite distinct and separate from whether or not writes are managed as AIO requests.

− Latency, Bandwidth & IOPS

These three common and closely related measures of I/O performance each have their place.

The most latency-sensitive I/O from an Oracle transaction perspective is physical I/O on the critical path to transaction completion. This includes all necessary read operations and writes to rollback and log areas. In contrast, checkpoint writes are not latency sensitive per se, but merely need to keep pace with the aggregate rate of Oracle demand.


Decision Support Systems (DSS) and Data Warehouse (DW) loads typically depend a great deal on pure bandwidth maximization. Also, operations like large sorts can depend heavily on effective bandwidth exploitation. In contrast, Online Transaction Processing (OLTP) workload performance frequently hinges more on random IOPS.

− Aggregate Demand

Different programming and tuning factors can significantly impact the total amount of I/O demanded to execute a given set of business tasks. The sum total of I/O required can significantly impact run times and equipment utilization.

− Temporal Dependency

Some things have to happen before other things can happen. This simple notion of time-ordered dependencies is all that temporal dependency means. For example, log file writes must complete before an INSERT or UPDATE transaction can complete, and before the corresponding checkpoint writes can be issued.

− Temporal Distribution

In this context, the term temporal distribution means "distribution of demand over time". The terms "bursty" and "sustained" are common characterizations of temporal distribution. In some cases, tuning to distribute "bursty" demand over time will result in improved service times and throughput.

As an analogy, consider the arrival of a bus at a fast food store. The average service time for each customer will be seriously impacted by the length of the line. If the same customers strolled in individually over a period of time, they would likely enjoy a better average service time. With demand evenly distributed over time, servers might keep more consistently busy delivering good average service times, rather than idly waiting for the next bus.

− Physical Distribution

The mapping of Oracle data files across available channels and storage can be extremely important to database performance. The diagnosis and correction of "hot spots" or "skew" is a common activity for DBAs, system administrators, and storage administrators.

The art of database layout includes application of principles such as I/O segregation, I/O interleaving, and radial optimization. Layout decisions can be significantly constrained by available hardware and project budgets.

Suboptimal layout can decrease I/O supply by causing disk accesses to compete excessively among themselves; by making poor use of available cache; or by placing bottlenecks in the critical path of temporally dependent operations.

− Localities of Reference

The concept of localities of reference is that certain areas of a larger set of data may be the focus of frequent accesses. For example, certain indexes might be traversed quite frequently, yet represent a small subset of the total data.

Access within good localities of reference stands to gain from optimal cache retention strategies. Cache efficiency will vary depending on how data retention strategies align with actual localities of reference. Opportunities for caching occur in the Oracle SGA, the OS page cache, and external storage caches. Each offers different controls over retention strategies, and these may be configurable to achieve compounded beneficial effects.

Controlling Oracle I/O Demand

We'll skip the obvious topics of proper database design and index utilization, and focus on the areas typically controlled by an Oracle DBA. Topics discussed here represent several of the I/O issues customers commonly wrestle with in Oracle, and a few of the "big knobs" used to control these factors. Some of these topics pertain to the "Top Ten Mistakes Made by Oracle Users" presented in [ORA_9i_TUN].

− The Critical Path

As mentioned earlier, some I/O operations, such as log writes and rollback I/O, are on the critical path to transaction completion. Others, such as log archive writes and checkpoint writes, are not, but must keep pace with sustained demands. This difference should be kept in mind during physical database design.

For example, the most demanding INSERT/UPDATE workloads may require dedicated disks or low-latency cached subsystems to allow the online logs to keep pace with transactional demand.

− Reduce, Reuse, Recycle

This modern ecological mantra succinctly encapsulates a great deal of the philosophy of I/O tuning and cache management.

• Reduce - the best I/O is one that never occurs. Many programming and tuning techniques have this as the underlying strategy.

• Reuse - this includes a broad variety of topics, including pinning objects with DBMS_SHARED_POOL.KEEP, using a separate KEEP pool, and exploiting filesystem buffering options.

• Recycle - this is the common concept underlying priority_paging (prior to OE 8) and use of a separate RECYCLE pool in Oracle (a brief configuration sketch follows this list).
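
A minimal sketch of the Reuse and Recycle controls follows, using Oracle 8i-style parameters; the pool sizes and the package name are illustrative assumptions, not recommendations.

    # init.ora excerpt (sizes illustrative)
    buffer_pool_keep    = (buffers:2000, lru_latches:1)   # retain small, hot objects
    buffer_pool_recycle = (buffers:1000, lru_latches:1)   # limit retention of scan data

    -- SQL*Plus: pin a frequently executed package in the shared pool
    -- EXECUTE DBMS_SHARED_POOL.KEEP('APPUSER.HOT_PACKAGE');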

− System Global Area (SGA) Size

Historically, 32-bit Oracle has allowed SGA sizes up to about 3.5 GB in the Solaris OE. 64-bit Oracle brings the possibility of extremely large SGAs. Indeed, expanded SGA capability is the single distinguishing feature of 64-bit Oracle.

Enlarging db_block_buffers is the principal means of exploiting large system memory. Proper sizing and use of the second-largest component of the SGA, the shared pool, is key to performance for many workloads.

At any rate, SGA size must not exceed a reasonable proportion of system memory in relation to other uses. There are tradeoffs to be made regarding the comparative advantages of using memory for SGA versus OS page cache, especially in consolidation scenarios. Regardless of how it is employed, system memory provides the potential for reducing physical I/O demand by servicing logical reads from memory.

Incidentally, Oracle 9i now allows dynamic online resizing of the SGA in conjunction with the new Dynamic Intimate Shared Memory (DISM) feature of Solaris OE 8.
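
As a concrete illustration of these sizing knobs, the following init.ora sketch uses purely illustrative values and assumes a 64-bit release where a multi-gigabyte SGA is even possible; the commented 9i lines further assume that release's dynamic SGA parameters.

    # Oracle 8i style: buffer cache sized in blocks
    db_block_size    = 8192
    db_block_buffers = 262144        # 262144 x 8 KB = 2 GB of buffer cache
    shared_pool_size = 300000000     # the second-largest SGA component

    # Oracle 9i alternative: dynamic SGA with an upper bound, which on
    # Solaris OE 8 can be backed by Dynamic Intimate Shared Memory (DISM)
    # sga_max_size  = 4G
    # db_cache_size = 2G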

− Concurrency

I/O concurrency with Oracle can arise from several sources, including:

• The client population.
• Use of PARALLEL features.
• Use of log writer or DB writer I/O slaves.
• Use of AIO.
• Multiple DB writer processes.

Of these, the most common factors impacting demand for write concurrency are AIO and the use of multiple DB writers. Simply put, AIO allows greater I/O demand creation by a smaller number of Oracle processes with the least overhead. Alternate schemes of achieving concurrency, such as using Oracle I/O slaves, are in a different league from the I/O demand facilitated by AIO.

Prior to Oracle 8, AIO and multiple DB writers were mutually exclusive. With earlier versions of Oracle 8, AIO was not used. There have been various issues with AIO over the years which have led many users to disable AIO, often without firm reasoning. Over time, many users have increased DB writers to the maximum setting of 10 to achieve write concurrency without AIO. Some problems have arisen from users re-enabling AIO with maximal DB writer settings already in place.

In theory, each DB writer can issue up to 4096 I/O requests. In practice, we often see DB writer processes queuing about 200 requests each. In either case, these are large numbers relative to the ability of most systems to concurrently service requests. Pushing an I/O backlog from Oracle to the OS increases queuing and correlates with increased rates of ORA-27062 incidents.

Therefore, in short, when using AIO, db_writer_processes should not be increased beyond one (1) without demonstrable business benefit, and tuning upwards from one should be incremental. If AIO is switched off for any reason, increasing DB writers is usually required to attain adequate write concurrency.
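
A hedged init.ora sketch of this advice follows; the parameter names are standard Oracle 8i parameters, and the fallback values are illustrative only.

    # With AIO in use, start with a single DB writer
    disk_asynch_io      = TRUE
    db_writer_processes = 1

    # If AIO must be disabled for some reason, write concurrency is usually
    # recovered by adding DB writers (or I/O slaves) incrementally, e.g.:
    # disk_asynch_io      = FALSE
    # db_writer_processes = 4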

− Checkpoints & Log File Sizing

In the event of an abnormal shutdown, recovery time will vary in proportion to the number of dirty database blocks in the SGA. The longer a data block is retained in the SGA prior to checkpointing, the more chance it has of absorbing INSERT or UPDATE data, and the greater the efficiency of the checkpoint write when it finally occurs. Deferral of checkpoint writes using large logs leads to bursty checkpoint activity, which in turn makes log archiving a relatively more bursty process, and further implies that log shipping strategies will be very bursty.

Thus, both the temporal distribution of checkpoint writes and aggregate writes over time depend on the size of the logs and the settings of various parameters which control checkpointing. Tuning these factors represents a tradeoff between performance and recovery time.

This has been an area of significant development in Oracle. Historically, the Oracle parameters log_checkpoint_interval and log_checkpoint_timeout have impacted this balance. Oracle 8i variously offered db_block_max_dirty_target and fast_start_io_target in this arena. Oracle 9i goes a step further in offering fast_start_mttr_target and related parameters to explicitly control the time expected for recovery. Regardless of the method of control, the tradeoffs remain essentially the same.

One can observe checkpoint activity in the Oracle alert file by setting log_checkpoints_to_alert=TRUE, but interpretation of the resulting messages varies between Oracle releases.
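
By way of illustration, the generations of checkpoint controls might appear in init.ora as follows; the values are arbitrary examples, not recommendations.

    log_checkpoints_to_alert = TRUE      # record checkpoint activity in the alert file

    # Oracle 7/8 style:
    # log_checkpoint_interval = 100000   # redo blocks written between checkpoints
    # log_checkpoint_timeout  = 1800     # seconds

    # Oracle 8i style:
    # fast_start_io_target    = 10000    # target number of I/Os needed for crash recovery

    # Oracle 9i style:
    # fast_start_mttr_target  = 300      # expected crash recovery time, in seconds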

− Sorting and Hash Joins

Disk sorts are best avoided or optimized. Hash joins have a great deal in common with sorts, but here we will comment only on sorting.

The obvious and traditional means of reducing disk sorts is to set large values for sort_area_size. Many users are reluctant to do this, however, fearing that aggregate memory demand from clients will be excessive. However, only clients that do large sorts will demand a maximal sort_area_size from the OS, so this prospect should not be feared as much as it seems to be.
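
As a sketch, the sort area can be raised globally in init.ora or, more conservatively, only for the sessions that actually perform heavy sorting; the sizes shown are illustrative.

    # init.ora excerpt
    sort_area_size          = 4194304    # 4 MB per active sort
    sort_area_retained_size = 1048576

    -- Or confine the larger value to a batch session in SQL*Plus:
    -- ALTER SESSION SET sort_area_size = 16777216;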

It is worth noting here that since client shadow processes can write directly into TEMP areas, there is some potential for client-side ORA-27062 incidents arising from intense write contention for these areas. Also, recommendations regarding use of the TEMPORARY tablespace attribute for TEMP areas may vary between Oracle releases.

There are many other parameters pertaining to sort and hash join optimization, and [Alomari, 2000] gives a good discussion of these.

− Large I/O

The Solaris OE default maximum low-level I/O size is 128 KB (controlled by maxphys in /etc/system, which defaults to 131072), but this can easily be tuned upwards to 1 MB in OE 2.6, or as high as 8 MB in OE 8. Regardless of kernel settings, any process can issue large I/O requests, but large requests may be broken up into smaller requests depending upon kernel settings and volume management parameters. Regardless of whether I/O requests are broken up at the OS level, using fewer, larger operations for sustained sequential I/O is typically advantageous due solely to decreased overhead from process context switches.

Factors affecting the supply aspects of large I/O do not fit neatly in the subsequent discussion of I/O supply topics, so we will mention them briefly here. In short, supply-side optimization of I/O requests involves many factors, including disk layout, volume management parameters, and the kernel's maxphys setting. Filesystem factors such as clustering, the UFS maxcontig setting, or data prefetching logic can also have significant impact. The topic of bandwidth optimization is the focus of many I/O whitepapers.
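
A hedged example of the platform-side knobs follows; the values and device name are illustrative, and any such change should be validated against the volume manager and storage documentation in use.

    * /etc/system excerpt: raise the largest physical I/O the kernel will issue
    set maxphys=1048576
    * VxVM volumes have a corresponding limit, commonly raised in step, e.g.:
    * set vxio:vol_maxio=2048     (units are 512-byte sectors, i.e. 1 MB)

    # Shell: raise UFS maxcontig (in filesystem blocks) on an existing filesystem
    tunefs -a 128 /dev/rdsk/c1t0d0s6

Note that the comment character in /etc/system is an asterisk, and changes there take effect only after a reboot.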

In the case of sustained reading, large reads might not always represent an ideal solution. Factors such as pre-fetching at the storage subsystem and filesystem levels, competition for resources, and 'think time' per unit data may result in optimal throughput with intermediate read sizes. In other words, effective pipelining can conceivably reduce 'dead time' that can occur while waiting for large read requests to complete. In general, however, use of large I/O is often key to attaining maximum sequential throughput.

Most I/O demand from Oracle occurs in db_block_size units, and large I/O is used only in limited circumstances. Parameters affecting large I/O have been the subject of much revision between Oracle releases, but some examples include:

• Full table or index scan operations - reads are (db_file_multiblock_read_count * db_block_size) bytes.

• Disk sorts - sort_write_buffer_size bytes in Oracle 7, dynamically sized in Oracle 8i.

• Datafile creation - writes are issued in ccf_io_size bytes in Oracle 7, or db_file_direct_io_count (DB blocks) in Oracle 8.

Some Oracle versions will revert the setting of large I/O parameters to 128 KB defaults without warning. The best way to confirm the actual I/O size being used by an Oracle process is to use the Solaris truss(1) command.
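
For example, one might attach truss to a running shadow or DB writer process (the PID here is illustrative) and watch the size argument of its read and write calls; a full scan should show requests of db_file_multiblock_read_count * db_block_size bytes.

    truss -p 12345 2>&1 | egrep 'read|write'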

Setting large values of Oracle's db_file_multiblock_read_count has the side effect of biasing the optimizer's preference towards using full scans. Because of this, there may be tradeoffs between different queries. One workaround to this is to implement large reads at the session level (i.e., ALTER SESSION) so that the impact is not global.
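
A sketch of the session-level approach, with illustrative values:

    ALTER SESSION SET db_file_multiblock_read_count = 32;
    -- run the full-scan-heavy work here, then restore the instance default
    ALTER SESSION SET db_file_multiblock_read_count = 8;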

Memory-constrained systems may suffer from having excessive amounts of memory locked down due to active large I/O operations. For these reasons, the desirability of exploiting large I/O will vary.

As a guideline, DSS and DW systems will tend to enjoy large reads, while OLTP systems may not. Using large reads in conjunction with wide disk striping is a common formula for optimizing sequential performance. Tuning read size to exploit filesystem and hardware pre-fetch characteristics is another means of improving realized bandwidth.

− Data Checksumming

Oracle allows round-trip data integrity validation by generating checksum data for each data block written, and validating the checksum each time data is read. While the algorithm used is extremely efficient, it does impose some additional latency on every read and write, and therefore has a slight throttling effect on demand. Also, for each read satisfied from the OS page cache, the data must be re-validated, thus creating some feature bias in favor of a larger SGA and unbuffered filesystem I/O.

While this feature was default OFF in earlier Oracle releases, db_block_checksum is TRUE by default in Oracle 9i.

− Backup Schemes

Time spent in HOT BACKUP impacts aggregate demand from log and log archive writes, because entire DB blocks must be logged for tablespaces that are in BACKUP mode.

Data snapshot technologies, such as Sun StorEdge Instant Image, Veritas Volume Replicator (VVR), or Sun StorEdge 9900 ShadowImage, may be directed at reducing this impact.

3. I/O Supply Topics

This section is an annotated survey of various I/O supply factors in the I/O stack, from the bottom up. High-impact topics spring up at all layers of the stack. Much as a chain is only as strong as its weakest link, I/O supply can only be as good as the worst bottleneck will allow.

Supply Characteristics

Some additional terminology is useful for discussing I/O supply factors.

− Latency vs. Bandwidth

Defined simply, latency is the time required for a single operation to occur, while bandwidth characterizes the realized rate of data transfer. Both latency and bandwidth will vary in part with the size of the requested operations.

In latency terms, there are three main approaches to I/O performance issues:

1. Reduce baseline latency, such as by adding cache, improving cache retention strategies, or by a baseline technology upgrade.

2. Improve work per operation, such as by using a larger I/O size.

3. Increase concurrency, such as by adding channels or disks, spreading demand across available resources, or placing more demands upon existing resources.


Opportunities for these manipulations may be presented at various layers of the I/O stack.

− Variance

In the case of a single simple disk, variance in service times is explained primarily by mechanical factors, such as rotational speed and seek time. In the absence of hardware failure, one could say that with such a simple system, the variance is actually bounded by these well-known mechanical factors. In real systems, sources of variance are diverse, and include:

• Process prioritization (both for LWP threads and simple UNIX processes).
• Geometry-based wait queue sorting.
• Cross-platform competition in SANs.
• Cache utilization.
• Bugs in the I/O stack.

Variance is infrequently measured and characterized, but no advanced math or statistics is needed to characterize the most basic measure of variance: range. It is one thing to observe I/O service times ranging from, say, 5 to 500 milliseconds, but quite another to observe a range of 5 milliseconds to 5 minutes!

Concern over variance might be summarized like this:

1. Sources of variance should be identified and scrutinized.

2. Sources of unbounded variance are bad and beg to be eliminated.

3. Sources of bounded variance beg to be purposefully controlled.

I/O statistics are most often cited as averages and sums, and most folks expect variance to be 'reasonable' about the mean. Certainly, ORA-27062 incidents are categorically examples of when this fails to occur.

− Competition

I/O supply to Oracle is net of any competition. In other words, one can say that the supply of I/O to Oracle is diminished by competing I/O demands involving the same resources. Possible sources of competition are numerous, and some are easily overlooked. Consider these:

• Backup and recovery operations
• Volume reconstruction, mirror re-silvering
• Physical data reorganization
• Transient utility operations
• Other hosts in a SAN
• Data services (e.g., volume copies)
• Committal of previously cached write data

Even when these activities occur 'off host', they can compete for head movements and decrease the net I/O supply to Oracle, both in terms of IOPS and latencies.

− Throttles

Anything that extends the code path for an I/O implicitly adds to I/O latency. Compared to other factors, marginal code path costs are relatively minor in practice. Some components in the I/O stack explicitly enforce certain limitations.

Supply Factors − Bottom Up

Each layer in the I/O supply chain can contribute significant complexity. This survey of the I/O stack discusses a limited set of issues at each layer.

− The Disk Spindle

The ultimate limiting factors to sustainable IOPS in a complex system are the number of disks and their operational characteristics. Effective cache utilization at the system and storage subsystem layers may increase the supply of logical IOPS, but when caches and buses are saturated, it is the humble disk spindle that sets the speed limit.

The primary disk characteristics are rotational speed and seek time. Disks vary in many other ways though, including number and type of access paths; quantity of onboard cache and intelligence of onboard cache management; use of tagged commands; and sophistication of error recovery logic. From time to time, significant disk performance issues may be discovered in disk firmware.

− Storage Subsystems

The range of features available from storage subsystems has mushroomed in recent years. The spectrum of subsystems ranges from the basic JBOD array (JBOD means 'Just a Bunch Of Disks', and has come to be common terminology in the industry) to sophisticated RAID subsystems and Storage Area Networks (SANs). Even humble JBOD arrays have developed sophisticated features such as environmental monitoring and fault isolation capabilities.

Intelligent arrays can take complexity off the host and facilitate new data management paradigms, such as off-host backups, contingency copies, and sophisticated disaster recovery strategies, in addition to providing basic RAID features. For this discussion of I/O supply, only a small subset of these features will be covered, including RAID, caching, and data services.

− Storage Subsystems: RAID

When implemented in an integrated subsystem, RAID features are variously called 'hardware RAID', 'box-based RAID', or 'external RAID'.

In reviewing various RAID levels, an essential point to note is that 'striping' across disks is essential to achieving sustainable bandwidth greater than that of a single spindle. RAID 0 is merely striping across disks with no data protection features. RAID 5 and similar schemes also involve striping across disks.

In the case of RAID 5, since parity must be generated and saved to disk, write activity will have the effect of causing additional mechanical activity that may compete with read requests. Sophisticated integrated storage subsystems, like the Sun StorEdge T3 or the Sun StorEdge 9900 series arrays, are designed to minimize this impact by intelligent I/O scheduling.

In the event of a complete disk failure in a RAID 5 set, all disks in the set must be read to reconstruct lost data, and that process can decrease the net supply of IOPS quite seriously. Once a replacement disk is installed, the process of rebuilding a failed spindle involves reading all of the surviving disks completely, and that can produce significant competition with user demand. RAID 5 implementations typically feature some means of tuning the rebuild rate to avoid unacceptable competition with live workloads. Some more sophisticated RAID 5 implementations, such as in the Sun StorEdge 9900 series arrays, may predict the likely failure of a particular disk, and evacuate its data far more efficiently than by RAID 5 reconstruction.

It is sometimes overlooked that in the absence of writes, RAID 5 read performance should be expected to be comparable with RAID 0 performance, since RAID 5 parity data is usually never referenced unless a disk fails (an exception would be 'parity checking' operations implemented in some subsystems).

− Storage Subsystems: Caching

Common to hardware RAID implementations is the use of persistent cache memory. Cache management in external RAID offers the possibility of enhancing I/O supply to attached systems, variously by caching writes or by facilitating pre-fetching or other intelligent data retention schemes.

When external cache is disabled or saturated, subsystem performance as seen by attached hosts will degrade. External cache will typically cease to be used for caching writes if its battery backup becomes dubious, but read caching is typically not impacted when cache persistence becomes questionable. Control over caching policies, such as pre-fetching or random read data retention, varies tremendously between different subsystems, and is often controllable on a per-LUN basis (a LUN, or Logical Unit Number, is the term most often used to designate a virtual disk managed by an intelligent storage subsystem).

One simple fact that is often overlooked in evaluating empirical I/O statistics is that writes which appear very low-latency to a host are not truly complete until they are flushed from the external cache. This flushing activity can compete with new I/O requests for head movement and therefore increase latency, especially for new reads. Correlation of current I/O activity with recent writes may be difficult with host-based tools, but one should keep in mind that "what goes up, must come down".

It is widely considered that RAID 5 performs less well on writes than alternatives such as RAID 1. While this may be true for host-based RAID 5, it is not categorically true for hardware-based RAID solutions. So long as write demand remains at levels that do not saturate the cache, all RAID levels look essentially the same to attached hosts so far as writes are concerned. That is, writes to cache will be very low latency until and unless the cache becomes saturated.

− Storage Subsystems: Data Services

Persistent cache in external arrays can also be leveraged in implementing efficient box-based data services, such as Sun StorEdge 9900 ShadowImage and Sun StorEdge 9900 TrueCopy. As with host-based data services, these can contribute some competition and variance to the I/O stack. Any synchronous data forwarding product can inherit tremendous variance from the Quality of Service (QOS) of the underlying data link.

− Layout Strategies

Whether or not an intelligent storage subsystem is involved, placement of data across available disks can have significant impact on the actual supply of I/O. Regardless of how it comes to be, if two 'hot spots' come to reside at opposite ends of the same disk spindle, performance will suffer. With modern SANs, competition for head movement may even originate from different hosts.

The trend toward higher disk densities and lower IOPS per disk gigabyte has led to substantially different layout strategies from just a few short years ago, when use of many smaller spindles was the norm. Right or wrong, the most popular strategy a few years ago was to segregate I/O by class, whereas now the more popular strategy is far more biased towards interleaving I/O classes.

Oracle has promoted a "Stripe and Mirror Everything" or "SAME" approach, while other writers have presented similar concepts variously as "Wide-Thin Striping" or "Plaid" layout strategies. The common underlying notion is to distribute all (or most) Oracle I/O uniformly across a large spindle population, thereby distributing demand across available spindles and minimizing the likelihood of excessive queuing for any single disk. These strategies are often implemented using a combination of hardware-based RAID and host-based RAID.
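
As a rough sketch of the SAME idea using host-based striping, a wide mirrored stripe might be created with VxVM roughly as follows; the disk group, volume name, column count, and stripe unit are illustrative assumptions, not recommendations.

    # Stripe a mirrored 200 GB volume across eight columns in disk group oradg
    vxassist -g oradg make oradata01 200g layout=mirror-stripe ncol=8 stripeunit=1m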

Another consideration commonly made in layout strategies is the fact that the outer cylinders of disks contain more data (due to variable formatting known as Zone Bit Recording, or 'ZBR'), and therefore provide slightly better performance than the innermost cylinders. Exploitation of this characteristic is sometimes called 'radial optimization'.

In practice, disk layout is something of an art, though its underpinnings are all scientific. It can be helpful, if not essential, to construct drawings of available disks, channels, and their usage to be able to comprehend any given disk layout. Especially when a great deal of the complexity may be housed in an intelligent storage subsystem, effective visualization of the issues may otherwise be impossible.

− Disk Interconnect

The most conspicuous characteristics of a disk interconnect are its signaling rate and its protocol overhead. 100 MB/sec (1 gigabit per second) Fibre Channel Arbitrated Loop (FC/AL) interconnects are now the commonplace building blocks for modern SAN architectures. Speeds of 2 Gbit/sec and beyond are on the horizon, as are alternate protocols such as iSCSI and Infiniband.

Besides the obvious bandwidth limitations imposed by the native signaling rate, there are two aspects of FC/AL that deserve special mention. First, as a loop-based protocol, some scaling issues can arise if too many targets share the same loop. Second, due to the 'arbitrated loop' aspect of FC/AL, its performance can degrade rapidly with distance. For this reason, SAN switches employing more efficient Inter-Switch Link (ISL) protocols may be required to effectively extend FC/AL over distances.

Alternate interconnect paths can variously serve to add resilience to a system or be exploited for performance. Management of alternate paths is implemented in several software products above the interconnect layer. Multiplexed I/O (MPXIO) provides OS-level awareness of multiple paths as an option for Solaris OE 8. Veritas Volume Manager provides a Dynamic Multipathing (DMP) capability. Alternate paths to a Sun StorEdge 9900 series array can be managed with Sun StorEdge 9900 Dynamic Link Manager. Each of these products has its own feature set and capabilities with respect to I/O supply and error handling.

− Interface Drivers

A step above the host bus adaptor (HBA) hardware is a device driver software stack. This stack begins with HBA-specific drivers (e.g., esp, isp, fas, glm, pln, soc, socal), which are integrated using the Device Driver Interface (DDI) framework. Higher-level functions are provided by target drivers like sd (for SCSI devices) and ssd (for FC/AL devices).

Some key quotas and controls for disk I/O are implemented in the sd and ssd drivers. Despite the close relationship between the sd and ssd drivers, their controls are distinct. Generally speaking, variables discussed here for the sd driver are the same in the ssd driver, but are named with ssd as the initial letters.

Error recovery parameters, such as sd_retry_count and sd_io_time, control the time required for the driver stack to return control to higher layers of the I/O stack in the event of hardware failures. By default, these allow some failures to take up to five minutes to be reported to upper layers. Varying these controls can be tricky business, as they also pertain to the resilience of systems during power-up operations. This is a topic that begs for further development.

The maximum number of concurrent requests that can be passed to the HBA driver is controlled by sd_max_throttle. While certain HBA hardware or attached storage may be limited with respect to its capacity for concurrency, this setting has systemwide impact which can have adverse performance implications. By default, this value is 256, which maximally exploits SCSI tagged command queuing capabilities.

Several systemwide SCSI features can be controlled by the bitmask variable scsi_options, including tagged command queuing and control over the maximum transfer rate which can be negotiated for SCSI transfers. Control over specific interfaces and targets may be available at the HBA driver level, such as described in glm(7D). It would appear that the scsi_options parameter is frequently manipulated by third-party storage vendors, perhaps often when it should not be. These controls should be manipulated only with a full understanding of the desired objective and the possibility of unintended consequences.
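
Where a throttle genuinely is needed, it is typically applied in /etc/system; this is a sketch only, the value shown is illustrative, and it applies system-wide to every sd and ssd device.

    * /etc/system excerpt: lower the per-device queue depth only if the
    * attached storage cannot handle the default of 256
    set sd:sd_max_throttle=64
    set ssd:ssd_max_throttle=64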

One aspect of the sd layer that is presently not adjustable is logic that continuously sorts the disk wait queue according to the geometry of the underlying disk. While this is quite useful with direct-attach JBOD disks, the geometry of external disk subsystems presented to the OS is usually purely fictional. Write caching and I/O scheduling in external RAID subsystems make this sd feature of dubious value, and it could conceivably introduce undesirable variance to response times. The ability to control this may appear in a future release of the Solaris OE.

− In−host Write Caching

The Sun StorEdge Fast Write Cache (FWC) (see [Sneed, 2001]) provides an in-host redundant and persistent write cache which can greatly reduce the write latency seen by Oracle. While FWC imposes a throttle on bandwidth and is incompatible with cluster environments, its reduction in latency can be quite beneficial to processes which depend on serially dependent writes on the critical path to transaction completion, such as Oracle's log writer.

Another means of in-host write caching is use of cached RAID interface cards, such as the Sun StorEdge SRC/P Intelligent SCSI RAID Controller System.

− Host−based Data Services

A wide variety of host-based data services are available, including Sun StorEdge Instant Image (II), Sun StorEdge Network Data Replicator (SNDR), Veritas Volume Replicator (VVR), and an assortment of filesystem snapshot facilities. These categorically impact aggregate physical I/O demand, and generally introduce some variance, throttling, and temporal dependencies in maintaining data structures such as bitmaps.

As with their externally-based counterparts, synchronous remote replication products can introduce significant or pathological throttling and variance depending on the Quality of Service (QOS) of the network interconnects.

Nevertheless, as a class, these products effectively facilitate a wide range of cost-effective business solutions. System designs should incorporate appropriate considerations to assure success with these products. For example, cached and/or dedicated storage might be allocated for bitmap data structures.

− Volume Management

Volume managers, such as the Veritas Volume Manager (VxVM) or Solstice DiskSuite (SDS), provide myriad techniques for manipulating I/O supply. The code path overhead of volume management is usually considered to be so slight as to be insignificant, but several topics in volume management can have extreme significance. Volume managers provide host-based RAID capabilities, which is where many layout decisions are typically implemented.

Host-based RAID 0, RAID 1, and hybrids of the two actually impose very little overhead relative to the native capabilities of the attached storage. Host-based RAID 5, which is implemented at the volume manager layer, is entirely another matter. Since host memory is not persistent like the caches of hardware RAID subsystems, RAID 5 parity data must be flushed along with each synchronous write in realtime, and this is typically disastrous relative to database write requirements. On the other hand, some very large DSS databases might rationally employ economical host-based RAID 5 for predominantly read-oriented tasks where the write penalty is not too severe with respect to the rate of data updating. Host-based RAID 5 exhibits very good read bandwidth.

Due to the popularity of VxVM on Sun systems, VxVM issues tend to have a high rate of impact when they occur. Therefore, a few details specific to VxVM are offered here.

With mirrored volumes implemented in VxVM, recovery time after an abnormal system shutdown can be greatly reduced by implementing Dirty Region Logging (DRL). It is generally recommended to group all DRL areas for a system on a few disks dedicated to that purpose. When properly implemented, DRL maintenance overhead is often characterized as reducing write supply by about 10%. Without DRLs, recovery after a crash can require complete resilvering of disrupted mirrors.
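
A hedged VxVM sketch of adding a DRL to an existing mirrored volume follows; the disk group, volume, and the disk named to hold the log are illustrative assumptions.

    # Add a dirty region log to mirrored volume oradata01, placing it on disk drl01
    vxassist -g oradg addlog oradata01 logtype=drl drl01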

One VxVM issue to which ORA-27062 incidents were explicitly tied pertained to systems using Dirty Region Logging with mirrored volumes under heavy load (see Bug Report #4333530, fixed at VxVM 3.1 and by current patch to version 3.0.4). An I/O could get 'hung' awaiting a DRL update, thus leading to extreme variance.

Another recent issue was noted in VxVM 3.0.1, where the default read policy for mirrored data was changed from a prior default that tended to favor one side of the mirror for a series of sustained reads (see Bug Report #4255085, fixed at VxVM 3.1). Under the prior default with JBOD arrays, sustained reading would benefit from revisiting the spindle cache of the chosen disk. This issue surfaced in terms of reduced supply of sequential performance.

− Filesystem Cache

Simply put, filesystem caching is bad for synchronous writes, but good for reads that are satisfied from it. That being said, it may not be simple to determine the optimal filesystem caching scheme for a given Oracle workload. Filesystem cache performance and its contribution to Oracle performance can vary greatly depending upon the caching options used and the amount of memory available. With 32-bit Oracle, use of the filesystem cache is a principal means of exploiting large server memories.

In the Solaris OE, the filesystem cache is not a fixed-size area, but rather lives in the OS page cache, so the terms are used interchangeably in this discussion. See [Mauro & McDougall, 2001] for more information on Solaris memory management.

The 'extra copy penalty' often cited against using filesystem cache is not highly significant in the overall scheme of things. Copying data in memory is an operation at which modern Sun systems are particularly adept.

− Filesystem Block Size

Since Oracle I/O is typically in db_block_size increments and 8 KB blocks or larger are becoming the norm, it is usually advised to use a VxFS filesystem block size of 8 KB. Smaller block sizes will decrease supply and add CPU cost when used on Oracle data filesystems that never see I/O requests smaller than 8 KB.

By default, VxFS selects the filesystem block size at the time of filesystem creation based on filesystem size, without regard for its intended use. Therefore, one must explicitly specify a block size option to mkfs (see mkfs_vxfs(1M)) to assure consistent results.
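
For example, an 8 KB block size can be requested explicitly at creation time; the device path here is illustrative.

    mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/oradg/oradata01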

On UFS, 8 KB is the only block size permitted on modern Sun systems.

− Filesystem Throttles

UFS has tunable thresholds at which it will suspend and resume processes with uncommitted deferred write data, and thus prevent a single process from overrunning the page cache or using an inequitable share. A 'high-water mark' (ufs:ufs_HW) and 'low-water mark' (ufs:ufs_LW) implement this throttle.

A similar throttle is introduced with VxFS Version 3.4. Absent such a throttle, the importance of aggressive Virtual Memory Management with deferred buffered I/O was magnified (see [Sneed, 2000]).

The mere presence of a filesystem code path has some I/O throttling effect relative to RAW disk, but this turns out to be a very minor factor in most cases.

− Filesystem Single−Writer Lock

Perhaps the single most significant factor that limits Oracle write throughput with filesystems is the "single writer lock" constraint required by POSIX standards to assure proper ordering of synchronous writes. Increased write concurrency at the Oracle level is for naught when the filesystem serializes the writes. The impact of the single writer lock is reduced with externally cached storage, which provides low write latency and correspondingly lowered lock dwell times.

Since Oracle handles its own write ordering considerations, the single writer lock common to standards-conforming filesystems is both unnecessary and injurious to Oracle I/O performance. There are several ways to avoid the single writer lock altogether, including:

• Using RAW disk volumes, LUNs, or slices.
• Using VxFS Quick I/O (QIO).
• Using the UFS concurrent forced direct I/O feature (forcedirectio) available in Solaris OE 8, Update 3 or later.
• Using the Sun QFS filesystem Q-Write feature.

For users on filesystems, implementing VxFS QIO, UFS forcedirectio, or Sun QFS Q-Write requires no movement of data. In contrast, conversion from filesystem to RAW would require completely offloading and reloading the data, as would conversion between filesystems (in some cases, tools may exist to allow in-situ conversion of filesystems).
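
As a sketch, UFS forcedirectio can be enabled at mount time, either interactively or via /etc/vfstab; the device and mount point are illustrative.

    # One-off mount with forced direct I/O (Solaris OE 8, Update 3 or later)
    mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s6 /u02/oradata

    # Equivalent /etc/vfstab entry (mount options field)
    # /dev/dsk/c1t0d0s6  /dev/rdsk/c1t0d0s6  /u02/oradata  ufs  2  yes  forcedirectio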

Other points to ponder about these options include:

• QIO and RAW share the advantage of exploiting the Kernel Asynchronous I/O (KAIO) code path, which is capable of supplying more IOPS than the LWP AIO code path.

• QIO offers per-file control of filesystem caching ('Cached QIO') and the ability to observe read hit rates in the OS page cache, while UFS forcedirectio categorically disables OS buffering.

• QIO introduces some operational complexity relative to UFS forcedirectio.

• QIO is a separately licensed feature, while UFS forcedirectio comes with the Solaris OE.

Avoidance of the single writer lock and use of the KAIO code path are known techniques for reducing the probability of ORA-27062 incidents. Each option for doing this has its own pros and cons.

− Asynchronous I/O (AIO)

The Solaris OE offers two alternate AIO Application Programming Interfaces (APIs). One conforms to the POSIX Real Time (RT) specifications (e.g., aio_read(3RT), aio_suspend(3RT), et al.) implemented in librt(3LIB), and the other conforms to SUNW private specifications (e.g., aioread(3aio), aiowait(3aio), et al.) implemented in libaio(3LIB).

Oracle uses the SUNW API.


The SUNW implementation in libaio includes a hard-coded throttle called _max_workers which limits the number of active requests per process. This value is set to 256 (it was 50 prior to Solaris OE 2.6). A process can queue requests past this limit subject to available memory, but queuing far past this limit offers no tangible benefit.

For more information on SUNW AIO implementation and usage, see [Mauro & McDougall, 2001] and [McDougall & Gummuluru, 2001].

Kernel Asynchronous I/O (KAIO), which is used on RAW and VxFS QIO files, is often cited as yielding 10-12% gains on extremely write-intensive benchmark workloads. However, for many workloads its impact may be negligible, especially considering that many database workloads are either extremely read-biased or more constrained by bandwidth than write concurrency.

As mentioned earlier, it is not advisable to use too many DB writer processes when AIO is used.

− Process Scheduling

Last but not least, Oracle processes cannot generate I/O demand unless they are in memory, ready to run, and not deprived of needed CPU and memory resources. Aside from the topics of having the CPU and memory appropriately sized for the workload, this relates to several system management topics, including Virtual Memory Management (VMM) configuration (see [Sneed, 2000] for more information on VMM tuning and Intimate Shared Memory (ISM) usage with Oracle) and manipulation of process priorities.

Process priority manipulation is a rather complex topic in the Solaris OE, and a wide assortment of tools and techniques are directed at it. dispadmin(1M) and related topics from the 'SEE ALSO' section of the dispadmin man page are the main tools for manipulating priority schemes. Other facilities include the venerable nice(1) command; processor sets (psrset(1M)); and the separately licensed Solaris Resource Manager.
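
For illustration, the main tools might be exercised as follows; the CPU numbers, processor set id, and PID are illustrative, and such changes should be made only with a clear objective in mind.

    # List the configured scheduler classes and their dispatch tables
    dispadmin -l

    # Fence off CPUs 4-7 for database use and bind a process to them
    psrset -c 4 5 6 7        # prints the id of the new processor set, e.g. 1
    psrset -b 1 12345        # bind PID 12345 to processor set 1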

It is noteworthy that each AIO request to filesystem files involves a lightweight process (also called an 'LWP' or 'thread'), and that these threads have their priorities shuffled in the same manner as other threads in the TS scheduler class. This can give rise to variance in response times seen by Oracle, and with extreme loads, this variance can be pathological.

Applied incorrectly, priority manipulations can lead to seriously adverse outcomes.

Conclusion

Hopefully, some of the perspective and details offered here will be found to be useful in applied terms.

Each layer of the I/O stack can be meaningfully discussed in terms of its impact on the supply of I/O to the layers above it. I/O demand results in an equilibrium that depends on numerous supply factors. This perspective can be useful in comprehending the complexity of the I/O stack, and in understanding I/O options and tradeoffs.

References

Oracle publications are generally available online via the Oracle Technology Network (OTN) Web site at http://otn.oracle.com. Membership is free.

[ORA_9i_PER] - "Oracle 9i Database Performance Guide and Reference", Oracle Corporation Part No. A87503-02, 2001.

[ORA_9i_TUN] - "Oracle 9i Database Performance Methods", Oracle Corporation Part No. A87504-02, 2001.

[Alomari, 2000] - "Oracle 8i and UNIX Performance Tuning", Ahmed Alomari, Prentice Hall, September 2000, ISBN 0130187062.

[Mauro & McDougall, 2001] - "Solaris Internals", Jim Mauro & Richard McDougall, Prentice Hall, 2001, ISBN 0-13-022496-0.

[YAPP Method, 1999] - "Yet Another Performance Profiling Method (YAPP Method)", a whitepaper by Anjo Kolk, Shari Yamaguchi & Jim Viscusi of Oracle Corporation, 1999.

[McDougall & Gummuluru, 2001] - "Oracle Filesystem Integration and Performance", a whitepaper by Richard McDougall and Sriram Gummuluru of Sun Microsystems, January 2001.

[Sneed, 2001] - "Sun StorEdge Fast Write Cache Application Notes", a whitepaper presented at the April, 2001 Sun Users and Performance Group (SUPerG) conference.

[Sneed, 2000] - "Sun/Oracle Best Practices", a Sun Blueprints Online article available at http://www.sun.com/blueprints/0101/SunOracle.pdf.

Acknowledgments

Special thanks to Geetha Rao (Veritas), Vance Ray (Veritas), Yuriy Granat (Oracle), and Jamie Vuong (Oracle) from the Veritas/Oracle/Sun Joint Escalation Center (VOS JEC) for all their contributions to the VOS JEC. Thanks to Jim Viscusi (Oracle), Rey Perez (Sun), and Elizabeth Purcell (Sun) for their review and feedback. Thanks also to Janet, my wife and volunteer editor. This document was created using StarOffice software.

© 2001 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Solaris, Solaris Resource Manager, StarOffice, Sun Blueprints, Sun QFS, Sun StorEdge, Sun StorEdge Instant Image, Sun StorEdge Network Data Replicator, Sun StorEdge T3, Sun StorEdge 9900, Sun StorEdge 9900 ShadowImage, Sun StorEdge 9900 TrueCopy, Sun StorEdge Fast Write Cache, Sun StorEdge SRC/P Intelligent SCSI RAID Controller System, and Solstice DiskSuite are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States of America. All SPARC trademarks are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
