Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.

The DBA Disaster Diary

Real-world Oracle failures:

How they were resolved,

and how they could have been prevented


Jeremiah Wilton, OCM

Independent Oracle Professional

Seattle, Washington

http://www.speakeasy.net/~jwilton

mailto:[email protected]


High Availability

• Many companies pursuing HA
• Prolonged projects
• Researching, designing and implementing
  • RAC
  • Dataguard
  • Clusters
  • Remote DR sites
• Huge capital expenditures


Benefits of HA Big-Iron

[Chart: hours down, planned and unplanned, in 4-week periods around rollout (before HA, during, after HA)]


For my average outage in the past 12 months, would RAC or Dataguard have helped me?


Hypothetical #1

• 2-node RAC cluster
• Array controller fails
• Replacement parts 6 hours away
• No value from RAC in this outage
• Dataguard might have helped…


Hypothetical #2

• Same controller failure with 2-site Dataguard
• Standby is missing last few transactions unless using expensive zero data loss
• Management: “Wait for the part, save the data.”
• Dataguard provides no value in this outage


Hypothetical #3

• Same 2-site Dataguard
• Developer’s backfill is missing a where clause
• All customer names updated to “Fred Flintstone”
• Developer takes 30 minutes to realize the mistake and alert the DBA
• Too late; the standby has committed the changes
• Dataguard provides no value in this outage


Solutions for hypothetical #3

• Flashback Query
  – Extension of CR mechanism
  – SMU only
  – Retention period
• Logminer
  – Generate undo SQL
  – Apply one rowid at a time.


Hypothetical #4

• RAC
• Zero data-loss dataguard w/3-day apply delay
• SMU for flashback query
• Logminer at the ready


Hypothetical #4

• Belt
• Suspenders
• Duct tape
• Dry suit
• Kevlar helmet

Welcome to the Disaster Diary…


Chapter 1: Double Failure

Setting: Internet Startup
Circa: June 1997
Staff: 1 novice DBA, 1 systems engineer
Hardware: DEC Alphaserver 4800 8 procs
Storage: 2 DEC Storageworks HSZ50 arrays of 100 GB of RAID0+1
Operating System: Digital Unix 4.0B
Oracle version: 7.3.2.3
Oracle Support: Silver (US)
Backups: Daily hot tablespace backup to DLT tape


Some company background

• Independent thinkers
• Intelligent
• Specific knowledge, certifications less important
• Critical thinking
• Work ethic
• New DBA after two years of developer management
• Developers had managed well
  – Hot backups
  – v$sesstat tuning tools
  – v$session_event tools
  – Lock monitoring
  – Space management


Vigilant at investigating new features

• 7.3 Maxextents unlimited
• Prevent senseless DML failures

select 'alter '||segment_type||' '||segment_name||' storage (maxextents unlimited);'
from dba_segments
where segment_type in ('TABLE','INDEX');

• Seemed to work well…


Planned Outage

• During the DBA’s third week
• Modify and add some tables (simple DDL)
• Planned for late Thurs. PM/Fri. AM


DLT problems

• Thursday AM before outage, backup failed with a DLT drive error:

/opt/app/oracle/product/7.3.2/local/bin/gnutar: Cannot write to /dev/nrmt0h: I/O error
BUS free ERROR - os_std, os_type = 11, std_type = 10 (from uerf)

• Outage on Fri. AM during backup window

• Last good backup before outage is Wed. AM.


15-Minute Outage

1. Put out the closed sign on the website
2. Shut down the listeners
3. Shut down the instance (immediate)
4. Start up the instance in restrict mode
5. Make schema changes
6. Start up the listeners
7. Restart the webserver software so it reconnects to Oracle
8. Make sure the site is working internally
9. Take down the closed sign and let the public back in


The Outage

• Friday 01:00: SE begins outage steps; gets stuck reopening database with ORA-00600 [4000].
• Friday 01:15: SE trying to figure out the problem.
  – Checks host logs
  – alert file
  – trace file from the ORA-00600
  – Searches Alta Vista and newsgroups for similar occurrences; no positive results.
• Friday 01:45: SE in Silver Support severity-1 call queue.

[Timeline graphic: Friday 01:00 through Sunday 06:00]


The Outage

• Friday 02:00: SE gets Oracle Support in the UK. Analyst opens a TAR
  – Requests alert and trace files via email.
  – Will call back when received
• Friday 02:45: Oracle Support has not called back
  – SE calls back in on the TAR
  – Analyst says she did not receive any email! “Sometimes their email systems can be slow.”
  – Around this time, SE receives his email bounced by Oracle’s mail server for being too large.
  Oracle and SE agree on FTP
  – SE uploads the files
  – Support analyst will examine and call back shortly



The Outage

• Friday 04:30: DBA, CTO, VP Tech. arrive. Site now down 3.5 hours.
  – VP starts calling people at Oracle
  – The DBA is horrified.
  – DBA & SE call Oracle Support again since no callback yet
  – Support says “corruption of some sort”; restore from backup and roll forward
• Friday 05:00: DBA & SE begin restore
• Friday 06:15: Restore completes
  – DBA & SE begin roll forward through 48 hours of logs
  – Unaware of parallel recovery feature; recovering serially
• Friday 06:45
  – Recovery time projected to take nine hours



The Outage

• Friday 12:00: A “friend” of the company knows Ellison; calls him
  – TAR transferred to bug diagnostics and escalation (BDE)
  – Local Oracle office offers to send personnel on site
• Friday 15:30: Roll forward completes. A cold backup is taken before attempting to open the database.
• Friday 16:00: Recovered DB fails on open with ORA-00600 [4000] again
  – Restore/recover had no effect



The Outage

• Friday 16:45: 16 hours into the outage
  – BDE divulges a known bug:

    [BUG:434596] prevents a database from being opened if BOOTSTRAP$ has its MAXEXTENTS value modified. Although it is unlikely anyone would intentionally change this, a number of users have run scripts to change the storage on ALL tables and in doing so have encountered this problem. Fixed in 7.3.4.

  – Restore/Recover was a waste of time
  – Original Support recommendation lacked specific understanding of problem

  Roll forward did nothing to fix several-week-old BOOTSTRAP$ extent modification


http://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_id=434596&p_database_id=BUG


Remain calm, think clearly


Macro timeline

• This is a “sleeper”
  – Data-related problem introduced many backup cycles ago
  – Sits on disk until database restart
  – Too much transpired since last “good” backup
  – No real recovery strategy available to end user


The Outage

• Friday 17:30: BDE requests dial-in access
  – “Work” copy of the DB placed on new DEC Storageworks HSZ50 array for BDE
  – Original broken DB preserved in place
  Local Oracle office says on-site support analyst on the way
• Friday 18:00: Oracle Support BDE dialed in and working
• Friday 19:00: On-site Oracle Support arrives



The Outage

• Friday 19:53: Media starts to take notice; Reuters report leads to articles in:
  – Wall Street Journal
  – TV networks
  – Online news sites

  Web site suffers outage
  LOS ANGELES, June 6 (Reuter) - The World Wide Web site of [the company] went down Friday, preventing Internet customers from ordering … from the online retailer. According to a notice posted on the company's Web page, work was being done on the system and customers should try back later. A spokesman for [the company] was not immediately available to comment on the reason for the outage or how long it would last.

• Friday 22:00: BDE uses BBED to patch blocks of BOOTSTRAP$’s extent map in datafile 1 of SYSTEM tablespace



The Outage

• Friday 22:15: DB successfully opens!
  – BDE repaired DB on new storage array
  – Now running on the new array
• Friday 22:30: Sanity testing begins
• Friday 23:00: DB host crashes with Unix kernel fault!
  – 20 minutes to reboot
• Friday 23:30: Site back up and available on the Internet
  Total outage time: 22 hours and 30 minutes
  Reduced YTD availability from 99.986% down to 99.729%
• Saturday 00:00: Hot backup of repaired DB started
• Saturday 01:00: Everyone goes home.
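
The availability figures on this slide can be reproduced with simple arithmetic. The sketch below is not from the original talk; it assumes the percentages are measured against a full 8,760-hour (non-leap) year, an inference that happens to match the quoted drop from 99.986% to 99.729% for a 22.5-hour outage. The function name and the period constant are illustrative.

```python
# Assumption: availability is expressed against a full non-leap year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def availability_after_outage(before_pct, outage_hours, period_hours=HOURS_PER_YEAR):
    """Return availability in percent after subtracting one outage."""
    lost_pct = outage_hours / period_hours * 100.0
    return before_pct - lost_pct


if __name__ == "__main__":
    # Friday 01:00 to Friday 23:30 is 22.5 hours of downtime.
    after = availability_after_outage(99.986, 22.5)
    print(f"{after:.3f}%")
```

Under the same assumption, each additional hour of downtime costs roughly 0.0114 percentage points of yearly availability.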



The Outage

• Saturday 06:15: DBA & SE paged with corruption errors in alert file:

  Corrupt block dba: 0x4c0530a6 file=19. blocknum=340134. found during buffer read on disk type:6. ver:1. dba: 0x4c0530a6 inc:0x000004e0 seq:0x0000026c incseq:0x00110000
  Reread of block=4c0530a6 file=19. blocknum=340134. found same corupted data

  Errors coming fast, but website is still up
  – Small number of read failures
• Saturday 06:45: SE back on phone with Oracle
  DBA & CTO on their way back
• Saturday 08:00: DBA arrives and joins call with Oracle
  – BDE engaged
  – Everyone suspects Oracle problem related to block-munching by BDE
  – Non-website internal tools are disabled
  – BDE looks at block dumps; logs in to examine the problem



The Outage

• Saturday 12:00: BDE says new RAID array randomly and steadily corrupting blocks
  The DBA is horrified.
  SE makes a point:
  – Although blocks on the disk getting corrupted, redologs are on good array
  – Redologs not getting corrupted
  – No corrupted data being read and rewritten
  DBA & SE conclude that a good version of each successful transaction preserved in the redologs
  Middle of day and peak website period
  – DBA & SE decide to keep the database up
  – Will run in hobbled state until period of low activity
  – Then restore most recent backup to working array & roll forward



The Outage

• Saturday 16:30: DBA & SE copy cold backup from before first BDE patch to good array
  – Most recent hot backup taken from bad RAID array
  – Previous one taken after initial restore and recover but prior to patching
  – BDE will have to log in and re-patch this copy
• Saturday 21:00: Restore & recover in parallel with running DB
  – DBA & SE start applying Saturday’s logs to the restored database
  – Corrupting copy still running
  – Parallel recovery this time
• Sunday 03:00: Recovery catches up to the corrupted open production DB
  – BDE logs in and reapplies block patches for BOOTSTRAP$ bug
  – Recovered copy still not open; corrupted copy still running



The Outage

• Sunday 04:00: Site and running corrupted DB taken down
  – DBA & SE take cold backup of recovered and repaired database
  – Last few logs generated by the corrupted system applied
• Sunday 06:00: Repaired DB opened; testing begins
• Sunday 08:50: Testing complete; site back up
• Sunday 09:00: CEO arrives and takes pictures with disposable camera
• Sunday 10:00: VP Tech. takes IT staff out for breakfast but nobody hungry



Causes of outage, Avoidance strategies

• BOOTSTRAP$ bug “sleeper”
  – Watch for bugs (no good way)
  – Doing things to dictionary tables (not documented)
  – Reasonably, this was an unavoidable problem
  – Could have been avoided by not trying to be smart
• Array corruption (never verified – controller replaced)
  – Write programs to exercise and verify new arrays
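
One way to "exercise and verify" a new array before it carries production data is to write known patterns and read them back. The following is a minimal sketch, not the tool used in this incident: the block size, the seed, and the function names are all hypothetical, and the pattern is simply SHA-256 output chained into a block. It writes deterministic blocks, records a digest of each, and reports the index of any block that reads back differently.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 8192  # hypothetical; pick the block size the database will use


def make_block(seed, index):
    """Build one deterministic pseudo-random block from a seed and block index."""
    out = bytearray()
    counter = 0
    while len(out) < BLOCK_SIZE:
        chunk = seed + index.to_bytes(8, "big") + counter.to_bytes(4, "big")
        out += hashlib.sha256(chunk).digest()
        counter += 1
    return bytes(out[:BLOCK_SIZE])


def write_pattern(path, num_blocks, seed=b"array-burn-in"):
    """Write num_blocks blocks to path and return the digest of each block."""
    digests = []
    with open(path, "wb") as f:
        for i in range(num_blocks):
            block = make_block(seed, i)
            f.write(block)
            digests.append(hashlib.sha256(block).digest())
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digests


def verify_pattern(path, digests):
    """Re-read the file and return the indexes of blocks whose digest changed."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(digests):
            if hashlib.sha256(f.read(BLOCK_SIZE)).digest() != expected:
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Demo against a temporary file; point the path at the new array instead.
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    try:
        recorded = write_pattern(demo_path, 64)
        print("corrupt blocks:", verify_pattern(demo_path, recorded))
    finally:
        os.remove(demo_path)
```

A real burn-in would need to write far more data than the host has memory, or otherwise bypass the file system cache, and repeat the verify pass for hours; a read served from cache says nothing about what is on disk.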


Sleepers

• Introduction of latent problem long before manifestation makes recovery not an option

• Changes to many dictionary tables only read at initialization

• Extremely insidious; no good recovery path other than BBED


Time Spent

Top time sinks for this outage
• 34% Applying redologs
• 23% Waiting for Oracle Support
• 12% Troubleshooting
• 10% Testing
• 8% Working with Oracle Support
• 5% Restoring from tape
• 5% Taking cold backups
• 2% Kernel panic/reboot
• 1% Initial outage

[Chart: sum of minutes by activity]


Redolog apply time – 34%

• Lack of understanding of the problem

• Lack of more recent backup

• Ignorance of parallel recovery
  – Learn about new features
  – Practice variety of recoveries frequently
  – Tune recovery: wait events
  – Become an expert in recovery


Waiting for Oracle Support – 23%

• Managing support is a science
• Create your own TAR to avoid errors
  – MetaLink iTAR
  – Dial-in to TAR entry system
• Upload files to FTP site before calling in
• Call in sev-1 and judge analyst’s skill
• Escalate immediately if necessary
• Find out when shift ends
• Watch analyst annotations in TAR and add your own
• Make your own judgments on validity of solution


Troubleshooting – 12%

• Too much time spent fiddling
  – Should establish rules and time limits
  – When in doubt start the TAR while troubleshooting
  – Escalate to other staff quickly




Lessons

• What technology would have helped?
  – RAC?
  – DataGuard?
  – Backups?
• Hypothetical belt & suspenders, helmet, etc.?


Lessons

• Most valuable availability assets:
  – Solid understanding of internals
  – Support evasion countermeasures
  – Excellent support contract and personal contacts

• Attend conferences and seek to meet Oracle dev folks

• Never do anything to SYS-owned objects without complete understanding

• Perform testing on new equipment before placing in production

• Troll MetaLink, mailing lists and newsgroups for reports of trouble with new features
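
The lesson about SYS-owned objects can be applied to the storage script from Chapter 1. The sketch below is illustrative only and was not part of the original script: it assumes rows of (owner, segment_type, segment_name) have already been fetched from dba_segments, and it skips any owner in a hypothetical exclusion list so that dictionary tables such as BOOTSTRAP$ are never touched. The APP schema and its object names are made up for the example.

```python
# Assumption: these are the owners to leave alone; extend the set as needed.
EXCLUDED_OWNERS = {"SYS", "SYSTEM"}


def maxextents_statements(segments):
    """Build ALTER statements for tables and indexes, skipping excluded owners.

    segments is an iterable of (owner, segment_type, segment_name) tuples,
    for example rows selected from dba_segments.
    """
    statements = []
    for owner, segment_type, segment_name in segments:
        if owner.upper() in EXCLUDED_OWNERS:
            continue  # never generate DDL against dictionary objects
        if segment_type not in ("TABLE", "INDEX"):
            continue
        statements.append(
            f"alter {segment_type.lower()} {owner}.{segment_name} "
            "storage (maxextents unlimited);"
        )
    return statements


if __name__ == "__main__":
    rows = [
        ("SYS", "TABLE", "BOOTSTRAP$"),
        ("APP", "TABLE", "CUSTOMERS"),
        ("APP", "INDEX", "CUSTOMERS_PK"),
    ]
    for statement in maxextents_statements(rows):
        print(statement)
```

Reviewing the generated statements by hand before running them is still the safer habit; the filter only removes the owners you thought to list.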


Real Failures

• Real failures never resemble the examples in the RAC and DataGuard marketing literature

• Real failures are weird, unpredictable and require fast critical thinking

• Invest in equipment conservatively and strategically

• Invest in education liberally


Future Disaster Diary Chapters

• Share your disasters with me
  – Anonymity
  – Acknowledgements
  – Preferably Oracle-related


• Half-day Crisis Management and Disaster Recovery Seminars

• Week-long Basic and Advanced Oracle Administration Classes

• Remote Emergency DBA Services

http://www.speakeasy.net/~jwilton

mailto:[email protected]




Q & A
