Upload
bernard-robbins
View
213
Download
0
Tags:
Embed Size (px)
Citation preview
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The DBA Disaster Diary
Real-world Oracle failures:
How they were resolved,
and how they could have been prevented
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Jeremiah Wilton, OCMwww.speakeasy.net/[email protected]
Independent Oracle ProfessionalSeattle, Washington
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
• Many companies pursuing HA• Prolonged projects• Researching, designing and
implementing• RAC• Dataguard• Clusters• Remote DR sites
• Huge capital expenditures
High Availability
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Benefits of HA Big-Iron4-week periods around rollout
0
2
4
6
8
10
12
Hoursdown
Before HA During After HA
Planned
Unplanned
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
For my average outage in the past 12 months, would RAC or Dataguard have
helped me?
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #1
• 2-node RAC cluster• Array controller fails• Replacement parts 6
hours away• No value from RAC in
this outage• Dataguard might have
helped…
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #2
• Same controller failure with 2-site Dataguard• Standby is missing last few transactions
unless using expensive zero data loss• Management: “Wait for the part, save the
data.”• Dataguard provides no value in this outage
DataGuard
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #3
• Same 2-site Dataguard• Developer’s backfill is missing a where
clause• All customer names updated to “Fred
Flintstone”• Developer takes 30 minutes to realize the
mistake and alert the DBA• Too late; the standby has committed the
changes• Dataguard provides no value in this outage
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Solutions for hypothetical #3
• Flashback Query– Extension of CR mechanism– SMU only– Retention period
• Logminer– Generate undo SQL– Apply one rowid at a time.
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #4
• RAC• Zero data-loss dataguard w/3-day apply delay• SMU for flashback query• Logminer at the ready
Zero data loss DataGuard
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Hypothetical #4
• Belt• Suspenders• Duct tape• Dry suit• Kevlar helmet
Welcome to the Disaster Diary…
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Chapter 1: Double Failure
Setting: Internet StartupCirca: June 1997Staff: 1 novice DBA, 1 systems engineerHardware: DEC Alphaserver 4800 8 procsStorage: 2 DEC Storageworks HSZ50 arrays of
100Gb of RAID0+1Operating System: Digital Unix 4.0BOracle version: 7.3.2.3Oracle Support: Silver (US)Backups: Daily hot tablespace backup to DLT tape
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Some company background
• Independent thinkers
• Intelligent• Specific
knowledge, certifications less important
• Critical thinking• Work ethic
• New DBA after two years of developer management
• Developers had managed well– Hot backups– v$sesstat tuning tools– v$session_event tools– Lock monitoring– Space management
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Vigilant at investigating new features
• 7.3 Maxextents unlimited• Prevent senseless DML failures
select 'alter '||segment_type||' '||segment_name||' storage (maxextents unlimited);'from dba_segmentswhere segment_type in ('TABLE','INDEX');
• Seemed to work well…
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Planned Outage
• During the DBA’s third week• Modify and add some tables (simple
DDL)• Planned for late Thurs. PM/Fri. AM
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
DLT problems
• Thursday AM before outage, backup failed with a DLT drive error:/opt/app/oracle/product/7.3.2/local/bin/gnutar: Cannot write to /dev/nrmt0h: I/O
error
BUS free ERROR - os_std, os_type = 11, std_type = 10 (from uerf)
• Outage on Fri. AM during backup window
• Last good backup before outage is Wed. AM.
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
15-Minute Outage
1. Put out the closed sign on the website2. Shut down the listeners3. Shut down the instance (immediate)4. Start up the instance in restrict mode5. Make schema changes6. Start up the listeners7. Restart the webserver software so it reconnects
to Oracle8. Make sure the site is working internally9. Take down the closed sign and let the public
back in
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 01:00SE begins outage steps; gets stuck reopening database with ORA-00600 [4000].
• Friday 01:15SE trying to figure out the problem. – Checks host logs – alert file– trace file from the ORA-00600– Searches Alta Vista and newsgroups for similar occurrences; no
positive results.
• Friday 01:45SE in Silver Support severity-1 call queue.
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 02:00SE gets Oracle Support in the UK. Analyst opens a TAR– Requests alert and trace files via email.– Will call back when received
• Friday 02:45Oracle Support has not called back– SE calls back in on the TAR– Analyst says she did not receive any email! “Sometimes their email
systems can be slow.”– Around this time, SE receives his email bounced by Oracle’s mail
server for being too large.
Oracle and SE agree on FTP – SE uploads the files– Support analyst will examine and call back shortly
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 04:30DBA, CTO, VP Tech. arrive. Site now down 3.5 hours.– VP starts calling people at Oracle – The DBA is horrified. – DBA & SE call Oracle Support again since no callback yet– Support says “corruption of some sort”; restore from backup and roll
forward
• Friday 05:00DBA & SE begin restore
• Friday 06:15Restore completes– DBA & SE begin roll forward through 48 hours of logs– Unaware of parallel recovery feature; recovering serially
• Friday 06:45– Recovery time projected to take nine hours
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 12:00A “friend” of the company knows Ellison; calls him;– TAR transferred to bug diagnostics and escalation (BDE)– Local Oracle office offers to send personnel on site
• Friday 15:30Roll forward completes. A cold backup is taken before attempting to open the database.
• Friday 16:00• Recovered DB fails on open with ORA-00600 [4000] again
– Restore/recover had no effect
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 16:4516 hours into the outage– BDE divulges a known bug:
[BUG:434596] prevents a database from being opened if BOOTSTRAP$ has its MAXEXTENTS value modified. Although it is unlikely anyone would intentionally change this, a number of users have run scripts to change the storage on ALL tables and in doing so have encountered this problem. Fixed in 7.3.4.
– Restore/Recover was a waste of time– Original Support recommendation lacked specific
understanding of problem
Roll forward did nothing to fix several-week-old BOOTSTRAP$ extent modification
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Remain calm, think clearly
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Macro timeline
• This is a “sleeper”– Data-related problem introduced many backup
cycles ago– Sits on disk until database restart– Too much transpired since last “good” backup– No real recovery strategy available to end user
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 17:30BDE requests dial-in access– “Work” copy of the DB placed new DEC Storageworks HSZ50
array for BDE– Original broken DB preserved in place
Local Oracle office says on-site support analyst on the way
• Friday 18:00Oracle Support BDE dialed in and working
• Friday 19:00On-site Oracle Support arrives
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 19:53Media starts to take notice; Reuters report leads to articles in:– Wall Street Journal– TV networks– Online news sites
Web site suffers outageLOS ANGELES, June 6 (Reuter) - The World Wide Web site of [the company] went down Friday, preventing Internet customers from ordering … from the online retailer. According to a notice posted on the company's Web page, work was being done on the system and customers should try back later. A spokesman for [the company] was not immediately available to comment on the reason for the outage or how long it would last.
• Friday 22:00BDE uses BBED to patch blocks of BOOSTRAP$’s extent map in datafile 1 of SYSTEM tablespace
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Friday 22:15DB successfully opens!– BDE repaired DB on
new storage array– Now running on the
new array
• Friday 22:30Sanity testing begins
• Friday 23:00DB host crashes with Unix kernel fault!– 20 minutes to reboot
• Friday 23:30Site back up and available on the Internet
Total outage time: 22 hours and 30 minutesReduced YTD availability from 99.986% down to 99.729%
• Saturday 00:00Hot backup of repaired DB started
• Saturday 01:00Everyone goes home.
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Saturday 06:15DBA & SE paged with corruption errors in alert file:
Corrupt block dba: 0x4c0530a6 file=19. blocknum=340134. found during buffer read on disk type:6. ver:1. dba: 0x4c0530a6 inc:0x000004e0 seq:0x0000026c incseq:0x00110000
Reread of block=4c0530a6 file=19. blocknum=340134. found same corupted data
Errors coming fast, but website is still up– Small number of read failures
• Saturday 06:45SE back on phone with OracleDBA & CTO on their way back
• Saturday 08:00DBA arrives and joins call with Oracle
– BDE engaged– Everyone suspects Oracle problem related to block-munching by BDE– Non-website internal tools are disabled– BDE looks at block dumps; log in to examine the problem
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Saturday 12:00BDE says new RAID array randomly and steadily corrupting blocks
The DBA is horrified.
SE makes a point:– Although blocks on the disk getting corrupted, redologs are on good array– Redologs not getting corrupted– No corrupted data being read and rewritten
DBA & SE conclude that a good version of each successful transaction preserved in the redologs
Middle of day and peak website period– DBA & SE decide to keep the database up– Will run and in hobbled state until period of low activity– Then restore most recent backup to working array & roll forward
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Saturday 16:30DBA & SE copy cold backup from before first BDE patch to good array– Most recent hot backup taken from bad RAID array– Previous one taken after initial restore and recover but prior to
patching– BDE will have to log in and re-patch this copy
• Saturday 21:00Restore & recover in parallel with running DB– DBA & SE start applying Saturday’s logs to the restored database– Corrupting copy still running– Parallel recovery this time
• Sunday 03:00Recovery catches up to the corrupted open production DB– BDE logs in and reapplies block patches for BOOTSTRAP$ bug– Recovered copy still not open; Corrupted copy still running
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
The Outage
• Sunday 04:00Site and running corrupted DB taken down– DBA & SE take cold backup of recovered and repaired database– Last few logs generated by the corrupted system applied
• Sunday 06:00Repaired DB opened; testing begins
• Sunday 08:50Testing complete; Site back up
• Sunday 09:00CEO arrives and takes pictures with disposable camera
• Sunday 10:00VP Tech. takes IT staff out for breakfast but nobody hungry
01:00 06:00 12:00Friday
18:00 00:00 06:00 12:00Saturday
18:00 00:00 06:00Sunday
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Causes of outage,Avoidance strategies
• BOOTSTRAP$ bug “sleeper”– Watch for bugs (no good way)– Doing things to dictionary tables (not documented)– Reasonably, this was an unavoidable problem– Could have been avoided by not trying to be smart
• Array corruption (never verified – controller replaced)– Write programs to exercise and verify new arrays
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Sleepers
• Introduction of latent problem long before manifestation makes recovery not an option
• Changes to many dictionary tables only read at initialization
• Extremely insidious; no good recovery path other than BBED
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Time Spent
Top time sinks for this outage• 34% Applying redologs• 23% Waiting for Oracle Support• 12% Troubleshooting• 10% Testing• 8% Working with Oracle Support• 5% Restoring from tape• 5% Taking cold backups• 2% Kernel panic/reboot• 1% Initial outage
Applying redologs
Cold backup
Initial outage
Kernel Panic/Reboot
Restoring from tape
Testing
Troubleshooting
Waiting for Oracle Support
Working with Oracle Support
Sum of Minutes
Activity
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Redolog apply time – 34%
• Lack of understanding of the problem
• Lack of more recent backup
• Ignorance of parallel recovery– Learn about new features– Practice variety of recoveries frequently– Tune recovery: wait events– Become an expert in recovery
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Waiting for Oracle Support – 23%
• Managing support is a science• Create your own TAR to avoid errors
– MetaLink iTAR– Dial-in to TAR entry system
• Upload files to FTP site before calling in• Call in sev-1 and judge analyst’s skill• Escalate immediately if necessary• Find out when shift ends• Watch analyst annotations in TAR and add
your own• Make your own judgments on validity of
solution
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Troubleshooting – 12%
• Too much time spent fiddling– Should establish rules and time limits– When in doubt start the TAR while
troubleshooting– Escalate to other staff quickly
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Time Spent
Top time sinks for this outage• 34% Applying redologs• 23% Waiting for Oracle Support• 12% Troubleshooting• 10% Testing• 8% Working with Oracle Support• 5% Restoring from tape• 5% Taking cold backups• 2% Kernel panic/reboot• 1% Initial outage
Applying redologs
Cold backup
Initial outage
Kernel Panic/Reboot
Restoring from tape
Testing
Troubleshooting
Waiting for Oracle Support
Working with Oracle Support
Sum of Minutes
Activity
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Lessons• What technology would have helped?
– RAC?– DataGuard?– Backups?
• Hypothetical belt & Suspenders, helmet, etc.?
Zero data loss DataGuard
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Lessons
• Most valuable availability assets:– Solid understanding of internals– Support evasion countermeasures– Excellent support contract and personal contacts
• Attend conferences and seek to meet Oracle dev folks
• Never do anything to SYS-owned objects without complete understanding
• Perform testing on new equipment before placing in production
• Troll MetaLink, mailing lists and newsgroups for reports of trouble with new features
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Real Failures
• Real failures never resemble the examples in the RAC and DataGuard marketing literature
• Real failures are weird, unpredictable and require fast critical thinking
• Invest in equipment conservatively and srategically
• Invest in education liberally
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Future Disaster Diary Chapters
• Share your disasters with me– Anonymity– Acknowledgements– Preferably Oracle-related
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
• Half-day Crisis Management and Disaster Recovery Seminars
• Week-long Basic and Advanced Oracle Administration Classes
• Remote Emergency DBA Services
www.speakeasy.net/[email protected]
Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.
Q & A