Tungsten RCA Report: Q1 2012    Created: 2012-02-08
Stuart Kendrick Updated: 2012-02-09
Tungsten RCA Report
Focus on 2012-01-10
Contents

Overview
Summary
Outstanding
Team
#1 Sanity Check Disk Failure Rate
    Failed Disk History
    Disk Failure Rate
    Temperature vs Failure Rate
    Likelihood of a disk failure resulting in Catastrophic Failure
    Consult the Literature
        Further reading
#2 Consolidated Storage Major Events Timeline
#3 Cobalt Stumbles
#4 Tungsten Failovers Bumpy
    High-Level
    Detail
    Long NetApp Failover
#5 Different Fates
    Overview
    Outline
    Clumps
        SMB Clients
        NFS Clients
        iSCSI Clients
#6 Communication
Tables
Table 1: Summary of Answers
Table 2: Outstanding Questions
Table 3: Team Statistics
Table 4: Drive Cage Temperatures
Table 5: Drive Failure and Position
Table 6: Time Between Failures
Table 7: Behavior by Client
Table 8: Why Clients Respond Differently
Overview

This document summarizes the RCA team's findings around the Wednesday January 10th
incident, focused on the following questions.
1. Sanity-check Cobalt disk failure rate against industry averages
2. Build a timeline for the week surrounding the event
3. Explain why Cobalt disk maintenance triggers Tungsten failovers
4. Explain why Tungsten failovers are bumpy
5. Explain why different clients/services suffer different fates
6. Describe how we communicated to end-users and what we might do differently next time
After initiation, leadership asked the team to add the following questions; the team did not reach them.
7. Explain how Tungsten’s CPU utilization plays into failover scenarios
8. Explore the consequences of breaking Tungsten HA
Duration: Wednesday January 18th – Thursday February 9th
For detailed explanation and recommendations around the Enterprise MSSQL Clusters
(Harvard/Princeton), see Enterprise-MSSQL-Clusters-Harvard-and-Princeton-Storage-
Events.doc from Don Butler.
Summary

Defect = bug, configuration error, or suboptimal configuration choice
Table 1: Summary of Answers
Question Answer
#1: Disk Failure Rate Within industry norms; exacerbated by known defects in original disk
inventory
#2: Timeline See Timeline section
#3: Cobalt Stumbles In progress
#4: Tungsten Failovers Multiple influences, some still under investigation, others induced by
defects in clients and cockpit error
#5: Different Fates Two of Tungsten’s services failed entirely, knocking some clients
off-line completely. For the rest, client technology varies, with
varying resilience to pathology, resulting in varying responses to
disruption
#6: Communication See Timeline and Communication sections
Outstanding

This is what we haven't finished.
Table 2: Outstanding Questions
Question Detail
#1: Disk Failure Rate    What are the odds of experiencing a catastrophic disk failure on Cobalt,
                         such that we would lose most or all of the volumes it services?
#3: Cobalt Stumbles      Why does Cobalt spike latency when it loses a disk? Why does
                         Tungsten notice, often to the extent of inducing a multi-disk panic,
                         while Silo and Fred ride through these events without noticing?
#4: Tungsten Failovers   Why do our NetApps in general sometimes require an unusually long
                         time to fail over?
#7: CPU Utilization      How does Tungsten's CPU affect failover scenarios?
#8: Tungsten HA          Explore the consequences of breaking Tungsten HA: would this help?
Team

SME = Subject Matter Expert
[Extract Orwell report next week --sk.]
Table 3: Team Statistics
Member Role Hours
Don Butler SME: Harvard / Princeton
Nancy Crase SME: Communication
Ken Kawakubo SME: Storage
Stuart Kendrick Facilitator + Team Lead
Robert McDermott SME: Enterprise
Randy Rue SME: Storage
#1 Sanity Check Disk Failure Rate (Robert McDermott)

Cobalt Disk Failure Analysis

Failed Disk History

The Center's 3PAR T800 system named Cobalt has suffered 40 disk failures since it was put into
production on December 10th 2009. Technically there were only 37 failed disks, as 3 of the
40 were marked as failed when the magazine containing them failed [1]. Figure 1 shows Cobalt's
disk failure timeline.

Figure 1. Cobalt disk failure history.
Disk Failure Rate

The year 2011 is when the system started to see considerable use and when it reached its full 528-disk
capacity. Cobalt lost 23 disks in 2011 [2], bringing the annual disk failure rate to 4.36%. This is
in line with some reported industry figures, but is much higher than on any other storage system we
manage.
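The annualized rate quoted above is straightforward arithmetic; a one-line check (both figures come from the text above):

```python
# 2011 annualized disk failure rate: 23 failed disks out of the full
# 528-disk population Cobalt reached that year.
failed_2011 = 23
population = 528
afr = 100 * failed_2011 / population
print(f"2011 annual disk failure rate: {afr:.2f}%")  # -> 4.36%
```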
Temperature vs Failure Rate

The 3PAR's disk cage architecture is unusual in that some drives sit near the front of the
cage (cool) while others sit near the back (hot). We looked at the positions of the failed disks to
see whether temperature played a significant role in disk failure rates. If high temperature makes disks
fail, we should see that disks located closer to the back of the cage fail more often. Table 4
shows the dramatic 19 °F temperature difference between the front and back disks.
[1] PD 333 failed on 2011-05-20, and the magazine containing it failed on 2011-05-21, causing the untimely failure of the remaining
disks in the magazine (332, 334, and 335).
[2] See footnote above for why I count 23 disk failures, not 26.
Table 4: Drive Cage Temperatures

drive cage   coolest (front disk 3), °F   hottest (back disk 0), °F
0 77 95
1 77 95
2 77 95
3 77 95
4 80.6 98.6
5 77 96.8
6 75.2 98.6
7 80.6 100.4
8 78.8 95
9 78.8 98.6
10 80.6 100.4
11 78.8 96.8
12 75.2 91.4
13 75.2 93.2
14 73.4 95
15 75.2 93.2
16 73.4 91.4
17 73.4 91.4
18 73.4 91.4
19 73.4 93.2
average 76.55 95.27
Table 5 shows that the hottest disks did not fail at a higher rate than the coolest disks. More
of the coolest disks failed than at any other position, but the difference is likely not statistically significant.
Table 5: Drive Failure and Position
disk position failed disk count
0 9
1 9
2 7
3 12
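As a rough check on that "not statistically significant" claim, the position counts in Table 5 can be run through a chi-square goodness-of-fit test against a uniform expectation. This is an illustrative sketch (not part of the original analysis), using the closed-form survival function for 3 degrees of freedom so it needs nothing beyond the standard library:

```python
import math

def chi2_sf_df3(x):
    # Survival function P[X > x] of a chi-square distribution with
    # 3 degrees of freedom; closed form, so no scipy dependency.
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

observed = [9, 9, 7, 12]                   # failed-disk counts at positions 0-3 (Table 5)
expected = sum(observed) / len(observed)   # 9.25 per position if temperature is irrelevant
stat = sum((o - expected) ** 2 / expected for o in observed)
p = chi2_sf_df3(stat)
print(f"chi-square = {stat:.2f}, p = {p:.2f}")  # p well above 0.05
```

With a p-value around 0.7, the spread across positions is entirely consistent with chance.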
Likelihood of a disk failure resulting in Catastrophic Failure

The 3PAR doesn't use traditional disk-based RAID groups, but instead makes use of 256MB
objects called "chunklets". This allows the system to start repairing itself immediately after a
drive has failed, rather than waiting for a replacement disk to be installed in the system.
All of the RAID groups on Cobalt use RAID-6 with a set size of 14+2. The volumes have been
configured to be able to survive the failure of any one disk magazine (4 disks per magazine).
Each volume is comprised of many RAID groups, and each RAID group can survive the loss of
any two chunklets.
The volumes on Cobalt should be able to handle the loss of any individual disk magazine, or of two
disks in separate magazines. Volume failure and data loss can be expected if three drives, sitting
in two or three different magazines, fail within a short span of time, before the system has
protected itself from the first failure. We currently don't have an accurate measure of the time it
takes for the system to restore full RAID-6 protection after a disk failure; the longer this window,
the greater the risk.
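The survivability rules above can be illustrated with a toy model. The layout here is an assumption for illustration only (8 magazines of 4 disks, one RAID-6 group spread 2 chunklets per magazine), not Cobalt's actual chunklet map:

```python
# Toy layout (assumed for illustration, not Cobalt's real chunklet map):
# 8 magazines of 4 disks each, and one RAID-6 (14+2) group whose 16
# chunklets are spread two-per-magazine across the magazines.
magazines = [[m * 4 + i for i in range(4)] for m in range(8)]
group = [m * 4 + i for m in range(8) for i in (0, 1)]

def survives(failed_disks):
    # RAID-6 tolerates the loss of at most two chunklets per group.
    return sum(1 for d in group if d in failed_disks) <= 2

print(survives(set(magazines[3])))   # whole magazine lost: survives
print(survives({0, 4}))              # two disks in two magazines: survives
print(survives({0, 4, 8}))           # three disks in three magazines: data loss
```

This reproduces the three cases in the text: one whole magazine is fine, two disks in separate magazines are fine, and a third disk in yet another magazine (before the rebuild completes) is not.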
Given all of the above, and the fact that the drives often fail in clusters, I lack the ability to produce
a mathematical model of the probability of catastrophic failure. Table 6 shows the
time between all disk failures (except the 3 due to the magazine failure). Even if the system takes 12
hours to restore full RAID-6 protection, we have already had two events in which a second disk
failed before the system had restored full data protection. Still assuming 12 hours to restore
protection, there have been two periods during which we were at risk of a third disk failure taking
out volumes.
Table 6: Time Between Failures

Between      Hours
F1 – F2 1052.1
F2 – F3 2032.9
F3 – F4 45.88
F4 – F5 38.46
F5 – F6 1848.46
F6 – F7 1135.45
F7 – F8 833.66
F8 – F9 30.4
F9 – F10 1481.03
F10 – F11 30.32
F11 – F12 1043.18
F12 – F13 12.6
F13 – F14 144.45
F14 – F15 180.16
F15 – F16 85.12
F16 – F17 582.76
F17 – F18 435.72
F18 – F19 1684.97
F19 – F20 41.28
F20 – F21 11.85
F21 – F22 39.87
F22 – F23 60.27
F23 – F24 8.3
F24 – F25 181.23
F25 – F26 875.93
F26 – F27 111.92
F27 – F28 17.07
F28 – F29 1119.7
F29 – F30 54.47
F30 – F31 243.92
F31 – F32 40.66
F32 – F33 12.95
F33 – F34 39.81
F34 – F35 450.35
F35 – F36 35.23
F36 – F37 46.35
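The two at-risk windows claimed above can be recounted from Table 6 with a short script; the 12-hour rebuild window is the same assumed (not measured) figure used in the text. The coefficient-of-variation check at the end is an informal extra, anticipating the "clustered failures" finding in the literature survey below:

```python
from statistics import mean, stdev

# Inter-failure gaps, in hours, transcribed from Table 6 (F1-F2 ... F36-F37).
gaps_hours = [
    1052.1, 2032.9, 45.88, 38.46, 1848.46, 1135.45, 833.66, 30.4,
    1481.03, 30.32, 1043.18, 12.6, 144.45, 180.16, 85.12, 582.76,
    435.72, 1684.97, 41.28, 11.85, 39.87, 60.27, 8.3, 181.23,
    875.93, 111.92, 17.07, 1119.7, 54.47, 243.92, 40.66, 12.95,
    39.81, 450.35, 35.23, 46.35,
]

REBUILD_WINDOW = 12  # hours -- the same assumption used in the text
at_risk = [g for g in gaps_hours if g < REBUILD_WINDOW]
print(f"{len(at_risk)} second failures arrived inside the window: {at_risk}")

# Burstiness check: a memoryless (Poisson) failure process has a
# coefficient of variation near 1; CV > 1 suggests clustered failures.
cv = stdev(gaps_hours) / mean(gaps_hours)
print(f"coefficient of variation of inter-failure times: {cv:.2f}")
```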
Consult the Literature (Stuart Kendrick)
“Are Disks the Dominant Contributor for Storage Failures?
A Comprehensive Study of Storage Subsystem Failure Characteristics”
Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky
[University of Illinois + Network Appliance Inc]
Presented at the USENIX Conference on File and Storage Technologies (FAST), 2008
http://www.usenix.org/event/fast08/tech/full_papers/jiang/jiang.pdf
Large Data Set
Abstract
[…]
This paper analyzes the failure characteristics of storage subsystems. More specifically, we
analyzed the storage logs collected from about 39,000 storage systems commercially deployed at
various customer sites. The data set covers a period of 44 months and includes about 1,800,000
disks hosted in about 155,000 storage shelf enclosures.
[…]
Failure Rates and Reasons
Generally, disks experience an annualized failure rate of <1%. However, in the field, customers
replace disks at rates ranging from 2-4%, for various reasons explained in the paper.
Storage typically fails for a range of reasons, of which disk failure is a common but not dominant
factor.
In addition to disk failures that contribute to 20-55% of storage subsystem failures, other
components such as physical interconnects (including shelf enclosures) and protocol stacks also
account for significant percentages (27-68% and 5-10%, respectively) of failures.
Clustered failures are common
Each individual storage subsystem failure type and storage subsystem failure as a whole exhibit
strong correlations, (i.e. after one failure, the probability of additional failures of the same type
is higher). In addition, failures also exhibit bursty patterns in time distribution, (i.e. multiple
failures of the same type tend to happen relatively close together).
Further reading

Disk Failures
http://www.cs.cmu.edu/~bianca/project_pages/project_reliability.html
Empirical measurements of disk failure rates and error rates
http://arxiv.org/ftp/cs/papers/0701/0701166.pdf
Reliability analysis of disk drive failure mechanisms
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.1618&rep=rep1&type=pdf
Latent Sector Errors
http://www.usenix.org/event/fast07/tech/full_papers/pinheiro/pinheiro.pdf
Vibration
http://www.usenix.org/event/sustainit10/tech/full_papers/turner.pdf
Using Syslog Messages for Predicting Disk Failures
http://www.usenix.org/event/lisa10/tech/slides/featherstun.pdf
System Impacts of Storage Trends: Hard Errors and Testability
http://www.usenix.org/publications/login/2011-06/pdfs/Hetzler.pdf
Disks are like snowflakes: no two are alike
http://www.usenix.org/events/hotos/tech/final_files/Krevat.pdf
An Analysis of Data Corruption in the Storage Stack
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/
#2 Consolidated Storage Major Events Timeline (Ken Kawakubo)
[Figure: Consolidated Storage Major Events Timeline, 2009-12-03 through 2012-02-05 (kkawakub, 2012-02-04). The events shown in the figure appear in detail in the table below.]
Date Event
2009-12-03 CS goes live. Cobalt (3PAR T800) started out with 2 cabinets with 4 controller nodes and
8 shelves (cages). (288) 1TB SATA drives provided 288TB raw or 222TB usable space.
2010 Steady migration of data to CS started.
2010-02-03 1st Cobalt disk failure
2010-04-08 2nd Cobalt disk failure
2010-08-08/12 3 Cobalt disks failed
2010-10-01 1st Cobalt Expansion - added 1 cabinet, 2 controller nodes, and 7 cages (shelves). (56) 2TB drives (in 14 magazines)
expanded the total raw capacity to 400TB or 308TB usable space
2010-10-12 1 Cobalt disk failed
2010-11-04 Fred lost one virtual disk provisioned from Cobalt. It took over a month to restore data from TSM backup tapes. 1st
major CS-related outage. 50 hours infra Ops (DAS and data protection) time
2010-11-12/13 Network configuration change work on Tungsten required a cluster failover. After the failover event, Tungsten-a
did not boot due to 2 bad DIMMs; replacement DIMMs did not arrive until the next morning. Also, some iSCSI LUNs went
offline. 12 hours Infra Ops time
2010-12-15 1 Cobalt disk failed
2011-01-09 2nd Cobalt Expansion – added (60) 2TB drives (in 15 magazines) and expanded the total raw capacity to 520TB or
398TB usable.
2011-01-18/19 2 Cobalt disks failed
2011-02-20 Tungsten Ontap upgraded to 7.3.5P1 to address network configuration and CIFS issues. During the upgrade, major
systems such as Zimbra and Enterprise SQL cluster were shut down in an orderly way to avoid their going offline.
Went mostly smoothly. Used efnet IRC chat for communication among administrators.
2011-03 Tungsten-a CPU usage nearing 100%.
2011-03-20 SCHARP migrates to Consolidated Storage.
2011-03-21 SCHARP users unable to employ CS on account of Firewall-based NATing; configuration change to Tungsten
fixes this. Firewall-induced slowdown plus high Tungsten CPU utilization continue to impact SCHARP.
2011-03-23/25 Cobalt lost 2 disks and Tungsten-a experienced very high CPU usage. Tungsten had its first MDP (multi-disk
failure panic) failover. The vendors' RCAs were not conclusive but pointed to the fact that on 2011-03-02, 3 Cobalt virtual
LUNs had each been assigned duplicate disk IDs.
No data was lost.
20 vColo hosts hung require reboot
1 vColo host requires restore from snapshot
Three databases on Enterprise SQL Cluster require rebuilding
50 hours Infra Ops time
20 hours NAG tech time
2011-04 Efforts to lower the load on Tungsten-a started. 10Gbps links were added to Tungsten-b so that the Tungstens could run
active-active. The vcolodata vfiler (one of the most CPU-intensive vfilers) was moved from Tungsten-a to Tungsten-b, and
SIS (dedupe) was disabled on most volumes. By the end of April, the CPU utilization on Tungsten-a stabilized at
about 70 percent.
2011-04-23 SCHARP Firewall retired.
2011-04-25 3rd and final Cobalt expansion – added two controllers, five cages (shelves) and (124) 2TB disks and expanded the
total raw capacity to 768TB or 590TB usable.
2011-05-06 2 Cobalt disks failed
2011-05-12 1 Cobalt disk failed
2011-05-20/23 3 Cobalt disks failed
2011-06-25 Completed adding disks / shuffling space (dynamic optimization) on Cobalt.
2011-07-05 1 Cobalt disk failed
2011-07-19 From 6-8pm, Tungsten-a’s CPU3 (Kahuna) was pegged at 100% while the other three CPUs stayed idle. First
“brownout” incident.
2011-08-15 SCHARP reported Tungsten performance issues. This incident turned out to be caused by faulty hardware in the
network.
2011-08-25/30 Cobalt port 4:0:1 CRC errors. Replaced fibre cables, transceivers, FCAL, and HBA.
2011-09-07/09 2 Cobalt disks failed
2011-09-13/20 4 Cobalt disks failed.
2011-09-18 Per recommendation from "Consolidated Storage Performance Analysis", 512GB FlashCache modules were installed
on the Tungsten pair. At the same time, Ontap was upgraded to 7.3.5.1P5 to address a potentially critical
vulnerability to unexpected failovers during disk errors. The failover was orderly and did not cause any issue except for
one iSCSI LUN we had left mounted for testing; this LUN went missing and needed to be restored from backup.
2011-09-21 Silo lost S drive that was provisioned from Cobalt. TSM backup had not been keeping up with Silo changes and
some data was lost. This incident led to the changes in how TSM backed up Silo volumes (use of Journaling).
2011-09-27 1 Cobalt disk failed
2011-10/11 It became clear Cobalt would run out of space by the end of the year. Sun Thumpers were deployed and used as iSCSI
targets for Silo to give temporary space relief.
2011-11-03/04 2 Cobalt disks failed
2011-11-07/08 2 Cobalt disks failed
2011-12-25/27 2 Cobalt disks failed
2012-01-06 1 Cobalt disk failed
2012-01-10 Tungsten-a had its 2nd MDP (multi-disk failure panic) failover
Detailed sequence of events
11:45am HP tech arrives
12:04-12:13pm HP tech evacuates data from magazine
12:07pm rrue alerts SOPS to maintenance via pager
12:17pm HP tech removes magazine, replaces drive
12:21pm HP tech re-inserts magazine
12:31pm First of ~40 Nagios SERVICE ALERT related to the incident – zimbra-mta1
12:31pm SQL Cluster started complaining about I/O requests taking longer than 15 sec
12:32pm Another Cobalt disk failed
12:33pm Tungsten starts error-recovery procedures toward Cobalt
12:33pm Tungsten starts warning that it is losing its failover capabilities
12:34pm Numerous clients start warning of difficulties reaching Tungsten
12:50pm Cobalt FC ports 4:5:1 and 5:5:1 which connect to Tungsten-a port 0a and 0c respectively start SCSI resets (serious
Fibre Channel issues).
12:53pm Multi-disk failure on Tungsten-A: Tungsten-B begins takeover
12:54pm Nagios warns of Tungsten-A distress
12:58pm Takeover completes (290 seconds), but Tungsten-B cannot see the two disks whose loss caused Tungsten-A to
panic: 0c.0L26 and 0c.0L187
12:58pm rrue initiates giveback
13:00pm Nagios reports Tungsten Cluster status is Critical
13:09pm Giveback completes but fails. CIFS is shut down.
13:15pm rrue engages NetApp TAC
13:18pm “[scicomp-announce] connectivity loss to storage” e-mail sent.
13:38pm The first CIT outage e-mail about Consolidated Storage was sent. This outage notice pointed out that all resources were
running on Tungsten-B except for the scharpdata aggregate 1 and adm_home aggregates, which contained the missing
logical disks.
13:46pm CIT Project office sent out e-mail “Replicon and RMT currently offline”.
14:00pm FMS_Support sent out e-mail “[FMSlist] FMS Crystal Reporting Portal Down”.
14:19pm The second CIT outage e-mail sent and confirmed that CIFS access to vfilers was not available on Tungsten-b.
14:28pm rrue restarts CIFS
14:55pm The third CIT update was sent, reporting that CIFS access to all vfilers except admhome and scharpdata had been
restored.
15:16pm CIT sent out AllHutch e-mail about the outage.
16:23pm CIT sent out an update that it was planning to restore Tungsten-a at 17:30pm.
16:38pm CIT sent out another AllHutch e-mail describing its efforts to restore services. The e-mail also pointed to
http://status.fhcrc.org for further updates.
17:36pm Tungsten cluster giveback started and completed in 71 seconds.
17:40pm After giveback, Tungsten-a found its logical disks 0c.0L26 and 0c.0L187 intact, and both the scharpdata aggregate 1 and
admhome aggregates were restored and mounted.
17:40pm For some reason, the network configuration of scharpdata aggregate 1 vfiler was incorrect and it took some time to
restore access to the vfiler.
2012-01-11
05:37am Restoring the Enterprise SQL cluster iSCSI LUNs took until early the next morning. At this point, the only
remaining service still down was CPAS on Princeton. CPAS was finally rebuilt on 2012-01-25.
No data was lost. However, this outage took 80 hours of DBA time and 30 hours of rrue’s time.
2012-01-18/20 Snow closes Hutch.
2012-01 Tungsten-a CPU usage was observed to be back to near 100%.
2012-01-29/31 3 Cobalt disks failed
2012-02-01 Tungsten-a had its 3rd MDP (multi-disk failure panic) failover
2012-02-02 Tungsten-a CPU3 (Kahuna) pegged at 100% as the other CPUs stayed idle due to CIFS lock bug.
#3 Cobalt Stumbles (Stuart Kendrick)

I acquired sufficient access permissions to engage both HP and NetApp TAC on this starting Monday February 6th; both TACs have
engaged. Preliminary results suggest a mix of pathology contributed by bugs in both manufacturers' products. More work remains:
digging through the data to which I now have access, responding to TAC queries for more information, and coordinating TAC analysis
sessions.
#4 Tungsten Failovers Bumpy (Stuart Kendrick, Randy Rue, Don Butler)

High-Level

From a naïve point of view, we understand why Tungsten-A passes the buck to Tungsten-B: it loses sight of one or more of the
volumes which Cobalt provides it. Stepping backward in causation, we do not understand why Tungsten loses sight of these volumes,
nor why Fred and Silo are unaffected; this is a work-in-progress (see #3 Cobalt Stumbles).
Once Tungsten loses sight of a volume, it cannot provide services to clients wanting that volume; this is why Administrative users lost
access to their home drives and why SCHARP users lost access to everything [3].
Detail

Additionally, the 2012-01-10 event was particularly bumpy because we drove Tungsten into a ditch. Here's how that happened.
1. The Cobalt / Tungsten pathology started ~12:30pm.
[3] Similarly, during the 2012-02-01 event, this is why CRD users lost access to their home drives.
2. Tungsten-A Panicked, and Tungsten-B initiated a Takeover, at 12:50pm. This failover took ~5 minutes – unusually long and
enough to break many applications, whose timeouts generally expire after 1-5 minutes. This lengthy Takeover risks data loss
and corruption.
3. Immediately after the failover, ~1:00pm, Randy realized that Tungsten-B was not servicing home_ad1 and scharpdata_aggr1.
4. Wanting to restore service, Randy checked Tungsten-A’s status – Tungsten-A declared that it was ready for a Giveback. Randy
initiated a Giveback ~1:00pm, hoping that Tungsten-A would resume servicing home_ad1 and scharpdata_aggr1. Givebacks
(and Takeovers) involve shutting down SMB, a necessary step on account of this protocol’s fragility.
5. Well, Tungsten-A had Panicked because it couldn’t see home_ad1 and scharpdata_aggr1. The Panic, Takeover, and
subsequent Reboot hadn’t helped … Tungsten-A still couldn’t see these resources. Why did Tungsten-A advertise itself as
ready for a Giveback? It turns out ‘ready for Giveback’ involves a range of sanity checks … but does not include checking for
access to disks – we didn’t know that.
6. The Giveback failed at ~1:10pm, because Tungsten-A still couldn’t see home_ad1 and scharpdata_aggr1.
7. When a Giveback fails, we thought that the backup cluster node would restart SMB. Turns out it doesn’t: ONTAP wants
administrative intervention at this point, so SMB remained stopped, thus disrupting all home and shared drive access for SMB
clients.
8. At this point, Randy had NetApp TAC engaged. He escalated past the front-line to a second tier tech, Mathew Ferguson.
Randy and Matt had trouble communicating: Randy was trying to explain that getting SMB running was his top priority; Matt
was focused on figuring out how to restore access to home_ad1 and scharpdata_aggr1.
9. Ken figured out the problem; Randy restarted SMB on Tungsten-B ~2:30pm, restoring home & shared directory services
(except, of course, for Administrative and SCHARP users).
10. Matt helped Randy restore Tungsten-A’s access to home_ad1 and scharpdata_aggr1.
11. Leadership decided to stall until 5:30pm before trying the next Giveback.
12. The evening Giveback succeeded: Tungsten-A could see all disks, per earlier work. But, for reasons we don’t understand, the
SCHARP vFiler’s IP address settings vanished. Sounds like a bug in ONTAP; we do not plan to investigate this further. Took
a while to figure this out and fix it.
13. Notice that Fred and Silo were unaffected, as usual, by the Cobalt / Tungsten pathology.
14. Notice also that NFS and iSCSI clients of Tungsten rode through using their usual mechanisms: unlike SMB, both protocols
employ various features to smooth over the bumps induced. Nevertheless, applications are not so forgiving: that initial 5
minute Takeover broke some of them. For example, some vColo Guests needed rebooting in order to recover. As far as we
know, we got lucky: no data loss or corruption.
15. Notice too that Harvard/Princeton crashed for several reasons. For starters, their timeouts are cranked down too low,
somewhere in the ~20 second range. MS SQL Server can survive 60-90 seconds without storage, and this is how Microsoft &
NetApp recommend configuring them. It is possible that Harvard/Princeton were correctly configured when we installed
them … but Microsoft patches, installed subsequently, have rolled back those changes to the default ~20 seconds. Regardless,
the first Takeover took ~5 minutes, so even if the MS SQL Servers (Harvard/Princeton, ~25 others which we have not
analyzed) had been configured correctly, they still would have crashed.
16. In addition, once Harvard/Princeton were restored to life, they could not gain access to one of their volumes – they saw it as
Unavailable. We believe this is due to a bug in Windows Clustering services. The team worked through the night to get
around this bug: nothing new here: we’ve seen this bug before … but Randy discovered an innovative work-around, which
reduced recovery time.
17. Restarting MS SQL Server and repairing the databases took many more hours. We left CPAS down for a week+, on account
of it being big (time-consuming to repair) and less important (end-users could live without it for an extended period of time,
extended by the snow event).
Long NetApp Failover

Overly long NetApp failover is not unique to Tungsten; it shows up on Iron as well. I propose to continue investigating this.
# Statistics / NetApp Failovers # # Client Flavor count min max mean mode stddev range
Tungsten RCA Report: Q1 2012 18 Created: 2012-02-08
Stuart Kendrick Updated: 2012-02-09
------------ -------- ----- ------ ------ ------ ------ ------ ------ carbon-a takeover 3 18 24 20 20 3 6 carbon-a giveback 3 2 6 4 4 2 4 carbon-b takeover 2 17 18 17 17 0 1 carbon-b giveback 2 2 7 4 4 3 5 iron-b takeover 2 21 286 153 153 187 265 iron-b giveback 2 30 53 41 41 16 23 tungsten-a takeover 5 6 43 19 19 14 37 tungsten-a giveback 5 21 151 69 69 57 130 tungsten-b takeover 8 26 808 210 210 263 782 tungsten-b giveback 7 12 71 31 16 24 59
# History of NetApp failovers
#
# Date        Time      Host        Flavor    Duration
# ----------  --------  ----------  --------  --------
2010-02-07    20:10:33  tungsten-a  takeover         6
2010-02-07    20:13:26  tungsten-a  giveback        22
2010-02-07    20:22:05  tungsten-b  takeover        28
2010-02-07    20:24:14  tungsten-b  giveback        16
2010-02-07    20:35:14  tungsten-a  takeover         8
2010-02-07    20:42:31  tungsten-a  giveback        21
2010-02-07    20:48:18  tungsten-b  takeover        26
2010-02-07    20:51:13  tungsten-b  giveback        12
2010-02-07    21:47:50  carbon-b    takeover        18
2010-02-07    21:52:20  carbon-b    giveback         7
2010-02-07    22:03:48  carbon-a    takeover        19
2010-02-07    22:09:14  carbon-a    giveback         6
2010-02-07    22:38:13  carbon-a    takeover        18
2010-02-07    22:39:18  carbon-a    giveback         2
2010-02-07    22:40:55  carbon-b    takeover        17
2010-02-07    22:42:51  carbon-b    giveback         2
2010-03-07    20:01:49  iron-b      takeover        21
2010-03-07    20:06:39  iron-b      giveback        53
2010-07-23    08:14:06  carbon-a    takeover        24
2010-07-23    16:24:22  carbon-a    giveback         4
2010-11-21    20:37:11  tungsten-b  takeover        97
2010-11-22    06:02:15  tungsten-b  takeover        44
2010-11-22    06:03:31  tungsten-b  giveback        16
2011-02-20    20:27:28  tungsten-a  takeover        16
2011-02-20    20:41:16  tungsten-a  giveback       106
2011-02-20    21:30:20  tungsten-b  takeover       109
2011-02-20    22:24:48  tungsten-b  giveback        18
2011-03-23    16:59:10  tungsten-b  takeover       808
2011-03-23    18:17:33  tungsten-b  giveback        23
2011-04-06    12:40:55  tungsten-a  takeover        23
2011-04-06    12:44:01  tungsten-a  giveback        46
2011-09-18    20:32:50  tungsten-a  takeover        43
2011-09-18    21:35:49  tungsten-a  giveback       151
2011-09-18    22:21:44  tungsten-b  takeover       281
2011-09-18    22:50:18  tungsten-b  giveback        63
2011-12-09    18:52:43  iron-b      takeover       286
2011-12-09    22:04:54  iron-b      giveback        30
2012-01-10    12:58:22  tungsten-b  takeover       290
2012-01-10    17:37:26  tungsten-b  giveback        71
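The summary statistics above can be reproduced from this failover history; a minimal sketch for the tungsten-b takeover row (the report appears to truncate fractional results rather than round them):

```python
import statistics

# Durations (seconds) of the eight tungsten-b Takeovers listed in the
# failover history above.
durations = [28, 26, 97, 44, 109, 808, 281, 290]

count = len(durations)
lo, hi = min(durations), max(durations)
mean = int(statistics.mean(durations))    # 210.375 truncated to 210
stdev = int(statistics.stdev(durations))  # sample stddev, truncated
rng = hi - lo

print(count, lo, hi, mean, stdev, rng)
# -> 8 26 808 210 263 782, matching the tungsten-b takeover row.
```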
#5 Different Fates

Why do people’s experiences of Tungsten distress vary?
Overview

Q: When Tungsten stumbles, some systems don’t notice, while others are down for a week: why the difference?
A: Clients vary both in their native capabilities and in how many defects (bugs and misconfigurations) they are carrying.
Outline
Table 7: Behavior by Client

Flavor    NFS Clients       iSCSI Clients     iSCSI Clients       SMB Clients        Admin Home Drive + SCHARP
                            (Patient MS SQL   (Impatient MS SQL                      (Hidden / Vanished Volumes)
                            Servers)          Servers)
--------  ----------------  ----------------  ------------------  -----------------  ---------------------------
Clients   Linux / OS X      Patient MS SQL    Impatient MS SQL    Windows / OS X     Any
                            Servers           Servers
Results   Several stalls    Several stalls    Isolated /          Disruption for     Disruption for many hours
                                              possibly corrupted  several hours
Why       Robust protocol   Robust protocol   Misconfiguration    Fragile protocol   Failure on Tungsten/Cobalt
                                                                  + cockpit error
Table 8: Why Clients Respond Differently
Client Flavor What Happens
NFS NFS Clients are robust – the protocol exhibits Zen-like patience.
iSCSI ▪ Properly configured/functioning iSCSI clients can survive a minute or more without storage and
employ multi-path strategies which allow them to overcome various pathologies.
▪ Impatient iSCSI clients give up quickly, typically ~20 seconds, and misconfigured clients do not
employ multi-pathing.
▪ Once an iSCSI client gives up, it runs the risk of corrupting its storage, requiring substantial manual
recovery.
SMB SMB Clients are fragile – after 30-45 seconds, they give up, and they have no multi-pathing capabilities.
Admin Home Drives + SCHARP   Screwed. Tungsten lost sight of these volumes; no amount of clever client protocols helped here.
Clumps

Notes
▪ Windows boxes generally speak a single language: SMB. Silo (Windows 2008) is unusual in that it speaks both SMB and
NFS.
▪ OS X boxes speak SMB and NFS natively; in our environment, we generally configure them to use SMB.
▪ Linux boxes also speak SMB and NFS natively; in our environment, we generally configure them to use NFS.
▪ vColo Hosts employ NFS to acquire their storage from Tungsten; the guests riding inside vColo use the entire gamut of
languages to speak with their clients.
SMB Clients: All Windows boxes; most OS X boxes.
Server Message Block (SMB)
The language spoken by Microsoft Windows for many functions, notably, home and shared drive access between clients and servers.
Written by IBM in the early 1980s, SMB v1 became the de facto standard protocol, implemented by NetApp filers, Samba servers, OS X
clients, and many others. SMB v2.0 shipped with Windows Vista and Windows Server 2008, SMB v2.1 with Windows 7 and Windows
Server 2008 R2, and SMB v2.2 will ship with Windows 8. Proprietary, though reverse-engineered by many vendors.
NFS Clients: vColo Hosts; most Linux boxes; some OS X boxes.
Network File System (NFS)
Written by Sun in 1985, the language spoken by Unix/Linux clients for home and shared drive access between clients and servers. We
generally employ NFSv3, which appeared in 1995, although progressive clients employ NFSv4 (appeared in 2000). Standards-based.
The standard protocol in large, high-performance environments where its resilience is valued.
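The “Zen-like patience” attributed to NFS clients elsewhere in this report comes from hard-mount semantics: a hard-mounted client blocks and retries indefinitely rather than returning errors to the application, so it simply stalls through a Takeover. A typical Linux fstab entry illustrating this (the server name and paths here are hypothetical, not from our environment):

```
# /etc/fstab -- hard mount: on a storage outage, I/O blocks and retries
# forever instead of failing back to the application
filer:/vol/home  /mnt/home  nfs  hard,nfsvers=3,tcp,rsize=65536,wsize=65536  0  0
```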
iSCSI Clients: ~25 MS SQL Servers, e.g. Enterprise SQL Cluster, SharePoint, numerous NAG boxes; also Exchange (although none of
the Exchange servers use Tungsten).
Internet Small Computer System Interface (iSCSI)
Driven by IBM, HP, and Cisco, with wide-ranging industry support. Standards-based. In contrast to SMB and NFS, used for block-
oriented storage access, a common requirement for commercial RDBMSs, although increasingly RDBMS vendors support file-
oriented storage access as well, i.e. over NFS and SMB (v2.2).
#6 Communication

See the Timeline section for when end-user communication occurred during the 2012-01-10 incident. See below for planned changes to
end-user communication; these are early drafts from the Incident Management Process Improvement Program.
Communication Processes for Priority 1 Incidents
Impact
• Widespread: entire Campus, or the entire user population of a service
  e.g. Messaging, Network
• Significant: a Building, a Division, or multiple Business Units
  e.g. Arnold Building, Exchange Storage Group, Basic Science Division, or a Post Office on Zimbra

Urgency
• Critical: Grant Critical or Grant Sensitive processes stopped, with no work-around
  e.g. ISIS (Exchange) Storage Group 1 down
• High: Grant Critical/Grant Sensitive processes affected, with a work-around; or any Security-related Incident
  e.g. breach or stolen laptop; Outlook won’t launch, but OWA is accessible
Template of Information (not all info needs to be delivered to each community)

Subject Line of Message:
Ticket ID:
Date of Incident:
Start Time:
Stop Time:
Description of Incident:
Scope:
Services Impacted:
Impact:
Urgency:
Priority:
Current Status:
Incident Commander:
Incident Commander Contact Info:
Bridge Line:
War Room:
Technician:
Next Scheduled Communication to LT:
Next Scheduled Communication to Incidents List:
Next Communication to EUC: