
Tungsten RCA Report: Q1 2012
Focus on 2012-01-10

Stuart Kendrick
Created: 2012-02-08
Updated: 2012-02-09

Contents

Overview
Summary
Outstanding
Team
#1 Sanity Check Disk Failure Rate
  Failed Disk History
  Disk Failure Rate
  Temperature vs Failure Rate
  Likelihood of a disk failure resulting in Catastrophic Failure
  Consult the Literature
    Further reading
#2 Consolidated Storage Major Events Timeline
#3 Cobalt Stumbles
#4 Tungsten Failovers Bumpy
  High-Level
  Detail
  Long NetApp Failover
#5 Different Fates
  Overview
  Outline
  Clumps
    SMB Clients
    NFS Clients
    iSCSI Clients
#6 Communication

Page 2: Tungsten RCA Report - skendric.com · This document summarizes the RCA team’s findings around the Wednesday January 10th incident, focused on the following questions. 1. Sanity-check

Tungsten RCA Report: Q1 2012 2 Created: 2012-02-08

Stuart Kendrick Updated: 2012-02-09

Tables

Table 1: Summary of Answers
Table 2: Outstanding Questions
Table 3: Team Statistics
Table 4: Drive Cage Temperatures
Table 5: Drive Failure and Position
Table 6: Time Between Failures
Table 7: Behavior by Client
Table 8: Why Clients Respond Differently


Overview

This document summarizes the RCA team's findings around the Wednesday January 10th

incident, focused on the following questions.

1. Sanity-check Cobalt disk failure rate against industry averages

2. Build a timeline for the week surrounding the event

3. Explain why Cobalt disk maintenance triggers Tungsten failovers

4. Explain why Tungsten failovers are bumpy

5. Explain why different clients/services suffer different fates

6. Describe how we communicated to end-users and what we might do differently next time

After the effort was initiated, leadership asked the team to add the following questions. The team did not reach them.

7. Explain how Tungsten’s CPU utilization plays into failover scenarios

8. Explore the consequences of breaking Tungsten HA

Duration: Wednesday January 18th – Thursday February 9th

For detailed explanation and recommendations around the Enterprise MSSQL Clusters

(Harvard/Princeton), see Enterprise-MSSQL-Clusters-Harvard-and-Princeton-Storage-

Events.doc from Don Butler.

Summary

Defect = bug, configuration error, or suboptimal configuration choice

Table 1: Summary of Answers

Question                 Answer
#1: Disk Failure Rate    Within industry norms; exacerbated by known defects in original disk inventory
#2: Timeline             See Timeline section
#3: Cobalt Stumbles      In progress
#4: Tungsten Failovers   Multiple influences, some still under investigation, others induced by defects in clients and
                         cockpit error
#5: Different Fates      Two of Tungsten's services failed entirely, knocking some clients off-line completely. For the
                         rest, client technology varies, with varying resilience to pathology, resulting in varying
                         responses to disruption
#6: Communication        See Timeline and Communication sections


Outstanding

This is what we haven't finished.

Table 2: Outstanding Questions

Question                 Detail
#1: Disk Failure Rate    What are the odds of experiencing a catastrophic disk failure on Cobalt, such that we would
                         lose most or all of the volumes it services?
#3: Cobalt Stumbles      Why does Cobalt spike latency when it loses a disk? Why does Tungsten notice, often to the
                         extent of inducing a multi-disk panic, while Silo and Fred ride through these events without
                         noticing?
#4: Tungsten Failovers   Why do our NetApps in general sometimes require an unusually long time to failover?
#7: CPU Utilization      How does Tungsten's CPU affect failover scenarios?
#8: Tungsten HA          Explore the consequences of breaking Tungsten HA: would this help?

Team

SME = Subject Matter Expert

[Extract Orwell report next week --sk.]

Table 3: Team Statistics

Member             Role                        Hours
Don Butler         SME: Harvard / Princeton
Nancy Crase        SME: Communication
Ken Kawakubo       SME: Storage
Stuart Kendrick    Facilitator + Team Lead
Robert McDermott   SME: Enterprise
Randy Rue          SME: Storage

#1 Sanity Check Disk Failure Rate

Cobalt Disk Failure Analysis
Robert McDermott

Failed Disk History

The Center's 3PAR T800 system named Cobalt has suffered 40 disk failures since it was put into
production status on December 10th 2009. Technically there were only 37 failed disks as 3 of the


40 were marked as failed when the magazine containing them failed [1]. Figure 1-1 shows Cobalt's
disk failure timeline.

Figure 1-1: Cobalt disk failure history.

Disk Failure Rate

The year 2011 is when the system started to see considerable use and when it reached its full 528-
disk capacity. Cobalt lost 23 disks in 2011 [2], bringing the annual disk failure rate to 4.36%. This is
in line with some reported industry figures but is much higher than any other storage system we
manage.
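For reference, here is a minimal sketch of the arithmetic behind that 4.36% figure; the 23 failures and the 528-disk population come from the text above, and treating the population as fixed for the whole year is a simplifying assumption:

# Annualized failure rate (AFR) for Cobalt in 2011, using the counts cited above.
failed_disks_2011 = 23
disk_population = 528    # full capacity, reached during 2011; treated as constant here

afr = failed_disks_2011 / disk_population
print(f"2011 annualized failure rate: {afr:.2%}")    # prints 4.36%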

Temperature vs Failure Rate

The unique disk cage architecture of the 3PAR is such that there are drives near the front of the
disk cage (cool) or near the back of the cage (hot). We looked at the positions of the failed disks to
see if temperature played a significant role in disk failure rates. If high temperature makes disks
fail, we should see that disks located closer to the back of the disk cage fail more often. Table 4
shows the dramatic 19 °F temperature difference between the front and back disks.

[1] PD 333 failed on 2011-05-20, and the magazine failed on 2011-05-21, causing the untimely failure of the remaining
disks in the magazine (332, 334, and 335).
[2] See the footnote above for the explanation of why I count 23 disk failures, not 26.


Table 4: Drive Cage Temperatures (°F)

Drive cage   Coolest (front, disk 3)   Hottest (back, disk 0)
0            77                        95
1            77                        95
2            77                        95
3            77                        95
4            80.6                      98.6
5            77                        96.8
6            75.2                      98.6
7            80.6                      100.4
8            78.8                      95
9            78.8                      98.6
10           80.6                      100.4
11           78.8                      96.8
12           75.2                      91.4
13           75.2                      93.2
14           73.4                      95
15           75.2                      93.2
16           73.4                      91.4
17           73.4                      91.4
18           73.4                      91.4
19           73.4                      93.2
Average      76.55                     95.27

Table 5 shows that the hottest disks didn't have a failure rate higher than the coolest disks. More
of the coolest disks failed than at any other position, but the difference is not likely to be
statistically significant (a quick check is sketched after Table 5).

Table 5: Drive Failure and Position

Disk position          Failed disk count
0 (back, hottest)      9
1                      9
2                      7
3 (front, coolest)     12
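One quick way to check the "not statistically significant" claim is a chi-square goodness-of-fit test of the Table 5 counts against a uniform spread across the four cage positions. A minimal sketch, assuming SciPy is available:

from scipy.stats import chisquare

# Failed-disk counts by cage position, from Table 5
# (position 0 = back/hottest, position 3 = front/coolest).
observed = [9, 9, 7, 12]

# Null hypothesis: failures are spread evenly across the four positions.
chi2, p = chisquare(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p:.2f}")
# The p-value comes out well above 0.05, so the position-to-position differences
# are consistent with chance, matching the conclusion above.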

Likelihood of a disk failure resulting in Catastrophic Failure

The 3PAR doesn't use traditional disk-based RAID groups, but instead makes use of 256MB
objects called "chunklets". This allows the system to start repairing itself immediately after a
drive has failed rather than waiting for a replacement disk to be installed in the system.

All of the RAID groups on Cobalt are RAID-6 with a set size of 14+2. The volumes have been
configured to be able to survive a failure of any one of the disk magazines (4 disks per magazine).
The volumes can survive the loss of any two chunklets per RAID group; each volume is
composed of many RAID groups.

The volumes on Cobalt should be able to handle the loss of any individual disk magazine or two
disks in separate magazines. Volume failure and data loss can be expected if three drives, sitting
in two or three different magazines, fail in a short amount of time, before the system has
protected itself from the first failure. We do not currently have an accurate measure of how long it
takes for the system to restore full RAID-6 protection after a disk failure; the longer this window,
the greater the risk.

Given all the above and the fact that the drives often fail in clusters, I lack the ability to produce
a mathematical model to determine the probability of catastrophic failure. Table 6 shows the
time between all disk failures (except the 3 due to the magazine failure). Even if the system takes
only 12 hours to restore full RAID-6 protection, we would have had two events where a second
disk failed before the system had restored full data protection (a sketch for counting these
windows follows the table). Still assuming 12 hours to restore protection, there would have been
two periods where we were at risk of a third disk failure taking out the volumes.

Table 6: Time Between Failures

Between      Hours

F1 – F2 1052.1

F2 – F3 2032.9

F3 – F4 45.88

F4 – F5 38.46

F5 – F6 1848.46

F6 – F7 1135.45

F7 – F8 833.66

F8– F9 30.4

F9 – F10 1481.03

F10 – F11 30.32

F11 – F12 1043.18

F12 – F13 12.6

F13 – F14 144.45

F14 – F15 180.16

F15 – F16 85.12

F16 – F17 582.76

F17 – F18 435.72

F18 – F19 1684.97

F19 – F20 41.28

F20 – F21 11.85

F21 – F22 39.87

F22 – F23 60.27

F23 – F24 8.3

F24 – F25 181.23

F25 – F26 875.93

F26 – F27 111.92

F27 – F28 17.07

F28 – F29 1119.7

F29 – F30 54.47

F30 – F31 243.92

F31 – F32 40.66

F32 – F33 12.95

F33 – F34 39.81

F34 – F35 450.35

F35 – F36 35.23

F36 – F37 46.35
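As referenced above, the at-risk windows can be counted mechanically from Table 6. A minimal sketch, assuming the inter-failure gaps have been copied into a list and using the 12-hour rebuild window discussed above:

# Hours between successive disk failures, copied from Table 6 (F1-F2 through F36-F37).
gaps_hours = [
    1052.1, 2032.9, 45.88, 38.46, 1848.46, 1135.45, 833.66, 30.4, 1481.03,
    30.32, 1043.18, 12.6, 144.45, 180.16, 85.12, 582.76, 435.72, 1684.97,
    41.28, 11.85, 39.87, 60.27, 8.3, 181.23, 875.93, 111.92, 17.07, 1119.7,
    54.47, 243.92, 40.66, 12.95, 39.81, 450.35, 35.23, 46.35,
]

rebuild_window_hours = 12.0   # assumed time to restore full RAID-6 protection

# A follow-on failure landing inside the rebuild window means we were briefly
# running without full RAID-6 protection.
at_risk = [g for g in gaps_hours if g < rebuild_window_hours]
print(f"{len(at_risk)} follow-on failures arrived within "
      f"{rebuild_window_hours:.0f} hours of the previous failure: {at_risk}")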

Consult the Literature
Stuart Kendrick

“Are Disks the Dominant Contributor for Storage Failures?

A Comprehensive Study of Storage Subsystem Failure Characteristics”

Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky


[University of Illinois + Network Appliance Inc]

Presented at the USENIX Conference on File and Storage Technologies, 2008

http://www.usenix.org/event/fast08/tech/full_papers/jiang/jiang.pdf

Large Data Set

Abstract

[…]

This paper analyzes the failure characteristics of storage subsystems. More specifically, we

analyzed the storage logs collected from about 39,000 storage systems commercially deployed at

various customer sites. The data set covers a period of 44 months and includes about 1,800,000

disks hosted in about 155,000 storage shelf enclosures.

[…]

Failure Rates and Reasons

Generally, disks experience an annualized failure rate of <1%. However, in the field, customers

replace disks at rates ranging from 2-4%, for various reasons explained in the paper.

Storage typically fails for a range of reasons, of which disk failure is a common but not dominant

factor.

In addition to disk failures that contribute to 20-55% of storage subsystem failures, other

components such as physical interconnects (including shelf enclosures) and protocol stacks also

account for significant percentages (27-68% and 5-10%, respectively) of failures.

Clustered failures are common

Each individual storage subsystem failure type and storage subsystem failure as a whole exhibit

strong correlations (i.e., after one failure, the probability of additional failures of the same type

is higher). In addition, failures also exhibit bursty patterns in time distribution (i.e., multiple

failures of the same type tend to happen relatively close together).

Further reading

Disk Failures

http://www.cs.cmu.edu/~bianca/project_pages/project_reliability.html

Empirical measurements of disk failure rates and error rates

http://arxiv.org/ftp/cs/papers/0701/0701166.pdf

Reliability analysis of disk drive failure mechanisms

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.1618&rep=rep1&type=pdf

Latent Sector Errors

http://www.usenix.org/event/fast07/tech/full_papers/pinheiro/pinheiro.pdf


Vibration

http://www.usenix.org/event/sustainit10/tech/full_papers/turner.pdf

Using Syslog Messages for Predicting Disk Failures

http://www.usenix.org/event/lisa10/tech/slides/featherstun.pdf

System Impacts of Storage Trends: Hard Errors and Testability

http://www.usenix.org/publications/login/2011-06/pdfs/Hetzler.pdf

Disks are like snowflakes: no two are alike

http://www.usenix.org/events/hotos/tech/final_files/Krevat.pdf

An Analysis of Data Corruption in the Storage Stack

http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/

#2 Consolidated Storage Major Events Timeline
Ken Kawakubo


[Figure: Consolidated Storage Major Events Timeline, 2009-12-03 through 2012-02-05 (kkawakub, 2012-02-04). The graphic highlights the CS go-live, the three Cobalt expansions, the Fred and Silo data-loss incidents, the Tungsten-A DIMM outage, the FlashCache installation, the periods of ~100% Tungsten-A CPU usage, and the three MDP failovers; these events are detailed in the table below.]


Date Event

2009-12-03 CS goes live. Cobalt (3PAR T800) started out with 2 cabinets with 4 controller nodes and

8 shelves (cages). (288) 1TB SATA drives provided 288TB raw or 222TB usable space.

2010 Steady migration of data to CS started.

2010-02-03 1st Cobalt disk failure

2010-04-08 2nd Cobalt disk failure

2010-08-08/12 3 Cobalt disks failed

2010-10-01 1st Cobalt Expansion - added 1 cabinet, 2 controller nodes, and 7 cages (shelves). (56) 2TB drives (in 14 magazines)

expanded the total raw capacity to 400TB or 308TB usable space

2010-10-12 1 Cobalt disk failed

2010-11-04 Fred lost one virtual disk provisioned from Cobalt. It took over a month to restore data from TSM backup tapes. 1st

major CS-related outage. 50 hours infra Ops (DAS and data protection) time

2010-11-12/13 Network configuration change work on Tungsten required a cluster failover. After the failover event, Tungsten-a

did not boot due to 2 bad DIMMs. The correct DIMMs did not arrive until the next morning. Also, some iSCSI LUNs went

offline. 12 hours Infra Ops time

2010-12-15 1 Cobalt disk failed

2011-01-09 2nd Cobalt Expansion – added (60) 2TB drives (in 15 magazines) and expanded the total raw capacity to 520TB or

398TB usable.

2011-01-18/19 2 Cobalt disks failed

2011-02-20 Tungsten Ontap upgraded to 7.3.5P1 to address network configuration and CIFS issues. During the upgrade, major

systems such as Zimbra and Enterprise SQL cluster were shut down in an orderly way to avoid their going offline.

Went mostly smoothly. Used efnet IRC chat for communication among administrators.

2011-03 Tungsten-a CPU usage nearing 100%.

2011-03-20 SCHARP migrates to Consolidated Storage.

2011-03-21 SCHARP users unable to employ CS on account of Firewall-based NATing; configuration change to Tungsten

fixes this. Firewall-induced slowdown plus high Tungsten CPU utilization continue to impact SCHARP.

2011-03-23/25 Cobalt lost 2 disks and Tungsten-a experienced very high CPU usage. Tungsten had its first MDP (multi-disk

failure panic) failover. The vendors' RCAs were not conclusive but pointed to the fact that on 2011-03-02, 3 Cobalt virtual

LUNs were each assigned duplicate disk IDs.

No data was lost.


20 vColo hosts hung, requiring reboot

1 vColo host required a restore from snapshot

Three databases on the Enterprise SQL Cluster required rebuilding

50 hours Infra Ops time

20 hours NAG tech time

2011-04 Efforts to lower the load on Tungsten-a started. 10Gbps links are added to Tungsten-b so that Tungstens can be

active-active. Vcolodata vfiler (one of the most CPU-intensive vfilers) is moved from Tungsten-a to Tungsten-b.

SIS (dedupe) is disabled on most volumes, etc. By the end of April, the CPU utilization on Tungsten-a stabilized at

about 70 percent.

2011-04-23 SCHARP Firewall retired.

2011-04-25 3rd and final Cobalt expansion – added two controllers, five cages (shelves) and (124) 2TB disks and expanded the

total raw capacity to 768TB or 590TB usable.

2011-05-06 2 Cobalt disks failed

2011-05-12 1 Cobalt disk failed

2011-05-20/23 3 Cobalt disks failed

2011-06-25 Completed adding disks / shuffling space (dynamic optimization) on Cobalt.

2011-07-05 1 Cobalt disk failed

2011-07-19 From 6-8pm, Tungsten-a’s CPU3 (Kahuna) was pegged at 100% while the other three CPUs stayed idle. First

“brownout” incident.

2011-08-15 SCHARP reported Tungsten performance issues. This incident turned out to be caused by faulty hardware in the

network.

2011-08-25/30 Cobalt port 4:0:1 CRC errors. Replaced fibre cables, transceivers, FCAL, and HBA.

2011-09-07/09 2 Cobalt disks failed

2011-09-13/20 4 Cobalt disks failed.

2011-09-18 Per recommendation from “Consolidated Storage Performance Analysis”, 512GB FlashCache modules are installed

on the Tungsten pair. At the same time, Ontap was upgraded to 7.3.5.1P5 to address a potentially critical

vulnerability to unexpected failovers during disk errors. The failover was orderly and did not cause any issues except for

one test iSCSI LUN we left on for testing. This LUN went missing and needed to be restored from backup.

2011-09-21 Silo lost S drive that was provisioned from Cobalt. TSM backup had not been keeping up with Silo changes and

some data was lost. This incident led to the changes in how TSM backed up Silo volumes (use of Journaling).


2011-09-27 1 Cobalt disk failed

2011-10/11 It became clear Cobalt would run out of space by the end of the year. Sun Thumpers were deployed and used as iSCSI

targets for Silo to give temporary space relief.

2011-11-03/04 2 Cobalt disks failed

2011-11-07/08 2 Cobalt disks failed

2011-12-25/27 2 Cobalt disks failed

2012-01-06 1 Cobalt disk failed

2012-01-10 Tungsten-a had its 2nd MDP (multi-disk failure panic) failover

Detailed sequence of events

11:45am HP tech arrives

12:07pm rrue alerts SOPS to maintenance via pager

12:04-12:13pm HP tech evacuates data from magazine

12:17pm HP tech removes magazine, replaces drive

12:21pm HP tech re-inserts magazine

12:31pm First of ~40 Nagios SERVICE ALERTs related to the incident (zimbra-mta1)

12:31pm SQL Cluster started complaining about I/O requests taking longer than 15 sec

12:32pm Another Cobalt disk failed

12:33pm Tungsten starts error-recovery procedures toward Cobalt

12:33pm Tungsten starts warning that it is losing its failover capabilities

12:34pm Numerous clients start warning of difficulties reaching Tungsten

12:50pm Cobalt FC ports 4:5:1 and 5:5:1, which connect to Tungsten-a ports 0a and 0c respectively, start SCSI resets (serious

Fibre Channel issues).

12:53pm Multi-disk failure on Tungsten-A: Tungsten-B begins takeover

12:54pm Nagios warns of Tungsten-A distress

12:58pm Takeover completes (290 seconds), but Tungsten-B cannot see the two disks whose loss caused Tungsten-A to

panic: 0c.0L26 and 0c.0L187

12:58pm rrue initiates giveback

13:00pm Nagios reports Tungsten Cluster status is Critical

13:09pm Giveback attempt completes but fails. CIFS is shut down.

13:15pm rrue engages NetApp TAC

13:18pm “[scicomp-announce] connectivity loss to storage” e-mail sent.


13:38pm The first CIT outage e-mail about Consolidated Storage sent. This outage notice pointed out that all resources were

running on Tungsten-B except for scharpdata aggregate 1 and adm_home aggregate which contained the missing

logical disks.

13:46pm CIT Project office sent out e-mail “Replicon and RMT currently offline”.

14:00pm FMS_Support sent out e-mail “[FMSlist] FMS Crystal Reporting Portal Down”.

14:19pm The second CIT outage e-mail sent and confirmed that CIFS access to vfilers was not available on Tungsten-b.

14:28pm rrue restarts CIFS

14:55pm The third CIT update sent and informed that CIFS access to all vfilers except for admhome and scharpdata had been

restored.

15:16pm CIT sent out AllHutch e-mail about the outage.

16:23pm CIT sent out an update that it was planning to restore Tungsten-a at 17:30pm.

16:38pm CIT sent out another AllHutch e-mail describing its efforts to restore services. The e-mail also pointed to

http://status.fhcrc.org for further updates.

17:36pm Tungsten cluster giveback started and completed in 71 seconds.

17:40pm After giveback, Tungsten-a found its logical disks 0c.0L26 and 0c.0L187 intact, and both scharpdata aggregate 1 and

admhome aggregate were restored and mounted.

17:40pm For some reason, the network configuration of scharpdata aggregate 1 vfiler was incorrect and it took some time to

restore access to the vfiler.

2012-01-11

05:37am Restoring the Enterprise SQL cluster iSCSI LUNs took until early the next morning. At this point, the only

remaining service which was down was CPAS on Princeton. CPAS was finally rebuilt on 2012-01-25.

No data was lost. However, this outage took 80 hours of DBA time and 30 hours of rrue’s time.

2012-01-18/20 Snow closes Hutch.

2012-01 Tungsten-a CPU usage was observed to be back to near 100%.

2012-01-29/31 3 Cobalt disks failed

2012-02-01 Tungsten-a had its 3rd MDP (multi-disk failure panic) failover

2012-02-02 Tungsten-a CPU3 (Kahuna) pegged at 100% while the other CPUs stayed idle, due to a CIFS lock bug.


#3 Cobalt Stumbles
Stuart Kendrick

I acquired sufficient access permissions to engage both HP and NetApp TAC on this starting Monday, February 6th; both TACs have
engaged. Preliminary results suggest a mix of pathology contributed by bugs in both manufacturers' products. More work remains:
digging through the data to which I now have access, responding to TAC queries for more information, and coordinating TAC analysis
sessions.

#4 Tungsten Failovers Bumpy
Stuart Kendrick, Randy Rue, Don Butler

High-Level

From a naïve point of view, we understand why Tungsten-A passes the buck to Tungsten-B: it loses sight of one or more of the

volumes which Cobalt provides it. Stepping backward in causation, we do not understand why Tungsten loses sight of these volumes

nor why Fred and Silo are unaffected – this is a work-in-progress (see #3 Cobalt Stumbles).

Once Tungsten loses sight of a volume, it cannot provide services to clients wanting that volume; this is why Administrative users lost

access to their home drives and why SCHARP users lost access to everything. [3]

Detail

Additionally, the 2012-01-10 event was particularly bumpy because we drove Tungsten into a ditch. Here's how that happened.

1. The Cobalt / Tungsten pathology started ~12:30pm.

[3] Similarly, during the 2012-02-01 event, this is why CRD users lost access to their home drives.


2. Tungsten-A Panicked, and Tungsten-B initiated a Takeover, at 12:50pm. This failover took ~5 minutes – unusually long and

enough to break many applications, whose timeouts generally expire after 1-5 minutes. This lengthy Takeover risks data loss

and corruption.

3. Immediately after the failover, ~1:00pm, Randy realized that Tungsten-B was not servicing home_ad1 and scharpdata_aggr1.

4. Wanting to restore service, Randy checked Tungsten-A’s status – Tungsten-A declared that it was ready for a Giveback. Randy

initiated a Giveback ~1:00pm, hoping that Tungsten-A would resume servicing home_ad1 and scharpdata_aggr1. Givebacks

(and Takeovers) involve shutting down SMB, a necessary step on account of this protocol’s fragility.

5. Well, Tungsten-A had Panicked because it couldn’t see home_ad1 and scharpdata_aggr1. The Panic, Takeover, and

subsequent Reboot hadn’t helped … Tungsten-A still couldn’t see these resources. Why did Tungsten-A advertise itself as

ready for a Giveback? It turns out ‘ready for Giveback’ involves a range of sanity checks … but does not include checking for

access to disks – we didn’t know that.

6. The Giveback failed at ~1:10pm, because Tungsten-A still couldn’t see home_ad1 and scharpdata_aggr1.

7. When a Giveback fails, we thought that the backup cluster node would restart SMB. Turns out it doesn’t: ONTAP wants

administrative intervention at this point, so SMB remained stopped, thus disrupting all home and shared drive access for SMB

clients.

8. At this point, Randy had NetApp TAC engaged. He escalated past the front-line to a second tier tech, Mathew Ferguson.

Randy and Matt had trouble communicating: Randy was trying to explain that getting SMB running was his top priority; Matt

was focused on figuring out how to restore access to home_ad1 and scharpdata_aggr1.

9. Ken figured out the problem; Randy restarted SMB on Tungsten-B ~2:30pm, restoring home & shared directory services

(except, of course, for Administrative and SCHARP users).

10. Matt helped Randy restore Tungsten-A’s access to home_ad1 and scharpdata_aggr1.

11. Leadership decided to stall until 5:30pm before trying the next Giveback.


12. The evening Giveback succeeded: Tungsten-A could see all disks, per earlier work. But, for reasons we don’t understand, the

SCHARP vFiler’s IP address settings vanished. Sounds like a bug in ONTAP; we do not plan to investigate this further. Took

a while to figure this out and fix it.

13. Notice that Fred and Silo were unaffected, as usual, by the Cobalt / Tungsten pathology.

14. Notice also that NFS and iSCSI clients of Tungsten rode through using their usual mechanisms: unlike SMB, both protocols

employ various features to smooth over the bumps induced. Nevertheless, applications are not so forgiving: that initial 5

minute Takeover broke some of them. For example, some vColo Guests needed rebooting in order to recover. As far as we

know, we got lucky: no data loss or corruption.

15. Notice too that Harvard/Princeton crashed for several reasons. For starters, their timeouts are cranked down too low,

somewhere in the ~20 second range. MS SQL Server can survive 60-90 seconds without storage, and this is how Microsoft &

NetApp recommend configuring them. It is possible that Harvard/Princeton were correctly configured when we installed

them … but Microsoft patches, installed subsequently, have rolled back those changes to the default ~20 seconds. Regardless,

the first Takeover took ~5 minutes, so even if the MS SQL Servers (Harvard/Princeton, ~25 others which we have not

analyzed) had been configured correctly, they still would have crashed. (A sketch for auditing the host disk-timeout setting appears after this list.)

16. In addition, once Harvard/Princeton were restored to life, they could not gain access to one of their volumes – they saw it as

Unavailable. We believe this is due to a bug in Windows Clustering services. The team worked through the night to get

around this bug: nothing new here: we’ve seen this bug before … but Randy discovered an innovative work-around, which

reduced recovery time.

17. Restarting MS SQL Server and repairing the databases took many more hours. We left CPAS down for a week+, on account

of it being big (time-consuming to repair) and less important (end-users could live without it for an extended period of time,

extended by the snow event).
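For what it is worth, the host-side disk I/O timeout on Windows lives in the registry at HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue (in seconds). Whether this is the specific setting that patches rolled back on Harvard/Princeton is an assumption; it is, however, the usual knob audited on iSCSI-attached SQL servers. A minimal read-only sketch for checking it:

import winreg   # Windows-only standard-library module

# HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue: how many seconds
# Windows waits on an outstanding disk request before failing it.
# If the value is absent, the OS default applies and QueryValueEx raises FileNotFoundError.
key_path = r"SYSTEM\CurrentControlSet\Services\Disk"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
    timeout_seconds, _ = winreg.QueryValueEx(key, "TimeOutValue")

# Flag hosts whose timeout falls short of the 60-90 second survival window
# cited above for MS SQL Server.
if timeout_seconds < 60:
    print(f"TimeOutValue = {timeout_seconds}s: likely too short to ride out a takeover")
else:
    print(f"TimeOutValue = {timeout_seconds}s")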

Long NetApp Failover

Overly long NetApp failover is not unique to Tungsten; it shows up on Iron as well. I propose to continue investigating this.

# Statistics / NetApp Failovers
#
# Client       Flavor    count   min   max  mean  mode  stddev  range
# ------------ --------  -----  ----  ----  ----  ----  ------  -----
carbon-a       takeover      3    18    24    20    20       3      6
carbon-a       giveback      3     2     6     4     4       2      4
carbon-b       takeover      2    17    18    17    17       0      1
carbon-b       giveback      2     2     7     4     4       3      5
iron-b         takeover      2    21   286   153   153     187    265
iron-b         giveback      2    30    53    41    41      16     23
tungsten-a     takeover      5     6    43    19    19      14     37
tungsten-a     giveback      5    21   151    69    69      57    130
tungsten-b     takeover      8    26   808   210   210     263    782
tungsten-b     giveback      7    12    71    31    16      24     59

# History of NetApp failovers
#
# Date        Time      Host          Flavor    Duration
# ----------  --------  ------------  --------  --------
2010-02-07    20:10:33  tungsten-a    takeover         6
2010-02-07    20:13:26  tungsten-a    giveback        22
2010-02-07    20:22:05  tungsten-b    takeover        28
2010-02-07    20:24:14  tungsten-b    giveback        16
2010-02-07    20:35:14  tungsten-a    takeover         8
2010-02-07    20:42:31  tungsten-a    giveback        21
2010-02-07    20:48:18  tungsten-b    takeover        26
2010-02-07    20:51:13  tungsten-b    giveback        12
2010-02-07    21:47:50  carbon-b      takeover        18
2010-02-07    21:52:20  carbon-b      giveback         7
2010-02-07    22:03:48  carbon-a      takeover        19
2010-02-07    22:09:14  carbon-a      giveback         6
2010-02-07    22:38:13  carbon-a      takeover        18
2010-02-07    22:39:18  carbon-a      giveback         2
2010-02-07    22:40:55  carbon-b      takeover        17
2010-02-07    22:42:51  carbon-b      giveback         2
2010-03-07    20:01:49  iron-b        takeover        21
2010-03-07    20:06:39  iron-b        giveback        53
2010-07-23    08:14:06  carbon-a      takeover        24
2010-07-23    16:24:22  carbon-a      giveback         4
2010-11-21    20:37:11  tungsten-b    takeover        97
2010-11-22    06:02:15  tungsten-b    takeover        44
2010-11-22    06:03:31  tungsten-b    giveback        16
2011-02-20    20:27:28  tungsten-a    takeover        16
2011-02-20    20:41:16  tungsten-a    giveback       106
2011-02-20    21:30:20  tungsten-b    takeover       109
2011-02-20    22:24:48  tungsten-b    giveback        18
2011-03-23    16:59:10  tungsten-b    takeover       808
2011-03-23    18:17:33  tungsten-b    giveback        23
2011-04-06    12:40:55  tungsten-a    takeover        23
2011-04-06    12:44:01  tungsten-a    giveback        46
2011-09-18    20:32:50  tungsten-a    takeover        43
2011-09-18    21:35:49  tungsten-a    giveback       151
2011-09-18    22:21:44  tungsten-b    takeover       281
2011-09-18    22:50:18  tungsten-b    giveback        63
2011-12-09    18:52:43  iron-b        takeover       286
2011-12-09    22:04:54  iron-b        giveback        30
2012-01-10    12:58:22  tungsten-b    takeover       290
2012-01-10    17:37:26  tungsten-b    giveback        71
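The summary statistics above can be reproduced directly from this history. A minimal sketch, assuming the rows have been parsed into (host, flavor, duration-in-seconds) tuples; only the tungsten-b takeovers are listed for brevity, and the mode column is omitted:

from collections import defaultdict
from statistics import mean, stdev

# (host, flavor, duration-in-seconds) tuples parsed from the failover history.
history = [
    ("tungsten-b", "takeover", 28), ("tungsten-b", "takeover", 26),
    ("tungsten-b", "takeover", 97), ("tungsten-b", "takeover", 44),
    ("tungsten-b", "takeover", 109), ("tungsten-b", "takeover", 808),
    ("tungsten-b", "takeover", 281), ("tungsten-b", "takeover", 290),
]

groups = defaultdict(list)
for host, flavor, duration in history:
    groups[(host, flavor)].append(duration)

for (host, flavor), durations in sorted(groups.items()):
    spread = stdev(durations) if len(durations) > 1 else 0.0
    print(f"{host:<12} {flavor:<8} count={len(durations):2d} "
          f"min={min(durations):4d} max={max(durations):4d} "
          f"mean={mean(durations):6.1f} stddev={spread:6.1f} "
          f"range={max(durations) - min(durations):4d}")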

#5 Different Fates Why do peoples’ experiences of Tungsten distress vary?

Overview Q: When Tungsten stumbles, some systems don’t notice, others are down for a week: why the difference?

A: Clients vary both in their native capabilities and in how many defects (bugs and misconfigurations) they are carrying.

Outline


Table 7: Behavior by Client

NFS Clients (Linux / OS X)
  Results: Several stalls
  Why: Robust protocol

iSCSI Clients, patient MS SQL Servers
  Results: Several stalls
  Why: Robust protocol

iSCSI Clients, impatient MS SQL Servers
  Results: Isolated / possibly corrupted
  Why: Misconfiguration

SMB Clients (Windows / OS X)
  Results: Disruption for several hours
  Why: Fragile protocol + cockpit error

Admin Home Drive + SCHARP (hidden / vanished volumes; any client)
  Results: Disruption for many hours
  Why: Failure on Tungsten/Cobalt

Table 8: Why Clients Respond Differently

Client Flavor        What Happens

NFS                  NFS clients are robust: the protocol exhibits Zen-like patience.

iSCSI                ▪ Properly configured/functioning iSCSI clients can survive a minute or more without
                       storage and employ multi-path strategies which allow them to overcome various pathologies.
                     ▪ Impatient iSCSI clients give up quickly, typically ~20 seconds, and misconfigured clients
                       do not employ multi-pathing.
                     ▪ Once an iSCSI client gives up, it runs the risk of corrupting its storage, requiring
                       substantial manual recovery.

SMB                  SMB clients are fragile: after 30-45 seconds, they give up, and they have no multi-pathing
                     capabilities.

Admin Home Drives    Screwed. Tungsten lost sight of these volumes; no amount of clever client protocols helped
+ SCHARP             here.


Clumps

Notes

▪ Windows boxes generally speak a single language: SMB. Silo (Windows 2008) is unusual in that it speaks both SMB and

NFS.

▪ OS X boxes speak SMB and NFS natively; in our environment, we generally configure them to use SMB.

▪ Linux boxes also speak SMB and NFS natively; in our environment, we generally configure them to use NFS.

▪ vColo Hosts employ NFS to acquire their storage from Tungsten; the guests riding inside vColo use the entire gamut of

languages to speak with their clients.

SMB Clients

All Windows boxes

Most OS X boxes

Server Message Block (SMB)

The language spoken by Microsoft Windows for many functions, notably, home and shared drive access between clients and servers.

Written by IBM in the early 1980s, SMB v1 has become the de facto standard protocol, implemented by NetApp filers, Samba servers, OS X clients, and

many others. SMB v2.0 shipped with Windows Vista and Windows 2008, SMB v2.1 with Windows 7 and Windows 2008R2, and

SMB v2.2 will ship with Windows 8. Proprietary, though reverse-engineered by many vendors.

NFS Clients

vColo Hosts

Most Linux boxes

Some OS X boxes

Network File System (NFS)

Written by Sun in 1985, the language spoken by Unix/Linux clients for home and shared drive access between clients and servers. We

generally employ NFSv3, which appeared in 1995, although progressive clients employ NFSv4 (appeared in 2000). Standards-based.

The standard protocol in large, high-performance environments where its resilience is valued.

iSCSI Clients

MS SQL Servers: ~25, e.g. Enterprise SQL Cluster, SharePoint, numerous NAG boxes
Exchange (although none of the Exchange servers use Tungsten)

Internet Small Computer System Interface (iSCSI)

Driven by IBM, HP, and Cisco, with wide-ranging industry support. Standards-based. In contrast to SMB and NFS, used for block-
oriented storage access, a common requirement for commercial RDBMSs, although increasingly RDBMS vendors are supporting file-
oriented storage access as well, i.e. over NFS and SMB (v2.2).

#6 Communication

See the Timeline section for when end-user communication occurred during the 2012-01-10 incident. See below for planned changes to

end-user communication; these are early drafts from the Incident Management Process Improvement Program.


Communication Processes for Priority 1 Incidents

IMPACT
• Widespread: Entire campus, or the entire user population of a service
  • e.g.: Messaging, Network
• Significant: A building, a division, or multiple business units
  • e.g.: Arnold Building, Exchange Storage Group, Basic Science Division, or a Post Office on Zimbra

URGENCY
• Critical: Grant Critical or Grant Sensitive processes stopped with no workaround
  • e.g.: ISIS (Exchange) Storage Group 1 down
• High: Grant Critical / Grant Sensitive process affected with a workaround, or any security-related incident
  • e.g.: breach or stolen laptop
  • e.g.: Outlook won't launch, but OWA is accessible


Template of Information (not all info needs to be delivered to each community)

Subject Line of Message:
Ticket ID:
Date of Incident:
Start Time:
Stop Time:
Description of Incident:
Scope:
Services Impacted:
Impact:
Urgency:
Priority:
Current Status:
Incident Commander:
Incident Commander Contact Info:
Bridge Line:
War Room:
Technician:
Next Scheduled Communication to LT:
Next Scheduled Communication to Incidents List:
Next Communication to EUC: