SSD Failures in Datacenters: What? When? And Why?
Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield,
Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, Kushagra Vaid
The 9th ACM Systems and Storage Conference (SYSTOR 2016)
Why SSD Reliability?
SSDs’ popularity: 46.5% annual growth* (*Source: IDC, Dec 2015)
Data reliability
Datacenter decision support
Limited field data
Large-scale field data
SSD Failures
Flash failures
- Media wear-out
- Data retention
- Program disturb
- Erase disturb
FTL mechanisms
- Wear levelling
- Error detection
- Error correction
- Flash correct and refresh, etc.
Fail-stop failures
SSD Reliability
[Figure: Annualized Failure Rate (%) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A); y-axis 0 to 1.2. Chart annotations: AFR=0.61, AFR=0.73; Consumer, Enterprise.]
5 large datacenters
4 major workloads
6 different rack SKUs
Various factors in a production environment can affect SSD failure trends very differently from lab test conditions.
Can we understand SSD failures in the presence of these factors?
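The AFR values quoted on the reliability slides are annualized failure rates. The slides do not spell out the formula; a common definition, assumed here, divides the number of failures by the total device-years of observation. The function name and the sample fleet below are hypothetical and are not data from the study.

```python
def annualized_failure_rate(failures: int, device_days: float) -> float:
    """Return AFR in percent, assuming AFR = failures / device-years * 100."""
    if device_days <= 0:
        raise ValueError("device_days must be positive")
    device_years = device_days / 365.0
    return 100.0 * failures / device_years


# Hypothetical fleet: 10,000 devices observed for 365 days each, 61 failures.
afr = annualized_failure_rate(failures=61, device_days=10_000 * 365)
print(f"AFR = {afr:.2f}%")  # prints "AFR = 0.61%"
```

Counting exposure in device-days rather than device counts lets devices that were deployed or retired mid-year contribute only the time they were actually in service.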
Understanding SSD Failures – An analogy
SSD
Reactive
Proactive
What are the symptoms?
Analogy (patient): fever, unexpected weight loss, low blood pressure
SSD: data errors, reallocated sectors, program and erase failure, SATA downshift
SSD Failure Symptoms
Reallocated Sector Count
Program and Erase Fail Count
CRC and Uncorrectable Error Count
SATA Downshift Count
[Figure: AFR (%) for devices with vs. without each symptom (Reallocated Sector Count, Program and Erase Failure Count, CRC and Uncorrectable Error Count, SATA Downshift Count); y-axis 0 to 3.5. Bar annotations: 3.95X, 2.76X, 18X, 3.91X.]
Insufficiency of symptom-only diagnosis
[Figure: % of devices (failed vs. healthy) showing each symptom: Reallocations, Program and Erase Fail, Data Errors, SATA Downshift, Any; y-axis 0 to 70.]
Symptoms seen only in 62% of failed devices
What are the factors?
Analogy (patient): lifestyle, genetics, environmental agents
SSD: production environment, workload, design decisions
Device level correlating factors
Average write rate of a device
Average read rate of a device
Total read and/or write usage
Write Amplification
Read Write Ratio
[Figure: AFR (%) vs. avg. host writes per day; x-axis bins from 10 to 50 and >50; y-axis 0 to 2.5.]
Increasing failure trend at higher write rates
More results in the paper
Server level correlating factors
SSD space utilization
Disk space utilization
Memory utilization
Processor utilization
[Figure: AFR (%) vs. avg. disk space utilization; x-axis 10 to 70; y-axis 0 to 1.2.]
Decreasing failure trend at high disk space usage
More results in the paper
Datacenter factors
Rack SKU
Datacenter Facility
[Figure: AFR (%) by SKU and SSD model: models 1-D and 2-A in SKUs S1-3a and S1-3b; y-axis 0 to 0.6.]
Same model, different behavior
More results in the paper
Understanding SSD Failures – An analogy
SSD
Symptoms, Factors
Multi-feature analysis
Random forest based binary classification
Permutation feature ranking
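Permutation feature ranking scores a feature by how much a trained classifier's accuracy drops when that feature's values are shuffled across devices. The sketch below illustrates only that ranking step under stated assumptions: a hand-written threshold function stands in for the random forest the slides mention, and the two-feature data set is synthetic. All names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)


def permutation_importance(model, rows, labels, n_features, repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild every row with feature j replaced by a shuffled value.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / repeats)
    return importances


# Synthetic data (made up): feature 0 plays the role of a data-error count,
# feature 1 is pure noise. Label 1 = failed, 0 = healthy.
data_rng = random.Random(42)
rows = [[data_rng.randint(0, 10), data_rng.randint(0, 10)] for _ in range(500)]
labels = [1 if row[0] > 5 else 0 for row in rows]


def stand_in_model(row):
    """Stand-in for a trained classifier: thresholds feature 0 only."""
    return 1 if row[0] > 5 else 0


scores = permutation_importance(stand_in_model, rows, labels, n_features=2)
ranking = sorted(range(2), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

Because the stand-in model ignores feature 1, shuffling it never changes a prediction and its score is zero, while shuffling feature 0 costs a large share of the accuracy. With a real random forest the same loop applies unchanged, since the procedure only needs a predict function.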
Understanding What?
What are the important factors?
What is their order of importance?
What are the important combinations?
Understanding What?
[Figure: Feature importance (0 to 1) for the features, as listed on the chart: DataErrors, ReallocSectors, TotalNANDWrites, HostWrites, TotalReads+Writes, AvgMemory, AvgSSDSpace, UsagePerDay, TotalReads, ReadsPerDay.]
Feature group highlighted: symptoms
Feature group highlighted: device workload
Feature group highlighted: server workload
Frequent combinations (combinations of top 8 important features)
Condition                                       Class
Data Errors <= 1 & Reallocated Sectors <= 5     H
Data Errors <= 1 & WAF <= 1                     H
Media Wear-out = 100 & WAF <= 1                 H
Avg. SSD space >= 10                            F
Rule group highlighted: symptoms
Rule group highlighted: symptoms + workload
Rule group highlighted: workload
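The condition/class table can be read as a set of threshold rules over device features. A minimal sketch of that reading follows, assuming H and F stand for healthy and failed; the dictionary keys and the sample device are hypothetical, and the slides do not say how overlapping rules are resolved, so the function simply reports every rule a device satisfies.

```python
# Each rule: (description, predicate over a feature dict, class label).
# Thresholds are copied from the table; key names are hypothetical.
RULES = [
    ("Data Errors <= 1 & Reallocated Sectors <= 5",
     lambda d: d["data_errors"] <= 1 and d["realloc_sectors"] <= 5, "H"),
    ("Data Errors <= 1 & WAF <= 1",
     lambda d: d["data_errors"] <= 1 and d["waf"] <= 1, "H"),
    ("Media Wear-out = 100 & WAF <= 1",
     lambda d: d["media_wearout"] == 100 and d["waf"] <= 1, "H"),
    ("Avg. SSD space >= 10",
     lambda d: d["avg_ssd_space"] >= 10, "F"),
]


def matching_rules(device):
    """Return (description, class) for every rule the device satisfies."""
    return [(desc, cls) for desc, pred, cls in RULES if pred(device)]


# Made-up device: few errors, WAF above 1, low average SSD space.
sample = {
    "data_errors": 0,
    "realloc_sectors": 2,
    "waf": 3.0,
    "media_wearout": 100,
    "avg_ssd_space": 5,
}
matches = matching_rules(sample)
print(matches)
```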
Understanding When?
What is the duration between detection and failure?
What signatures characterize SSD survivability?
Understanding When?
[Figure: CDF(x) of time to fail (months); x-axis 0 to 12; y-axis 0 to 1.]
50% of failures: time to fail > 4 months
Sufficient time to intervene
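The "50% of failures > 4 months" reading comes from the cumulative distribution of time to fail, taken here to mean the time between first detection and failure. A minimal sketch of an empirical CDF and that reading is below; the sample values are made up for illustration and are not data from the study.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return a function F where F(x) = fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


# Made-up time-to-fail values in months (not data from the study).
time_to_fail = [0.5, 1, 2, 3, 4, 5, 6, 8, 10, 12]
F = ecdf(time_to_fail)

# Share of failures that took longer than 4 months to materialize.
share_beyond_4_months = 1 - F(4)
print(share_beyond_4_months)  # prints 0.5
```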
Early failures (< 1 month): rules include symptoms and their thresholds
Late failures: rules contain only workload factors
Understanding SSD Failures – An analogy
Observation-based causal estimate
Probabilistic causal models and Pearl’s do-calculus
Understanding Why?
What factors impact SSD reliability?
What is their magnitude of impact?
Understanding Why?
SSD model and symptoms have a direct impact
Workload impacts failures through media wear-out
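The slides name Pearl's do-calculus without showing a formula. As general background, and not necessarily the estimator the authors used, the back-door adjustment is one standard result: when a set of observed covariates Z blocks every back-door path from X to Y, the effect of intervening on X can be computed from observational data alone. The symbols are illustrative assumptions: X could be a workload factor, Y device failure, and Z confounders such as SSD model.

```latex
P\bigl(Y = y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y = y \mid X = x,\, Z = z\bigr)\, P(Z = z)
```

The left-hand side is the failure probability if X were set to x by intervention; the right-hand side uses only quantities that can be estimated from field telemetry.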
Concluding Remarks
• SSD Failures in the field
• Factors -> Symptoms -> Failures
• Important symptoms: Data Errors and Reallocated Sectors
• Devices with high-intensity, rapidly progressing symptoms fail early
• Important factors: NAND Writes, Total Reads and Writes, etc.
• Direct impact: SSD Model and Symptoms
• Indirect impact: Workload through wear-out
• Future direction: prediction and control