SSD Failures in Datacenters: What? When? And Why?
Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, Kushagra Vaid
The 9th ACM Systems And Storage Conference (SYSTOR 2016)

Why SSD Reliability?

- SSDs' popularity: 46.5% annual growth*
- Data reliability
- Datacenter decision support
- Limited field data
- Large-scale field data

*Source: IDC, Dec 2015

SSD Failures

Flash failures
- Media wear-out
- Data retention
- Program disturb
- Erase disturb

FTL mechanisms
- Wear levelling
- Error detection
- Error correction
- Flash correct and refresh, etc.

Fail-stop failures

SSD Reliability

[Figure: annualized failure rate (%, axis 0 to 1.2) by SSD model: 1-A, 1-B, 1-C, 1-D, 2-A. Group labels: Consumer, Enterprise. Annotations: AFR=0.61, AFR=0.73.]

- 5 large datacenters
- 4 major workloads
- 6 different rack SKUs

Various factors in a production environment could affect SSD failure trends very differently from lab test conditions.

Can we understand SSD failures in the presence of various factors?

Understanding SSD Failures – An analogy

SSD

Reactive

Proactive

What are the symptoms?

By analogy: fever, unexpected weight loss, low blood pressure

SSD:
- Data errors
- Reallocated sectors
- SATA downshift
- Program and erase failure

SSD Failure Symptoms

- Reallocated Sector Count
- Program and Erase Fail Count
- CRC and Uncorrectable Error Count
- SATA Downshift Count

[Figure: AFR (%) of devices with and without each symptom, for the four symptom counts listed above. Ratio annotations on the chart: 3.95X, 2.76X, 18X, 3.91X.]
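The with/without comparison on this slide comes down to simple arithmetic. The sketch below assumes AFR is computed as failures per device-year, expressed as a percentage (the usual definition; the slide does not spell it out). The function names and the device counts are hypothetical and chosen only to illustrate the calculation; they are not taken from the study.

```python
def afr_percent(failures: int, device_years: float) -> float:
    """Annualized failure rate as a percentage of device-years."""
    if device_years <= 0:
        raise ValueError("device_years must be positive")
    return 100.0 * failures / device_years


def symptom_ratio(with_symptom, without_symptom):
    """Ratio of the AFR with a symptom to the AFR without it.

    Each argument is a (failures, device_years) pair.
    """
    return afr_percent(*with_symptom) / afr_percent(*without_symptom)


# Hypothetical counts, not from the study.
with_realloc = (30, 1000.0)      # 30 failures over 1000 device-years
without_realloc = (60, 8000.0)   # 60 failures over 8000 device-years

afr_with = afr_percent(*with_realloc)
afr_without = afr_percent(*without_realloc)
ratio = symptom_ratio(with_realloc, without_realloc)
print(f"with: {afr_with:.2f}%  without: {afr_without:.2f}%  ratio: {ratio:.2f}X")
```

A ratio well above 1X, as in the annotations on the chart, means devices showing the symptom fail at a correspondingly higher annualized rate than devices that do not.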

Insufficiency of symptom-only diagnosis

[Figure: % of devices (axis 0 to 70) showing each symptom, failed vs. healthy devices. Categories: Reallocations, Program and Erase Fail, Data Errors, SATA Downshift, Any.]

Symptoms are seen in only 62% of failed devices.

What are the factors?

By analogy: lifestyle, genetics, environmental agents

SSD:
- Production environment
- Workload
- Design decisions

Device level correlating factors

- Average write rate of a device
- Average read rate of a device
- Total read and/or write usage
- Write Amplification
- Read Write Ratio

[Figure: AFR (%) vs. avg. host writes per day, binned from 10 to 50 and >50.]

Increasing failure trend at higher write rates.

More results in the paper.

Server level correlating factors

- SSD space utilization
- Disk space utilization
- Memory utilization
- Processor utilization

[Figure: AFR (%) vs. avg. disk space utilization (10 to 70).]

Decreasing failure trend at high disk space usage.

More results in the paper.

Datacenter factors

- Rack SKU
- Datacenter Facility

[Figure: AFR (%, axis 0 to 0.6) by SKU and SSD model: models 1-D and 2-A in SKUs S1-3a and S1-3b.]

Same model, different behavior.

More results in the paper.

Understanding SSD Failures – An analogy

SSD: symptoms and factors

Multi-feature analysis:
- Random forest based binary classification
- Permutation feature ranking
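Permutation feature ranking can be sketched independently of the classifier: shuffle one feature column at a time and measure how much accuracy drops. The example below is a minimal illustration under stated assumptions, not the authors' pipeline. A hand-written threshold rule (`toy_classifier`, hypothetical) stands in for the trained random forest, and the device rows are synthetic.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [row[:j] + (v,) + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for the trained model: flags a device as failed (1)
# when its data-error count exceeds 1, and ignores reads_per_day entirely.
def toy_classifier(row):
    data_errors, reads_per_day = row
    return 1 if data_errors > 1 else 0


# Synthetic devices as (data_errors, reads_per_day); not from the study.
rows = [(0, 5), (3, 7), (1, 9), (4, 2), (0, 8), (5, 1), (1, 4), (2, 6)]
labels = [toy_classifier(r) for r in rows]

scores = permutation_importance(toy_classifier, rows, labels, n_features=2)
ranking = sorted(range(2), key=lambda j: scores[j], reverse=True)
print("importance per feature:", scores)
print("ranking (most important first):", ranking)
```

Because the stand-in model never looks at `reads_per_day`, shuffling that column cannot change any prediction, so its importance is exactly zero, while shuffling `data_errors` degrades accuracy. With a real random forest the same loop applies, with `predict` replaced by the trained model's prediction function.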

Understanding What?

- What are the important factors?
- What is their order of importance?
- What are the important combinations?

Understanding What?

[Figure: feature importance (axis 0 to 1) for the following features, in chart order: DataErrors, ReallocSectors, TotalNANDWrites, HostWrites, TotalReads+Writes, AvgMemory, AvgSSDSpace, UsagePerDay, TotalReads, ReadsPerDay. The features are highlighted in three groups: symptoms, device workload, and server workload.]

Understanding What?

Frequent combinations of the top 8 important features:

Condition                                      Class
Data Errors <= 1 & Reallocated Sectors <= 5    H
Data Errors <= 1 & WAF <= 1                    H
Media Wear-out = 100 & WAF <= 1                H
Avg. SSD space >= 10                           F

The combinations are highlighted in three groups: symptoms, symptoms + workload, and workload.
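The conditions in the table above can be read as conjunctions of threshold tests. The sketch below transcribes them into data and reports which ones a device record satisfies. The dictionary keys (`data_errors`, `waf`, and so on) are hypothetical field names, and the sample record is made up. The slide does not say how overlapping rules are combined or what units apply, so the sketch only lists matches and does not pick a final class.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}

# Each rule: (list of (feature, comparison, threshold), class label from the table).
RULES = [
    ([("data_errors", "<=", 1), ("reallocated_sectors", "<=", 5)], "H"),
    ([("data_errors", "<=", 1), ("waf", "<=", 1)], "H"),
    ([("media_wearout", "==", 100), ("waf", "<=", 1)], "H"),
    ([("avg_ssd_space", ">=", 10)], "F"),
]


def matching_rules(device):
    """Return (rule_index, class) for every rule whose conditions all hold."""
    matches = []
    for i, (conditions, label) in enumerate(RULES):
        if all(OPS[op](device[name], threshold) for name, op, threshold in conditions):
            matches.append((i, label))
    return matches


# Hypothetical device record, not from the study.
example = {
    "data_errors": 0,
    "reallocated_sectors": 2,
    "waf": 3.0,
    "media_wearout": 100,
    "avg_ssd_space": 4,
}
print(matching_rules(example))  # [(0, 'H')]
```

Keeping the rules as data rather than hard-coded `if` statements makes it easy to swap in a different rule set, for example one extracted from a retrained model.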

Understanding When?

- What is the duration between detection and failure?
- What signatures characterize SSD survivability?

Understanding When?

[Figure: CDF of time to fail (months, axis 0 to 12). Annotation: 50% of failures, > 4 months.]

Sufficient time to intervene.

- Early failures (< 1 month): rules include symptoms and their thresholds
- Late failures: rules contain only workload factors
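The time-to-fail curve is an empirical CDF, which is straightforward to compute from a list of durations. The sketch below assumes each sample is the time in months between detection and failure; the sample values are hypothetical and not from the study.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return a function F where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical times to fail in months; not from the study.
time_to_fail = [0.5, 1.0, 2.5, 3.0, 4.5, 6.0, 7.5, 9.0, 10.0, 11.5]

F = empirical_cdf(time_to_fail)
frac_within_1 = F(1.0)        # share of failures within the first month
frac_after_4 = 1.0 - F(4.0)   # share of failures taking longer than 4 months
print(f"within 1 month: {frac_within_1:.0%}, after 4 months: {frac_after_4:.0%}")
```

Reading `1 - F(4)` off the curve gives the share of failing devices that leave more than four months to intervene.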

Understanding SSD Failures – An analogy

SSD: symptoms and factors

- Observation-based causal estimate
- Probabilistic causal models and Pearl's do-calculus
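One standard result that follows from do-calculus is the back-door adjustment, P(fail | do(X=x)) = sum over z of P(fail | X=x, Z=z) * P(Z=z), which holds when Z blocks all back-door paths from X to the outcome. The slide does not say which identification formula or causal graph was used, so the sketch below illustrates that general technique rather than the authors' model. The variables (X as a high-write-rate flag, Z as SSD model) and all records are hypothetical.

```python
from collections import Counter


def backdoor_estimate(records, x_value):
    """Estimate P(fail=1 | do(X=x_value)) by back-door adjustment over Z.

    records: iterable of (x, z, fail) tuples with fail in {0, 1}.
    Only valid if Z satisfies the back-door criterion for X -> fail.
    """
    records = list(records)
    n = len(records)
    z_counts = Counter(z for _, z, _ in records)
    total = 0.0
    for z, count_z in z_counts.items():
        stratum = [f for x, zz, f in records if x == x_value and zz == z]
        if not stratum:
            raise ValueError(f"no observations with X={x_value!r}, Z={z!r}")
        # P(fail=1 | X=x, Z=z) weighted by P(Z=z)
        total += (sum(stratum) / len(stratum)) * (count_z / n)
    return total


# Hypothetical observations (x, z, fail); not from the study.
records = (
    [(1, "A", f) for f in (1, 1, 0, 0)]
    + [(0, "A", f) for f in (0, 0, 0, 0)]
    + [(1, "B", f) for f in (1, 0, 0, 0)]
    + [(0, "B", f) for f in (0, 0, 0, 0, 0, 0, 0, 1)]
)

p_high = backdoor_estimate(records, 1)
p_low = backdoor_estimate(records, 0)
print(f"P(fail | do(X=1)) = {p_high:.3f}, P(fail | do(X=0)) = {p_low:.3f}")
```

The difference between the two estimates is an observational estimate of the causal effect of X, conditional on the assumed graph being right.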

Understanding Why?

- What factors impact SSD reliability?
- What is their magnitude of impact?

Understanding Why?

- SSD model and symptoms have direct impact
- Workload impacts failures through media wear-out

Concluding Remarks

• SSD Failures in the field
• Factors -> Symptoms -> Failures
• Important symptoms: Data Errors and Reallocated Sectors
• High intensity and rapid progression fails early
• Important factors: NAND Writes, Total Reads and Writes, etc.
• Direct impact: SSD Model and Symptoms
• Indirect impact: Workload through wear-out
• Future direction: prediction and control