Health monitoring & predictive analytics
To lower the TCO in a datacenter

Christian B Madsen & Andrei Khurshudov
Seagate Technology
[email protected]

2015 Data Storage Innovation Conference. © Seagate Technology. All Rights Reserved.

Outline

1. The opportunity
2. Our vision & implementation
3. Use cases
4. Summary

The opportunity

What if… Seagate offered you a technology that could help you

• Optimize system management
• Reduce potential cost of operation
• Improve datacenter efficiency

The problem: Failures in storage lead to costly outages

• 1 billion hard drives will be used in cloud datacenters by 2020, highlighting the need to manage drive health at scale
• One total outage per datacenter is statistically expected every year
• 80% of those outages are not completely explained (or linked to root causes)
• $700,000 is the average cost per incident
• $8,000 is the average cost per minute of an unplanned outage
• Up to 10% of datacenter accidents are related to storage

[Figure: in 2020, 56% of 13 ZB (7.3 ZB, more than 1 billion drives) in the cloud. Source: Seagate Strategic Marketing and Research 2013]

Better drive management will lower the TCO: Top 4 challenges in drive management

1. Drive health monitoring
   – Need reliable key performance indicators to track drive health status
2. Drive failure prediction
   – “Ultimately, we want to know when our drives will fail so we can take actions before that happens”
3. Drive failure diagnostics and management automation
   – Need to correctly identify and quickly resolve issues
   – Need to prevent false alerts to reduce cost of failure handling
4. Drive lifespan extension
   – Need to know how to optimize operating environment for better reliability
   – Need to reuse partially good drives (should be possible with in-drive diagnostic, IDD)

Our vision & implementation

Our vision: Monitoring, analytics, prediction and control – “The internet of things”²

• Drive-centric health monitoring
• Analytics and predictive models
• Closed-loop automation

[Figure: Cloud Gazer™ workflow and dashboard mock-up.
Stages: MONITORING, ANALYTICS, PREDICTIONS, CONTROL.
Stage labels: Data Aggregation, Actionable Decisions, Early Warning System, Quick Issue Resolution.
Capabilities: report storage health, run drive self-test; point at an issue, highlight inefficiency; predict reliability, detect anomalies; shut down systems, repair drives, run auto-FA.
Dashboard: Data Center, Global Access; sources S.M.A.R.T. and Ganglia; Rack 1 and Rack 2 with nodes Proxy-1, Proxy-2 and Swift-f1; metrics Read (GB), Write (GB), CPU (%), Temp (C), Bytes In, Bytes Out.
Alerts (2): 1. 4% > WL (W+R) spec; 2. 1% > 2σ(T).]
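The dashboard alert "1% > 2σ(T)" can be read as "1% of drives run more than two standard deviations above the mean temperature". That reading is an assumption, and the sketch below is only one plausible way to compute such a figure; the function name and the sample data are made up for illustration.

```python
from statistics import mean, pstdev

def fraction_above_two_sigma(temps_c):
    """Fraction of readings more than 2 standard deviations above the mean."""
    mu = mean(temps_c)
    sigma = pstdev(temps_c)  # population standard deviation of the fleet
    if sigma == 0:
        return 0.0  # all drives at the same temperature, nothing stands out
    limit = mu + 2 * sigma
    return sum(t > limit for t in temps_c) / len(temps_c)

# Made-up fleet: 99 drives at 35 C and one hot drive at 60 C.
temps = [35.0] * 99 + [60.0]
hot_fraction = fraction_above_two_sigma(temps)
print(f"{hot_fraction:.0%} of drives exceed the 2-sigma temperature limit")
```

A fleet-relative limit like this adapts to each datacenter's normal operating temperature, which a fixed threshold would not.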

Functional diagram: Monitoring, intelligent decisions and automation

[Figure: closed-loop automation flowchart.
General flow: Monitor; Compliance (threshold), with Yes/No branches; Exception (alert); Recommended action; Choose action; Resolution.
Example: Monitor; Drive health, with Passes/Not passing branches; Drive predicted to fail; Recommended action: reset or turn off drive; Choose action; Resolution: turn off drive.]

Automation: Choosing an action from the recommended actions can be automated by tying it to the specific application or by saving choices.
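The loop in the functional diagram can be sketched in a few lines. This is a minimal illustration, not the Cloud Gazer™ implementation: the `DriveStatus` type, the exception names, and the `SAVED_CHOICES` table are all hypothetical, and only the "reset" and "turn off" actions come from the slide.

```python
from dataclasses import dataclass

@dataclass
class DriveStatus:
    drive_id: str
    health_ok: bool          # outcome of the compliance (threshold) check
    predicted_to_fail: bool  # outcome of some failure predictor

# Hypothetical saved choices: exception name -> action picked earlier by an operator.
SAVED_CHOICES = {"predicted_to_fail": "turn_off", "health_check_failed": "reset"}

def closed_loop_step(status, saved_choices=SAVED_CHOICES):
    """One pass of monitor -> exception -> recommended action -> chosen action."""
    if status.health_ok and not status.predicted_to_fail:
        return "monitor"  # compliant: keep monitoring
    exception = "predicted_to_fail" if status.predicted_to_fail else "health_check_failed"
    recommended = ["reset", "turn_off"]  # actions named on the slide
    choice = saved_choices.get(exception)
    # Without a valid saved choice the loop stays open and a person decides.
    return choice if choice in recommended else "ask_operator"

print(closed_loop_step(DriveStatus("drive-7", health_ok=True, predicted_to_fail=True)))
```

Saving the operator's choice per exception type is what closes the loop: the next identical exception is resolved without human input.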

Implementation: Architecture overview

[Figure: architecture diagram.
Storage eco system: storage servers with their drives and storage software, plus agents.
Cloud Gazer™ elements: real-time metric aggregation, a noSQL data pool (10,000s of drives), an analytics engine (drive monitoring and analytics), a ReSTful API, and the Cloud Gazer™ Dashboard.
REST API calls address the cluster, server, and drive levels and are used to query data, check thresholds, and manage drives.]
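The slide states only that REST calls exist at the cluster, server, and drive levels. As a rough illustration of how such a hierarchy is commonly addressed, the sketch below builds nested resource URLs. The host name, the path layout, and the `resource_url` helper are assumptions made for this example and do not describe the actual Cloud Gazer™ API.

```python
from urllib.parse import quote

BASE_URL = "https://cloudgazer.example/api"  # placeholder host, not a real endpoint

def resource_url(cluster, server=None, drive=None, base=BASE_URL):
    """Build a URL for a cluster, a server in it, or a drive on that server."""
    if drive is not None and server is None:
        raise ValueError("a drive must be addressed through its server")
    parts = ["clusters", cluster]
    if server is not None:
        parts += ["servers", server]
    if drive is not None:
        parts += ["drives", drive]
    # Percent-encode each segment so unusual names cannot break the path.
    return base + "/" + "/".join(quote(p, safe="") for p in parts)

print(resource_url("rack-1"))
print(resource_url("rack-1", "swift-f1", "sda"))
```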

Use cases

[Figure: datacenter failure rate, with a "danger zone" marked]

Compliance (thresholds): Degradation and performance warnings

Overload detection: Detecting and reporting when drive load exceeds design limits
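Overload detection can be illustrated by annualizing each drive's read plus write traffic and comparing it with a rated workload limit, which is also one way to arrive at a dashboard figure such as "4% > WL (W+R) spec". The sketch below rests on assumptions: the function names are hypothetical and the 500 TB/year limit is a placeholder, not a Seagate specification.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def annualized_workload_tb(bytes_read, bytes_written, interval_s):
    """Scale read + write traffic observed over `interval_s` seconds to TB per year."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (bytes_read + bytes_written) / 1e12 * SECONDS_PER_YEAR / interval_s

def overloaded_fraction(drives, limit_tb_per_year):
    """drives: list of (bytes_read, bytes_written, interval_s) tuples."""
    if not drives:
        return 0.0
    over = sum(annualized_workload_tb(*d) > limit_tb_per_year for d in drives)
    return over / len(drives)

ONE_DAY_S = 86400
# Made-up sample: one busy drive (2 TB/day) and one light drive (0.5 TB/day).
sample = [(1e12, 1e12, ONE_DAY_S), (0.25e12, 0.25e12, ONE_DAY_S)]
HYPOTHETICAL_LIMIT_TB_PER_YEAR = 500  # placeholder, check the drive's datasheet
print(overloaded_fraction(sample, HYPOTHETICAL_LIMIT_TB_PER_YEAR))
```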

Recommended action: How to increase drive reliability

HDD population failure rate: Measuring stress and estimating failure acceleration of the disk drive population in real time. Relies on proprietary failure prediction algorithms.

Failure detection: Warning about expected drive failures. Relies on proprietary failure prediction algorithms that use unsupervised machine learning techniques. The expected average failure prediction time window is 9 to 12 days.

Workload optimization: Drive visibility tools to improve workload balancing

[Figure: workload charts by day, week, and month.
Before (load balancing issues): workload predominantly hitting one server; workload peaked on Sunday.
After: workload distributed over servers and time.]
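The "before" picture, with most of the workload landing on one server, can be detected with a simple share calculation. The sketch below is an illustration under stated assumptions: the server names and byte counts are invented, and the rule "more than twice the even share" is an arbitrary example threshold rather than anything taken from the tool.

```python
def workload_shares(bytes_by_server):
    """Each server's fraction of the total workload."""
    total = sum(bytes_by_server.values())
    if total == 0:
        return {name: 0.0 for name in bytes_by_server}
    return {name: value / total for name, value in bytes_by_server.items()}

def hot_servers(bytes_by_server, factor=2.0):
    """Servers whose share exceeds `factor` times a perfectly even share."""
    shares = workload_shares(bytes_by_server)
    even_share = 1.0 / len(shares) if shares else 0.0
    return sorted(name for name, s in shares.items() if s > factor * even_share)

# Made-up numbers in arbitrary units.
before = {"server-1": 900, "server-2": 50, "server-3": 50}
after = {"server-1": 340, "server-2": 330, "server-3": 330}
print(hot_servers(before), hot_servers(after))
```

The same calculation applied per weekday instead of per server would expose a peak such as the Sunday one mentioned on the slide.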

Unsupervised machine learning and failure prediction: No interaction between drives in the set, no prior knowledge

[Figure: prediction pipeline. Drives in field; multivariate time-series monitoring; failure prediction algorithm applied in parallel in real time; real-time status prediction for each drive (fine or going to fail).]

• For now, the average failure prediction window is on the order of 9 to 12 days
• Failure prediction accuracy ranges from 55% to 90%
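The actual prediction algorithm is proprietary and not described in the slides. As a stand-in that shares the two stated properties (each drive is judged on its own data only, and no labeled failures are needed), the sketch below flags a drive when any monitored attribute moves far from that drive's own history, using a plain z-score. The attribute names, the threshold of 3, and the sample readings are assumptions for illustration.

```python
from statistics import mean, pstdev

def drive_status(history, latest, z_limit=3.0):
    """Classify one drive from its own history, with no labels and no other drives.

    history: dict of attribute name -> list of past readings for this drive
    latest:  dict of attribute name -> newest reading
    """
    for attr, past in history.items():
        if len(past) < 2:
            continue  # not enough data to form a baseline
        mu, sigma = mean(past), pstdev(past)
        if sigma == 0:
            # A perfectly flat attribute that suddenly changes is treated as anomalous.
            if latest[attr] != mu:
                return "going to fail"
            continue
        if abs(latest[attr] - mu) / sigma > z_limit:
            return "going to fail"
    return "fine"

# Made-up readings for a single drive.
history = {"temp_c": [35, 36, 35, 34, 36], "read_errors": [0, 0, 1, 0, 1]}
print(drive_status(history, {"temp_c": 35, "read_errors": 1}))
print(drive_status(history, {"temp_c": 35, "read_errors": 9}))
```

Because each call touches one drive's data only, the check can run for every drive in parallel, which matches the "in parallel in real time" step of the pipeline.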

Prediction and follow up actions: Heat map indicates drives at risk and you can issue drive tests (DST, IDD, …) to resolve or corroborate

Systematic failure predicted: 3 out of 5 drives predicted to fail sit in the end location of servers
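Spotting a systematic pattern such as "3 out of 5 predicted failures sit in the end location" amounts to grouping predictions by physical position. The sketch below shows that grouping on invented data arranged to mirror the 3-of-5 example; the slot numbers, server names, and function name are hypothetical.

```python
from collections import Counter

def predicted_failures_by_slot(predictions):
    """predictions: iterable of (server, slot, predicted_to_fail) tuples."""
    return Counter(slot for _server, slot, failing in predictions if failing)

# Made-up predictions; slot 11 stands for the end location of each server.
predictions = [
    ("server-1", 11, True), ("server-2", 11, True), ("server-3", 11, True),
    ("server-1", 3, True), ("server-4", 7, True), ("server-2", 0, False),
]
counts = predicted_failures_by_slot(predictions)
top_slot, top_count = counts.most_common(1)[0]
total = sum(counts.values())
print(f"slot {top_slot}: {top_count} of {total} predicted failures")
```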

Find failure triggers: Root cause tools, including a drive temperature heat map, can help you triage the cause of your drive issues

A common factor for drives in the end position is a cooler temperature. Therefore, increasing the server temperature may reduce the (dominant) failure mechanism and increase drive reliability.

Failure prediction lead time: We can predict drives will fail on average 9-10 days before the failure

• Currently catch 55-90% of failures ahead of time
• Case study 1: we predicted most drives (118 drives) to fail 12 days prior to failure
• Case study 2: we predicted 5 drives to fail 23 days prior to failure, 2 drives 22 days prior to failure, … 2 drives just one day in advance

Summary

Why Cloud Gazer™?

• Truly drive-centric management tool for the cloud
• Most efficient tool for extracting drive health information using Seagate IP
  – Nobody knows drives better than us
  – Freeware utilities are frequently wrong
• Runs on any Linux system with little overhead (<1%)
  – Windows is next
• Data can be collected, monitored and analyzed locally or in the Cloud
• ReSTful API to interact with other software
• New analytics, prediction, AI, and control capabilities are added continually
• Drive repair will be possible with in-drive diagnostic
• Enclosure control will be possible by summer 2015

[Figure: "Simply SMARTer", comparison with competition (*Seagate drives)]

Questions?