Learning at Scale is Hard! Outage Pattern Analysis and Dirty Data Tanner Lund Microsoft Azure SRE @101010Lund

Learning at Scale is Hard! - USENIX

Learning at Scale is Hard!

Outage Pattern Analysis and Dirty Data

Tanner Lund

Microsoft Azure SRE

@101010Lund

Learning (From Failure) At Scale

Trends: Identified

Antipatterns: Quashed

Reliability Work: Actually Gets Done

Appropriately Prioritized

Data Scientists:

Problem Management

Problem: “The cause of one or more incidents”

– Information Technology Infrastructure Library (ITIL)

Sharing is caring!

Gathering data

Selecting models

Training said models

Evaluating models

You know what was harder?

Knowing what we’re actually looking for.

IDK, something amazing!

¯\(°_o)/¯

Fundamental Issue: ROOT CAUSES

Complex Systems fail in complex ways

“Each of these small failures is necessary to cause catastrophe but only a combination is sufficient to permit failure”

- Richard I. Cook, “How Complex Systems Fail”

Let’s take a step back

Why do we do RCAs?

To stop bad stuff from happening (again)

Hunting for Causes / Problems / Contributing Factors

Outage (for our purposes):

Service or platform level issue that impacts customer experience

Postmortem Text Analysis

BeautifulSoup

NLTK

Gensim

pyLDAvis
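The slides name the tools but not the pipeline, so the following is only a rough sketch of the first steps of postmortem text analysis: strip markup, tokenize, drop stopwords, count terms. It deliberately uses the Python standard library as a stand-in for BeautifulSoup and NLTK, and it stops short of the topic modeling that Gensim and pyLDAvis would cover. The stopword list, function names, and sample postmortems are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were",
             "is", "on", "for", "by", "our"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment, chunks joined by spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase, split into word-like tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z][a-z0-9_-]*", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def term_counts(postmortems):
    """Count term occurrences across a collection of HTML postmortems."""
    counts = Counter()
    for html in postmortems:
        counts.update(tokenize(strip_html(html)))
    return counts


# Invented sample documents, for illustration only.
postmortems = [
    "<p>The <b>certificate</b> expired and the frontend rejected traffic.</p>",
    "<p>Config drift caused the certificate rotation job to fail.</p>",
]
counts = term_counts(postmortems)
```

Term counts like these would be the input to a topic model, not a finding in themselves.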

Not actionable.

Big Deal™

Metrics!

Pain Value

Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)

Customers Impacted

Regions

Hardware SKUs

Distance Below SLO

Number of breached SLOs
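The Pain Value formula is a plain product, so it can be sketched in a few lines. The version below is an illustration under stated assumptions, not the team's implementation: the slides give no units or scales, so `avg_duration_minutes`, the numeric `severity` scale, the `OutageGroup` type, and the sample numbers are all hypothetical. The weighting factor is assumed to be a single multiplier derived from inputs such as those listed above.

```python
from dataclasses import dataclass


@dataclass
class OutageGroup:
    """A set of outages sharing one contributing factor (hypothetical shape)."""
    name: str
    outage_count: int
    avg_duration_minutes: float   # assumed unit; the slides do not specify one
    severity: float               # assumed numeric scale, higher is worse
    weighting_factor: float = 1.0  # assumed single multiplier


def pain_value(group):
    """Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)."""
    return (group.outage_count
            * group.avg_duration_minutes
            * group.severity
            * group.weighting_factor)


# Invented numbers, for illustration only.
groups = [
    OutageGroup("network blips", 5, 10.0, 1.0),
    OutageGroup("expired certificates", 3, 40.0, 2.0, weighting_factor=1.5),
]

# Rank contributing factors from most to least painful.
ranked = sorted(groups, key=pain_value, reverse=True)
```

In this made-up data the less frequent group ranks first because its outages are longer and more severe, which is the kind of comparison a single combined score makes possible.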

Data Scientists:

Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)

Human interpretation still necessary

Missing/Insufficient Data

Incomplete Data

Inaccurate Data

It Was Definitely Network’s Fault

Our Certs Expired

Irrelevant Data

Ambiguity

Node – CPU
Node – Instance of Program
Node – Physical Hardware Box
Node – Point on Graph such that G = (V,E)
Node – Any device connected to the network
Node – Communication endpoint
Node – Client, Server, or Peer
Node – Bitcoin miner
Node – Data Type
Node – Node.js

Confounding Factors

(like config drift)

Dirty data will lie to you.

What was the (preliminary) result?

1. Surfaced surprise issues

2. Debunked production myths

3. Stronger arguments for prioritization of reliability work

What did we learn?

1. Define your hypotheses

2. Clean your data

3. Work your way up the DIKW pyramid

What else can we do?

Cross-Correlate Data Sets

Study your minor failures

Intelligently Calculate Risk

Continue to improve the RCA Process

Tanner Lund

@101010Lund

[email protected]

/in/tannerlund