Learning at Scale is Hard! - USENIX

Learning at Scale is Hard!

Outage Pattern Analysis and Dirty Data

Tanner Lund, Microsoft Azure SRE

@101010Lund

Photo: Rachel Chapman (CC)

Photo: Mo Riza (CC)

Learning (From Failure) At Scale

Trends: Identified

Antipatterns: Quashed

Reliability Work: Actually Gets Done

Appropriately Prioritized

Data Scientists:

Problem Management

Problem: “The cause of one or more incidents” – Information Technology Infrastructure Library (ITIL)

Sharing is caring!

Gathering data

Selecting models

Training said models

Evaluating models

You know what was harder?

Knowing what we’re actually looking for.

IDK, something amazing!

¯\(°_o)/¯

Fundamental Issue: ROOT CAUSES

Complex Systems fail in complex ways

“Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure”

- Richard I. Cook, “How Complex Systems Fail”

Let’s take a step back

Why do we do RCAs?

To stop bad stuff from happening (again)

Hunting for Contributing Factors (rather than Causes or Problems)

Outage (for our purposes):

Service or platform level issue that impacts customer experience

Postmortem Text Analysis

BeautifulSoup, NLTK, Gensim, pyLDAvis
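
The first stages of postmortem text analysis (stripping markup, tokenizing, counting terms) can be sketched roughly as below. This is a minimal illustration that uses only the Python standard library as a stand-in for BeautifulSoup and NLTK; it does not attempt the topic modeling or visualization that Gensim and pyLDAvis provide. The stopword list, function names, and sample postmortems are all hypothetical, not taken from the talk.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption); NLTK ships a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "by", "on", "for"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the markup around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase, split into word-like tokens, and drop stopwords."""
    words = re.findall(r"[a-z][a-z0-9_-]*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def term_counts(postmortems):
    """Count how often each term appears across all postmortems."""
    counts = Counter()
    for doc in postmortems:
        counts.update(tokenize(strip_html(doc)))
    return counts


# Hypothetical postmortem snippets, for illustration only.
SAMPLE_POSTMORTEMS = [
    "<p>The <b>certificate</b> expired and the gateway rejected traffic.</p>",
    "<p>A config change caused the gateway to drop traffic.</p>",
]
```

Calling `term_counts(SAMPLE_POSTMORTEMS).most_common(5)` lists the most frequent terms. Raw counts like these are easy to produce but, on their own, rarely say what to fix.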

Not actionable.

Big Deal™

Metrics!

Photo: Judy Witts (CC)

Pain Value

Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)
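
As a rough worked example, the formula might be applied to groups of related outages like this. The class, field names, units, and numbers are assumptions made for illustration: the slide does not say whether duration is a total or a per-outage average (this sketch assumes an average, in minutes), nor how severity is scaled.

```python
from dataclasses import dataclass


@dataclass
class OutageGroup:
    """A hypothetical bucket of related outages; all field names are assumptions."""
    name: str
    outage_count: int
    avg_duration_minutes: float   # assumed: average duration per outage
    severity: float               # assumed: numeric scale, higher is worse
    weighting_factor: float = 1.0


def pain_value(group):
    """Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)."""
    return (group.outage_count
            * group.avg_duration_minutes
            * group.severity
            * group.weighting_factor)


def rank_by_pain(groups):
    """Order groups from most to least painful."""
    return sorted(groups, key=pain_value, reverse=True)


# Invented numbers, for illustration only.
GROUPS = [
    OutageGroup("config-drift", outage_count=10, avg_duration_minutes=5.0, severity=1.0),
    OutageGroup("expired-certs", outage_count=3, avg_duration_minutes=40.0,
                severity=2.0, weighting_factor=1.5),
]
```

With these invented numbers, three long, severe certificate outages (3 * 40 * 2 * 1.5 = 360) outrank ten brief config-drift outages (10 * 5 * 1 * 1 = 50), so `rank_by_pain(GROUPS)` puts `expired-certs` first.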

Customers Impacted

Regions

Hardware SKUs

Distance Below SLO

Number of breached SLOs

Data Scientists:

Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)

Human interpretation still necessary

Photo: Wikimedia Commons

Missing/Insufficient Data

Incomplete Data

Inaccurate Data

It Was Definitely Network’s Fault

Our Certs Expired

Irrelevant Data

Ambiguity

Node – CPU
Node – Instance of Program
Node – Physical Hardware Box
Node – Point on Graph such that G = (V,E)
Node – Any device connected to the network
Node – Communication endpoint
Node – Client, Server, or Peer
Node – Bitcoin miner
Node – Data Type
Node – Node.js

Confounding Factors

(like config drift)

Dirty data will lie to you.
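
A few of the dirty-data categories above can be screened for before any analysis runs. The sketch below is one possible approach, not the process described in the talk: the record schema, required fields, and ambiguous-term list are all hypothetical, and irrelevant data and confounding factors are not covered because they generally need human judgment.

```python
# Hypothetical schema: the fields every outage record is assumed to carry.
REQUIRED_FIELDS = ("service", "duration_minutes", "severity", "contributing_factors")

# Terms assumed to need disambiguation before they can be analyzed.
AMBIGUOUS_TERMS = {"node"}


def find_data_issues(record):
    """Return a list of human-readable data-quality issues for one record."""
    issues = []

    for field in REQUIRED_FIELDS:
        if field not in record:
            issues.append(f"missing: {field}")          # field absent entirely
        elif record[field] in ("", None, []):
            issues.append(f"incomplete: {field}")       # field present but empty

    # A negative duration cannot be right, so flag it as inaccurate.
    duration = record.get("duration_minutes")
    if isinstance(duration, (int, float)) and duration < 0:
        issues.append("inaccurate: negative duration_minutes")

    # Flag free-text factors that use a term with many possible meanings.
    factors = record.get("contributing_factors") or []
    words = " ".join(str(f) for f in factors).lower().split()
    for term in sorted(AMBIGUOUS_TERMS):
        if term in words:
            issues.append(f"ambiguous: {term}")

    return issues
```

A record that passes these checks can still be wrong (a confidently recorded but mistaken cause looks perfectly clean), which is one reason human interpretation remains necessary.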

What was the (preliminary) result?

1. Surfaced surprise issues

2. Debunked production myths

3. Stronger arguments for prioritization of reliability work

What did we learn?

1. Define your hypotheses

2. Clean your data

3. Work your way up the DIKW pyramid

What else can we do?

Cross-Correlate Data Sets

Study your minor failures

Intelligently Calculate Risk

Continue to improve the RCA Process

Tanner Lund
@101010Lund
talund@Microsoft.com
/in/tannerlund