
Pitfalls in Measuring SLOs

Danyel Fisher (@fisherdanyel)
Liz Fong-Jones (@lizthegrey)


What do you do when things break?

How bad was this break?


Build new features! We need to improve quality!


Management

Engineering

Clients and Users

How broken is “too broken”?

What does “good enough” mean?

Combatting alert fatigue


A telemetry system produces events that correspond to real world use

We can describe some of these events as eligible

We can describe some of them as good


Given an event, is it eligible? Is it good?

Eligible: “Had an HTTP status code”

Good: “… that was a 200, and was served in under 500 ms”
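A minimal sketch of that classification (the field names status_code and duration_ms are assumptions for illustration, not from the talk):

```python
# Sketch: classify a telemetry event for SLI purposes.
# Field names (status_code, duration_ms) are illustrative assumptions.

def is_eligible(event: dict) -> bool:
    """An event counts toward the SLO if it has an HTTP status code."""
    return "status_code" in event

def is_good(event: dict) -> bool:
    """An eligible event is good if it returned 200 in under 500 ms."""
    return (
        is_eligible(event)
        and event["status_code"] == 200
        and event["duration_ms"] < 500
    )
```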


An SLO sets a minimum quality ratio over a period of time.

The error budget is the number of bad events allowed.
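In event terms, that budget is easy to compute; a hedged sketch, where the target and event count are example values:

```python
# Sketch: the error budget as "number of bad events allowed" over a window,
# given an event-based SLO target. Inputs are example values.

def error_budget(slo_target: float, eligible_events: int) -> int:
    """Bad events allowed over the window, e.g. 0.999 of 1,000,000 -> 1000."""
    return int((1 - slo_target) * eligible_events)

print(error_budget(0.999, 1_000_000))  # 1000 bad events allowed
```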


Deploy faster

Room for experimentation

Opportunity to tighten SLO


SLO statement                              Target    Error budget (per ~30 days)
We always store incoming user data         99.99%    ~4.3 minutes
Default dashboards usually load in < 1 s   99.9%     45 minutes
Queries often return in < 10 s             99%       7.3 hours
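A rough sketch of where those budget figures come from, assuming a 30-day window (the slide rounds slightly differently, e.g. 45 minutes vs. 43.2):

```python
# Sketch: convert an SLO target into an allowed-downtime budget,
# assuming a 30-day window.

def downtime_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total unavailability allowed per window."""
    return (1 - slo_target) * window_days * 24 * 60

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%}: {downtime_budget_minutes(target):.1f} minutes")
# 99.00%: 432.0 minutes (~7.2 hours)
# 99.90%: 43.2 minutes
# 99.99%: 4.3 minutes
```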


User Data Throughput

We blew through three months’ budget in those 12 minutes.


We dropped customer data

We rolled it back (manually)

We communicated to customers

We halted deploys


We checked in code that didn’t build.

We had experimental CI build wiring.

Our scripts deployed empty binaries.

There was no health check and rollback.

We stopped writing new features

We prioritized stability

We mitigated risks


SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair.


Learning from SLOs


SLOs as espoused at Google vs. in practice

We told you "SLOs are magical".

But not everyone has Google's tooling.

And SLO practice at Google is uneven.

Debugging SLOs is often divorced from reporting SLOs.

There had to be a better way, leveraging Honeycomb...


Using a Design Thinking perspective:

● Expressing and Viewing SLOs
● Burndown Alerts and Responding
● Learning from our Experiences


Displays and Views


See where the burndown was happening, explain why, and remediate


Expressing SLOs

Time-based: “How many 15-minute periods had a P99(duration) < 50 ms?”

Event-based: “How many events had a duration < 500 ms?”
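A minimal sketch contrasting the two styles (event fields like duration_ms and timestamp, and the bucketing scheme, are assumptions for illustration):

```python
# Sketch: event-based vs. time-based SLI evaluation.
# Event fields and the bucketing scheme are illustrative assumptions.
from collections import defaultdict

def event_based_sli(events):
    """Fraction of events with duration < 500 ms."""
    good = sum(1 for e in events if e["duration_ms"] < 500)
    return good / len(events)

def time_based_sli(events, period_s=900):
    """Fraction of 15-minute periods whose P99(duration) < 50 ms."""
    buckets = defaultdict(list)
    for e in events:
        buckets[e["timestamp"] // period_s].append(e["duration_ms"])
    good_periods = 0
    for durations in buckets.values():
        durations.sort()
        p99 = durations[int(0.99 * (len(durations) - 1))]  # nearest-rank-ish P99
        if p99 < 50:
            good_periods += 1
    return good_periods / len(buckets)
```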


Status of an SLO


How have we done?


Where did it go?


When did the errors happen?


What went wrong?

High-dimensional data

High-cardinality data


Why did it happen?


See where the burndown was happening, explain why, and remediate


User Feedback

“I’d love to drive alerts off our SLOs. Right now we don’t have anything to draw us in, and we have some alerts on the average error rate, but they’re a little too spiky to be useful.”


Burndown Alerts


How is my system doing? Am I over budget?

When will my alarm fail?


When will I fail?

User goal: get alerts based on time to budget exhaustion

Human-digestible units

24 hours: “I’ll take a look in the morning”

4 hours: “All hands on deck!”
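A hedged sketch of that mapping, assuming a simple linear projection of the current burn rate (the 4- and 24-hour thresholds come from the slide; the numbers in the usage line are made up):

```python
# Sketch: project budget-exhaustion time from the current burn rate and
# map it to a human-digestible response. Linear projection is an assumption.

def hours_to_exhaustion(budget_remaining: float, burn_per_hour: float) -> float:
    """Hours until the error budget hits zero at the current burn rate."""
    if burn_per_hour <= 0:
        return float("inf")
    return budget_remaining / burn_per_hour

def response(hours: float) -> str:
    if hours <= 4:
        return "All hands on deck!"
    if hours <= 24:
        return "I'll take a look in the morning"
    return "No alert yet"

# Example: 120 bad events of budget left, burning 40 per hour -> ~3 hours.
print(response(hours_to_exhaustion(budget_remaining=120, burn_per_hour=40)))
```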


Implementing Burn Alerts

Run a 30-day query, at a 5-minute resolution, every minute
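A rough sketch of that loop under stated assumptions; run_slo_query and page are hypothetical hooks standing in for the real query engine and pager, not Honeycomb's API:

```python
# Sketch: a burn-alert loop. run_slo_query() and page() are hypothetical
# placeholders for the event-store query and the alerting hook.
import time

WINDOW_DAYS = 30
RESOLUTION_MINUTES = 5
CHECK_INTERVAL_SECONDS = 60

def check_burn(run_slo_query, budget_total, exhaustion_threshold_hours, page):
    # Re-run the 30-day SLO query at 5-minute resolution.
    buckets = run_slo_query(window_days=WINDOW_DAYS,
                            resolution_minutes=RESOLUTION_MINUTES)
    bad_so_far = sum(b["bad_events"] for b in buckets)
    budget_left = budget_total - bad_so_far

    # Project exhaustion from the most recent bucket's burn rate.
    burn_per_hour = buckets[-1]["bad_events"] * (60 / RESOLUTION_MINUTES)
    if burn_per_hour > 0:
        hours_left = budget_left / burn_per_hour
        if hours_left <= exhaustion_threshold_hours:
            page(f"SLO budget projected to exhaust in ~{hours_left:.1f} hours")

def burn_alert_loop(run_slo_query, budget_total, page):
    while True:  # every minute
        check_burn(run_slo_query, budget_total,
                   exhaustion_threshold_hours=24, page=page)
        time.sleep(CHECK_INTERVAL_SECONDS)
```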


Learning from Experience


Volume is important: tolerate at least dozens of bad events per day

Faults
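A quick arithmetic sketch of why volume matters; the per-day traffic figures are made-up examples:

```python
# Sketch: how many bad events per day an SLO tolerates at a given volume.
# The traffic figures below are made-up examples.

def bad_events_allowed_per_day(slo_target: float, events_per_day: int) -> float:
    return round((1 - slo_target) * events_per_day, 1)

print(bad_events_allowed_per_day(0.999, 1_000))    # 1.0   -> one bad event uses the whole day's budget
print(bad_events_allowed_per_day(0.999, 100_000))  # 100.0 -> dozens of bad events are tolerable
```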


SLOs for Customer Service
Remember that user having a bad day?



Blackouts are easy… but brownouts are much more interesting


Timeline

1:29 am   SLO alerts. “Maybe it’s just a blip” (a 1.5% brownout for 20 minutes)
4:21 am   Minor incident. “It might be an AWS problem”
6:25 am   SLO alerts again. “Could it be ALB compat?”
9:55 am   “Why is our system uptime dropping to zero?” It’s out of memory, and we aren’t alerting on that crash.
10:32 am  Fixed


Cultural Change
It’s hard to replace alerts with SLOs
But a clear incident can help


Reduce Alarm Fatigue
Focus on user-affecting SLOs
Focus on actionable alarms


Conclusion


SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair.


You can do it too
And maybe avoid our mistakes

Pitfalls in Measuring SLOs

Email: danyel@honeycomb.io / lizf@honeycomb.io

Twitter: @fisherdanyel & @lizthegrey

Come talk to us in Slack!

hny.co/danyel
