
Velocity NY 2014: Signal through the noise (with presenter notes)


DESCRIPTION

In recent years it’s become evident that alerting is one of the biggest challenges facing modern operations engineers. Conference talks, hallway tracks, meetups, etc. are rife with discussions about poor signal/noise in alerts, fatigue from false positives, and a general lack of actionability. Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics/alerting service) takes a fundamental approach and examines alerting requirements and practices in the abstract. We put forth a comprehensive abstract model with best practices that should be followed and implemented by your team regardless of your tool of choice. This talk is equal parts cultural and technical, encompassing both computational capabilities as well as social practices, like:

•Defining organizational policy about where and when to set alerts

•Ensuring the on-call engineer is armed with the proper information to take action

•Best practices for configuring an alert

•Fire-fighting after an alert has triggered

•Performing analysis across your organization-wide history of alerts


Page 1: Velocity NY 2014: Signal through the noise (with presenter notes)

hi.

Hi everybody

Page 2: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

How We Computer

I’m Dave Josephsen. I’m the developer evangelist at Librato, and I’ll also mention,

Page 3: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

How We Computer

since this is an O’Reilly conference, that I’m one of the co-authors of the O’Reilly book on Ganglia, which by the way has nothing to do with brain tumors.

Page 4: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

How We Computer

I’ve actually written several books, which is exciting because they all have squiggly things on the cover. So not only am I batting 100 in this regard, I am the world’s foremost expert; if you have a book project and plan to have a squiggly thing on the cover, you’ll find my contact info just there.

Page 5: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

Signal Through the Noise

I’m here today to talk to you about alerting, and how to make alerting better, mainly by improving the input signal that you emit to your notification system.

Page 6: Velocity NY 2014: Signal through the noise (with presenter notes)

and so our journey begins in the wee early morning hours of the predawn. You’re in bed, happily dreaming.

Page 7: Velocity NY 2014: Signal through the noise (with presenter notes)

And in your dream you come across a tree that is growing Tracy Chapmans, and you’re like, sweet, I love Tracy Chapman. But you’re

Page 8: Velocity NY 2014: Signal through the noise (with presenter notes)

a good person, and you don’t want to bogart all the Tracy Chapman, so you pick a small one for yourself. And you’re like, ohmygod, you ROCK, miniature Tracy Chapman.

Page 9: Velocity NY 2014: Signal through the noise (with presenter notes)

And miniature Tracy Chapman looks down at her guitar, smiles in that knowing way she does, and begins to sing.

Page 10: Velocity NY 2014: Signal through the noise (with presenter notes)

But instead of sultry blues, nothing but a horrible combination of buzzing and ringing comes out. Then she drops her guitar and starts smacking you in the face. But it’s not your dream face, it’s your real face, because you realize you’re dreaming, so you struggle to open your eyes

Page 11: Velocity NY 2014: Signal through the noise (with presenter notes)

and find that your cat is sitting on your neck, pummeling you repeatedly in the face, and your phone is making this horrible buzzing/ringing noise. Why is your phone going off in the middle of the night? Has something horrible happened? Is someone hurt? Is it the end of the world? So now you have a huge dump of adrenaline and fear, and you clumsily shoo the cat away, knocking your lamp off the table in the process. So, without the benefit of light, you grope blindly for your phone, ripping it from the charger, and barely manage to get it unlocked to stop the noise,

Page 12: Velocity NY 2014: Signal through the noise (with presenter notes)

WAT?

only to be presented with this. It takes you a few moments to understand what you’re seeing. It looks like a web balancer is complaining about some of its hosts being unresponsive. Cue the second adrenaline dump: what if you can’t figure out what’s going on? What if you can’t fix it? What if this is just the first of a deluge of alerts signaling the death of your entire infrastructure?

Page 13: Velocity NY 2014: Signal through the noise (with presenter notes)

WAT?

So now, conscious enough to locate and grab your laptop, you rip it open and check the various graphs you have at your disposal. Everything looks normal on the bandwidth graph, so you begin to ssh into the balancer to see if you can tell which hosts are actually down, when you get…

Page 14: Velocity NY 2014: Signal through the noise (with presenter notes)

AAAGHHHHH!!!

a recovery notification from your monitoring system. Yeah, sorry. False alarm. Are you kidding me?! 4:30 in the morning. Your heart is beating heavily in your chest. You are wired, angry, and wide awake, and your morning alarm will go off in 2.5 hours. You don’t have time to calm down and get any meaningful sleep, and your productivity will be measurably impaired for the rest of the day.

Page 15: Velocity NY 2014: Signal through the noise (with presenter notes)

ALERTS AREN’T FREE

So the first point I’d like to make is: alerts are actually really expensive. Not only do they hurt people and disrupt people’s lives, they’re a huge burden on productivity.

Page 16: Velocity NY 2014: Signal through the noise (with presenter notes)

Business Projects

IT Projects

Changes

Unplanned Work

If you’ve studied the Gene Kim school of DevOps, you know there are four types of work, and that one among them, unplanned work, is the most maligned.

Page 17: Velocity NY 2014: Signal through the noise (with presenter notes)

Unplanned Work

(eeew Comic Sans)

Unplanned work is basically toxic. It disrupts every other kind of work and causes good people undue grief. If it were a font, it would be Comic Sans. And so if you could take

Page 18: Velocity NY 2014: Signal through the noise (with presenter notes)

Unplanned Work

So if you could take unplanned work, and load it into a bullet

Page 19: Velocity NY 2014: Signal through the noise (with presenter notes)

Unplanned Work

and then load that bullet into a cannon, and shoot it at happy, and otherwise productive people

Page 20: Velocity NY 2014: Signal through the noise (with presenter notes)

Alerting

That’s basically what we’re doing with alerting. We’re packaging up unplanned work and launching it at people like a bunch of angsty Romans with a trebuchet.

Page 21: Velocity NY 2014: Signal through the noise (with presenter notes)

Tax the Ammunition

And we do it without giving it a second thought. Our tools make it cheap and easy to fire alerts, so maybe we should make the bullets more expensive.

Page 22: Velocity NY 2014: Signal through the noise (with presenter notes)

And this turns out to be a pretty good idea: we can reduce the amount of unplanned work by increasing the cost of individual alerts. All we need to do is require that every alert include a run-book URL.
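To make that policy concrete, here’s a minimal sketch of an alert-creation gate that refuses any alert lacking a run-book link. Everything in it (`AlertDefinition`, `create_alert`, the example metric and URL) is hypothetical, not any particular monitoring tool’s API:

```python
# Hypothetical sketch: enforce the "every alert ships with a run book" policy
# at alert-creation time.
from dataclasses import dataclass

@dataclass
class AlertDefinition:
    name: str
    metric: str
    threshold: float
    runbook_url: str = ""

class MissingRunbookError(ValueError):
    pass

def create_alert(defn: AlertDefinition) -> AlertDefinition:
    # Make the bullet more expensive: refuse alerts without a run-book link.
    if not defn.runbook_url.startswith(("http://", "https://")):
        raise MissingRunbookError(
            f"alert {defn.name!r} has no run-book URL; write one first"
        )
    return defn

create_alert(AlertDefinition(
    name="api.queue_failures",
    metric="api.queue_failures.count",
    threshold=100,
    runbook_url="https://github.com/example/runbooks/blob/master/api-queue.md",
))
```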

Page 23: Velocity NY 2014: Signal through the noise (with presenter notes)

I can personally attest that this works very well. As part of the process of building software at Librato, we write run books in Markdown, commit them to a GitHub repo, and link to them from our alerts. They include firefighting and design info, and links to things like

Page 24: Velocity NY 2014: Signal through the noise (with presenter notes)

visualizations that point back to the metrics that caused the notification

Page 25: Velocity NY 2014: Signal through the noise (with presenter notes)

THE CONTENT OF YOUR ALERTS MATTERS

So to start: with some policy enforcement, we can reduce the number of alerts we send. If you’re looking for a quick fix that can have a meaningful impact on alert quality, this is a good start, but really that’s just the tip of the iceberg.

Page 26: Velocity NY 2014: Signal through the noise (with presenter notes)

What did he just say?

•Notifications are expensive, they hurt people and productivity

•Make people work harder to send them by requiring run books

•Run books add context to alerts. Other types of context are awesome too

•Like graphs

I hope you don’t mind, but I’ve taken the liberty of interspersing a few recap slides throughout this presentation for the benefit of the people who are going to see this on SlideShare and be all “was that a Tracy Chapman tree?”

Page 27: Velocity NY 2014: Signal through the noise (with presenter notes)

WHY do we Monitor?

So anyhow, run books are a good start, but how do we fix this? Well, let’s consider our motivation. What are we trying to accomplish that’s important enough to invade someone’s dreams and interrupt their workflow? Why do we monitor?

Page 28: Velocity NY 2014: Signal through the noise (with presenter notes)

And I think for most of us, it’s really simple. We have this thing we care about, just like any other engineering discipline where the thing

Page 29: Velocity NY 2014: Signal through the noise (with presenter notes)

Might be a complicated machine, like a satellite, or

Page 30: Velocity NY 2014: Signal through the noise (with presenter notes)

or maybe it’s organic like a human heart

Page 31: Velocity NY 2014: Signal through the noise (with presenter notes)

but whatever our thing is, it has to interact with the real world, and therefore we can’t fully control what happens to it.

Page 32: Velocity NY 2014: Signal through the noise (with presenter notes)

Telemetry Data

Command Signal

so we use engineering as a means of getting a steady stream of feedback from the thing we care about, so that we can be sure that it’s operating within the boundaries we think are healthy.

Page 33: Velocity NY 2014: Signal through the noise (with presenter notes)

and obviously, because systems are different, these characteristics will vary, and maybe we’ll use different techniques to collect that data.

Page 34: Velocity NY 2014: Signal through the noise (with presenter notes)

1. Identify Operational Limitations

Y < 160 bpm

X < 7m km/h

but in every case our underlying strategy is the same. First we consider the operational limitations of the thing we care about: that it should beat no more than 160 times a minute, or that its velocity should not exceed 7 million km/h.

Page 35: Velocity NY 2014: Signal through the noise (with presenter notes)

1. Identify Operational Limitations

2. Monitor Those Limitations

Y < 160 bpm

X < 7m km/h

and then we engineer the system to provide us the feedback we need to detect when those limits are reached.
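As an aside, this two-step strategy is easy to sketch in code. Below is a toy monitor, with `read_heart_rate` standing in for whatever telemetry your system actually emits; the 160 bpm limit is the one from the slide, and everything else is invented for illustration:

```python
import random
import time

LIMIT_BPM = 160  # step 1: the operational limitation we identified

def read_heart_rate() -> float:
    # Stand-in for a real telemetry source (hypothetical).
    return random.gauss(120, 25)

def monitor(samples: int = 10) -> None:
    # Step 2: watch the feedback stream for the limit being reached.
    for _ in range(samples):
        bpm = read_heart_rate()
        if bpm >= LIMIT_BPM:
            print(f"ALERT: heart rate {bpm:.0f} bpm exceeds limit {LIMIT_BPM}")
        time.sleep(0.1)

monitor()
```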

Page 36: Velocity NY 2014: Signal through the noise (with presenter notes)

So taking a look at this alert we got at 4 in the morning,

Page 37: Velocity NY 2014: Signal through the noise (with presenter notes)

A Balancer ?!

the first thing we’ve done wrong is misidentify the thing we care about. In web operations, grown-ups care about websites, not arbitrary individual balancers, yet somehow this balancer has become the focus of our monitoring.

Page 38: Velocity NY 2014: Signal through the noise (with presenter notes)

Balancer

>66% Host Availability

which brings me to problem 2, which is that we’ve chosen a silly metric for the thing we don’t care about

Page 39: Velocity NY 2014: Signal through the noise (with presenter notes)

Balancer

>66% Host Availability

We’re saying that if 34% of the eleventybillion ephemeral server instances behind this balancer go down, all productivity must cease. I mean, I suspect that this metric has been chosen for us,

Page 40: Velocity NY 2014: Signal through the noise (with presenter notes)

% IO per instance

because even if I’m running a balancer-as-a-service company and balancing is the thing I care about, host availability is a derpy metric, because

Page 41: Velocity NY 2014: Signal through the noise (with presenter notes)

%hosts alive VS % IO per instance

(Hint: one of these things measures balancing)

it doesn’t tell me how good a job I’m doing at balancing things. I’d rather know the ratio of I/O per back-end instance, because it tells me about the thing I care about.

Page 42: Velocity NY 2014: Signal through the noise (with presenter notes)

%hosts alive: does not measure balancing

% IO per instance: measures balancing

66 VS .2

%IO per host is a metric I can use in a threshold. I can say, if %IO per host goes above .2, then shit’s probably going sideways. So this was a horrible choice of both the thing we care about, and a metric for that thing.
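For the curious, the preferred metric is cheap to compute. A hedged sketch, with invented counter values, of how you might derive each back end’s share of I/O and alert on the .2 threshold from the slide:

```python
def io_share_per_host(io_counts: dict[str, int]) -> dict[str, float]:
    # Each host's fraction of total I/O; perfectly balanced back ends
    # would all sit near 1/N.
    total = sum(io_counts.values()) or 1
    return {host: count / total for host, count in io_counts.items()}

# Hypothetical request counters scraped from the balancer.
observed = {f"web-{n:02d}": 1000 for n in range(1, 10)}
observed["web-10"] = 4000  # one hot back end

THRESHOLD = 0.2  # the ".2" from the slide

for host, share in io_share_per_host(observed).items():
    if share > THRESHOLD:
        print(f"ALERT: {host} is handling {share:.0%} of I/O; balancing is off")
```

Unlike %hosts alive, this number actually moves when balancing degrades, which is the thing we care about.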

Page 43: Velocity NY 2014: Signal through the noise (with presenter notes)

Then, to top it off, instead of engineering the thing we don’t care about to provide feedback for our silly metric, we’ve tasked an external computer to check every few minutes or so and then

Page 44: Velocity NY 2014: Signal through the noise (with presenter notes)

and punch us in the face whenever our absurd metric crosses a threshold. As far as I can tell, this approach is unique to our field.

Page 45: Velocity NY 2014: Signal through the noise (with presenter notes)

If we were aerospace engineers, this would be like crafting a sketchy instrument panel that only updated every 5 minutes and paying some teenager to watch it for us.

Page 46: Velocity NY 2014: Signal through the noise (with presenter notes)

but aerospace engineers are doing something very different from what many of us do in IT today. They’re

Page 47: Velocity NY 2014: Signal through the noise (with presenter notes)

they’re reasoning about the limitations of the things they care about, and then engineering those things to provide telemetry feedback to their operators. Are we beginning to see a theme emerge?

Page 48: Velocity NY 2014: Signal through the noise (with presenter notes)

IT Monitoring != Feedback

So the fundamental observation I’m making here is this: the thing we do that we call monitoring

Page 49: Velocity NY 2014: Signal through the noise (with presenter notes)

IT Monitoring != Feedback

is not the same as what other engineers in other disciplines are doing when they talk about monitoring

Page 50: Velocity NY 2014: Signal through the noise (with presenter notes)

!= some silly balancer

And that’s the real problem, because this notion we have that monitoring is something distinct from our engineering pursuits, that building the things we care about goes to this team and monitoring things in general goes to that team, causes derp and stupid feedback loops.

Page 51: Velocity NY 2014: Signal through the noise (with presenter notes)

WE CAN REDUCE ALERTS BY IMPROVING OUR TELEMETRY SIGNAL

So I think, if we reintegrate monitoring with our day-to-day engineering activities, and we carefully choose metrics that give us the feedback we need, our alerting will improve along with our input signal.

Page 52: Velocity NY 2014: Signal through the noise (with presenter notes)

What did he just say?

•Monitoring isn't a thing. It’s just part of the engineering process

•We’re treating it like a thing that only some types of engineers might want to do, and that’s giving us broken feedback

•Aerospace engineers are rad, they don’t do that.

•Fix your monitoring and your alerts will follow

And I’ll pause there to give you a moment to reflect on what you’ve just heard, and remind you that it’s OK if you can’t read these recap slides, because they’re here for the SlideShare people who couldn’t afford to come here.

Page 53: Velocity NY 2014: Signal through the noise (with presenter notes)

If you think I’m dreaming with this whole reintegrate-monitoring thing, let me show you what a bad day looks like inside Librato. A couple months ago, this happened.

Page 54: Velocity NY 2014: Signal through the noise (with presenter notes)

We’re a chatops shop, so we got these two notifications from a chatbot in Campfire. The first one says our API is throwing an exception because it’s not able to queue a data point, and the second says that one of our RDS instances has too many persistent connections, so it’s telling new clients to go away.

Page 55: Velocity NY 2014: Signal through the noise (with presenter notes)

So as engineers when we see a problem like this we naturally envision the pieces of the thing we care about

Page 56: Velocity NY 2014: Signal through the noise (with presenter notes)

or at least the pieces of it that are relevant to the problem that we’re considering

Page 57: Velocity NY 2014: Signal through the noise (with presenter notes)

in this case we have an incoming blob of JSON that’s gotten lodged in the API,

Page 58: Velocity NY 2014: Signal through the noise (with presenter notes)

and we have an RDS that’s unresponsive. And we can’t help but ask ourselves: do these problems correlate? Does the API talk to RDS for anything?

Page 59: Velocity NY 2014: Signal through the noise (with presenter notes)

And it turns out, yes it does. Very early on, the API interacts with the RDS to do some name-to-UID translation, and because the RDS is not responding, the API is unable to queue this data point. And this intuitive leap that we just made,

Page 60: Velocity NY 2014: Signal through the noise (with presenter notes)

this imagining of the pieces and their interactions in the context of the problem, is pretty much what happens in the head of every engineer who knows this system and sees this problem. And our collective mental model for this problem inevitably implies a few follow-up questions.

Page 61: Velocity NY 2014: Signal through the noise (with presenter notes)

Is this problem affecting end-users?

Page 62: Velocity NY 2014: Signal through the noise (with presenter notes)

Is it backing up the SLB?

Page 63: Velocity NY 2014: Signal through the noise (with presenter notes)

or blocking the unicorns?

Page 64: Velocity NY 2014: Signal through the noise (with presenter notes)

Are the RDS read replicas also timing out? Because maybe we can point the API at a replica to keep this ship from sinking.

Page 65: Velocity NY 2014: Signal through the noise (with presenter notes)

Own YOUR problem

But at Librato, rather than reaching for ssh terminals to answer these questions, the next thing we see in chat

Page 66: Velocity NY 2014: Signal through the noise (with presenter notes)

Own YOUR problem

is that our Ops chief has put some graphs up for us in the war room.

Page 67: Velocity NY 2014: Signal through the noise (with presenter notes)

Some Graph in the War Room

Heading over there, we see various engineers have already begun to fill in these gaps for us, because…

Page 68: Velocity NY 2014: Signal through the noise (with presenter notes)

by default we share a common vision of the problem, so we also share the questions, and so we can answer them collaboratively by sharing telemetry data

Page 69: Velocity NY 2014: Signal through the noise (with presenter notes)

Yes, this is affecting end users, because we see a huge drop in HTTP 200s.

Page 70: Velocity NY 2014: Signal through the noise (with presenter notes)

Yes, this is affecting the SLB, because we see a huge latency spike.

Page 71: Velocity NY 2014: Signal through the noise (with presenter notes)

Yes, this is derping the unicorns, because we see the same latency in the API.

Page 72: Velocity NY 2014: Signal through the noise (with presenter notes)

And read replicas won’t save us, because this problem is affecting their ability to replicate.

Page 73: Velocity NY 2014: Signal through the noise (with presenter notes)

Some Graph in the War Room

So as the problem persists our engineers ask more questions,

Page 74: Velocity NY 2014: Signal through the noise (with presenter notes)

Some Graph in the War Room

each of which is answered by metric data, which begs additional questions, which get answered by metric data, in a feedback loop, until we reach a point where some meaningful action can take place. In this case, that action wound up being rolling back a deployment that introduced a derpy API query.

Page 75: Velocity NY 2014: Signal through the noise (with presenter notes)

WHAT YOU MONITOR MATTERS

At Librato, this is what problem solving looks like every single time. We get notifications of actual trouble, and we rely on the same telemetry data that triggered those notifications to solve our problem. We can do this because we know what we care about, and we’ve engineered those things to provide us feedback.

Page 76: Velocity NY 2014: Signal through the noise (with presenter notes)

And it’s so simple to pull off. When we build something like a worker

Page 77: Velocity NY 2014: Signal through the noise (with presenter notes)

that’s going to consume some input from another service

Page 78: Velocity NY 2014: Signal through the noise (with presenter notes)

and queue the rest, we inevitably wind up codifying a set of assumptions: operational boundaries that keep this worker happy.

Page 79: Velocity NY 2014: Signal through the noise (with presenter notes)

a } < x

C

And these might be things like: process A requires that service C responds with a 99th-percentile latency below X.

Page 80: Velocity NY 2014: Signal through the noise (with presenter notes)

} < x

b

kxa

Or that process B requires that the queue never exceeds K elements.

Page 81: Velocity NY 2014: Signal through the noise (with presenter notes)

[diagram: each component reports its x < k limit metrics upstream]

Whenever our process depends on assumptions like these, when the thing we care about is threatened by these limits, we build in instrumentation to measure and report them to a common upstream metrics repository. We don’t assign a team to commit to a tool that’s going to dictate or even influence our metric choices. Instead,

Page 82: Velocity NY 2014: Signal through the noise (with presenter notes)

EVERYBODY OWNS MONITORING

we empower and expect everyone to measure the things we care about, in the best possible way, and we make it easy to store the results together in the same place. Monitoring is literally the responsibility of everyone who builds a thing we care about, be it infrastructure or applications.
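To show how small “build the instrumentation in” can be, here’s a sketch of a worker reporting the two limits from the earlier slides (service-C latency, queue depth) to a shared upstream store. The transport is the plain StatsD UDP wire format; the endpoint, metric names, and functions are all illustrative, not Librato’s actual pipeline:

```python
import socket
import time

# Hypothetical shared metrics endpoint (StatsD-style UDP listener).
STATSD = ("127.0.0.1", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(name: str, value: float, kind: str) -> None:
    # StatsD wire format: <name>:<value>|<type>
    sock.sendto(f"{name}:{value}|{kind}".encode(), STATSD)

def call_service_c() -> None:
    start = time.monotonic()
    # ... the actual request to service C would go here ...
    elapsed_ms = (time.monotonic() - start) * 1000
    emit("worker.service_c.latency_ms", elapsed_ms, "ms")  # feeds the p99 < X check

def report_queue_depth(queue_len: int) -> None:
    emit("worker.queue.depth", queue_len, "g")  # feeds the "never exceeds K" check

call_service_c()
report_queue_depth(42)
```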

Page 83: Velocity NY 2014: Signal through the noise (with presenter notes)

As my CTO Joe is fond of saying, we have a show-me-the-graph culture. Whenever we make a hypothesis about production behavior, our metrics naturally become part of the conversation. No one points a finger at anything without having the data to back it up. In this way…

Page 84: Velocity NY 2014: Signal through the noise (with presenter notes)

Data guides our intuition, because we simply have no choice: it’s there, everybody knows how to use it, and there’s just no fooling it. Telemetry feedback is anathema to speculation; it confirms our assertions, discredits them utterly, or identifies metrics that we should be tracking in the things we care about.

Page 85: Velocity NY 2014: Signal through the noise (with presenter notes)

And I don’t think any of us would trade it for the world, because it’s like this ever-present rock we can stand on when we’re struggling to understand the behavior of the things we care about in the wild. Whether we’re dealing with issues, debugging regressions, or shipping features, it tells us, without a doubt, what’s happening to the things we care about.

Page 86: Velocity NY 2014: Signal through the noise (with presenter notes)

So our metrics are dear to us. We value and nurture the metrics we choose. We share them with each other in the form of prized insight. When we add new ones we do so deliberately and only because they teach us about the things we care about

Page 87: Velocity NY 2014: Signal through the noise (with presenter notes)

So for us, monitoring isn’t a separate activity anymore. It’s just part of correctly building the things we care about.

Page 88: Velocity NY 2014: Signal through the noise (with presenter notes)

We improve it, and iterate on it every day, alongside the services and infrastructure we build. Everyone does it, and we all benefit.


Page 90: Velocity NY 2014: Signal through the noise (with presenter notes)

Of course, we still measure things like CPU utilization, but our notifications trigger from the things we care about, when they hit limits that threaten them: things like service latency, queue sizes,

Page 91: Velocity NY 2014: Signal through the noise (with presenter notes)

and like in this example, the rate of data I/O between services

Page 92: Velocity NY 2014: Signal through the noise (with presenter notes)

Our notifications are delivered into group chat, where they do not jarringly interrupt anyone’s workflow, and where we can talk about them in a group context. This is a huge win because it eliminates redundant effort. If they aren’t critical, they’ll stay there until someone gets to them.

Page 93: Velocity NY 2014: Signal through the noise (with presenter notes)

If they’re critical, they’re escalated using a service that understands who is supposed to be notified per an agreed-upon on-call schedule…

Page 94: Velocity NY 2014: Signal through the noise (with presenter notes)

and we require them to be acknowledged. But…

Page 95: Velocity NY 2014: Signal through the noise (with presenter notes)

if they’re important enough to escalate, they include run-book URLs and links to graphs of the metric that triggered them.
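Pulling those last few slides together, the routing policy might look something like this sketch; `post_to_chat` and `page_oncall` are placeholders for your chat bot and your escalation service, not a real vendor API:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    critical: bool
    runbook_url: str = ""
    graph_url: str = ""

def post_to_chat(alert: Alert) -> None:
    # Non-interrupting by design: it waits in the room until someone gets to it.
    print(f"[chat] {alert.name}")

def page_oncall(alert: Alert) -> None:
    # Stand-in for a service that knows the on-call schedule and requires an ack.
    print(f"[pager] escalating {alert.name}; acknowledgement required")

def route(alert: Alert) -> None:
    post_to_chat(alert)  # every alert is visible in group chat
    if alert.critical:
        # Escalation demands context: a run book plus the triggering graph.
        if not (alert.runbook_url and alert.graph_url):
            raise ValueError(f"critical alert {alert.name!r} is missing links")
        page_oncall(alert)

route(Alert("rds.connections.maxed", critical=True,
            runbook_url="https://example.com/runbooks/rds.md",
            graph_url="https://example.com/graphs/rds-connections"))
```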

Page 96: Velocity NY 2014: Signal through the noise (with presenter notes)

What did he just say?

• Choose metrics that tell you about the things you care about.

•Alert when the things you care about hit limits you understand

•All alerts < critical go to chatrooms, ticket systems or dashboards

•Critical alerts use an automated escalation service that enforces on-call policy

•Escalated alerts require acknowledgement

•Escalated alerts require run-book URLs and/or links to graphs of the metric

And I’ll recap that, because it’s important stuff that even the people who unwisely blew through their conference budget, or who happen to live in Europe, deserve to see. But seriously, I do want to talk a little bit about the last bullet here:

Page 97: Velocity NY 2014: Signal through the noise (with presenter notes)

ALERT ON WHAT YOU SEE

we’re able to include a link to a live graph of the metric that triggered each notification, because we’re using the same input signal for both alerting and visualization.
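A toy illustration of that “one signal” idea, with an in-memory dict standing in for the shared metrics service: the alert check and the chart read the same stored series, so a notification can always link to the exact data behind it. All names and values here are invented:

```python
from collections import defaultdict

# One store, shared by graphs AND alert checks (a stand-in for a metrics service).
store: defaultdict[str, list[float]] = defaultdict(list)

def record(metric: str, value: float) -> None:
    store[metric].append(value)

def check_alert(metric: str, limit: float) -> bool:
    return store[metric][-1] > limit  # alerting reads the stored signal...

def render_sparkline(metric: str) -> str:
    return " ".join(f"{v:.0f}" for v in store[metric])  # ...and so does the graph

for v in (110.0, 130.0, 190.0):
    record("slb.latency_ms", v)

if check_alert("slb.latency_ms", 150):
    print("ALERT slb.latency_ms > 150 | graph:", render_sparkline("slb.latency_ms"))
```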

Page 98: Velocity NY 2014: Signal through the noise (with presenter notes)

Because if you use disparate signals for this, one system that polls and notifies and another that collects and graphs, you invite alerts that don’t correlate with your metric data. This is absolutely the best way to undermine the credibility of all of your monitoring tools.

Page 99: Velocity NY 2014: Signal through the noise (with presenter notes)

We wouldn’t design an EKG this way. We want one signal that’s as reliable as possible. Building a reliable monitoring system is hard, but it’s easier than making two monitoring systems agree with each other in every case. In the real world, you just can’t make disparate signals reliable; you either need to replace them, or combine them.

Page 100: Velocity NY 2014: Signal through the noise (with presenter notes)

So this situation, where my data doesn’t agree with my notification, should never happen. Your data should trigger your notifications.

Page 101: Velocity NY 2014: Signal through the noise (with presenter notes)

EVERYONE OWNS ALERTS (and dashboards)

Finally, one thing I couldn’t show you from our chatroom history is that our alerts are created and maintained by the people who receive them. So not only do we rely on everyone to choose and collect metrics about the things we care about,

Page 102: Velocity NY 2014: Signal through the noise (with presenter notes)

we’ve also empowered them to take ownership of the way their feedback is presented and used to trigger alerts. This makes sense, because the people who built the thing we care about are the best qualified to interpret the feedback from it.

Page 103: Velocity NY 2014: Signal through the noise (with presenter notes)

Every time we see red in a dashboard, or get paged in the middle of the night, those alerts should refer to data the recipient is familiar with. Every alert I get should document behavior that violates my notion of ‘healthy’ for the thing we care about.

Page 104: Velocity NY 2014: Signal through the noise (with presenter notes)

It should trigger in me a specific notion of how the thing I care about is threatened. If it doesn’t, if it just tells me some overly specific information about a disk somewhere that may or may not threaten the thing I care about, I should either delete the alert or fix it.

Page 105: Velocity NY 2014: Signal through the noise (with presenter notes)

So once you have a good telemetry stream, absolutely put the cannon in the hands of the alerting victims, because they will help you fix it. They’re incentivized to make it fire less, because they’re in its crosshairs.

Page 106: Velocity NY 2014: Signal through the noise (with presenter notes)

The Ultimate Recap

• Enforce a notification policy that requires context

• Make monitoring an engineering process

• Use the same signal for all metrics introspection and notification

• Encourage everyone to rely on telemetry data (graphs or it didn’t happen!)

• Everyone who collects a metric gets keys to dashboard and alert design

And we’ve reached the ultimate recap slide. I hope you’ve enjoyed our time together, and I’ll leave you with the following bits of advice: use policy to make alerts expensive, make monitoring part of every engineering process, don’t graph everything (choose metrics that teach you about the things you care about), create a show-me-the-data culture, alert on what you draw, and give the cannon to its targets.

Page 107: Velocity NY 2014: Signal through the noise (with presenter notes)

Questions? Office Hours: 1:15pm

and now for my absolute favorite part of public speaking, the question and answer portion. I invite you to come up to a microphone, where your ridicule and angry diatribe will be amplified for everyone to hear.