
Velocity NY 2014: Signal through the noise (with presenter notes)


DESCRIPTION

In recent years it’s become evident that alerting is one of the biggest challenges facing modern operations engineers. Conference talks, hallway tracks, meetups, etc. are rife with discussions about poor signal/noise in alerts, fatigue from false positives, and a general lack of actionability. Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics/alerting service) takes a fundamental approach and examines alerting requirements and practices in the abstract. We put forth a comprehensive abstract model with best practices that should be followed and implemented by your team regardless of your tool of choice. This talk is equal parts cultural and technical, encompassing both computational capabilities as well as social practices, like:

•Defining organizational policy about where and when to set alerts

•Ensuring the on-call engineer is armed with the proper information to take action

•Best practices for configuring an alert

•Fire-fighting after an alert has triggered

•Performing analysis across your organization-wide history of alerts


Page 1: Velocity NY 2014: Signal through the noise (with presenter notes)

hi.

Hi everybody

Page 2: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

How We Computer

I’m Dave Josephsen. I’m the developer evangelist at Librato, and I’ll also mention,

Page 3: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

How We Computer

since this is an O’Reilly conference, that I’m one of the co-authors of the O’Reilly book on Ganglia, which by the way has nothing to do with brain tumors.

Page 4: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

How We Computer

I’ve actually written several books, which is exciting because they all have squiggly things on the cover. So not only am I batting 100 in this regard, I am the world’s foremost expert; if you have a book project and plan to have a squiggly thing on the cover, you’ll find my contact info just there.

Page 5: Velocity NY 2014: Signal through the noise (with presenter notes)

[email protected]

@davejosephsen

github: djosephsen

Signal Through the Noise

I’m here today to talk to you about alerting, and how to make alerting better, mainly by improving the input signal that you emit to your notification system.

Page 6: Velocity NY 2014: Signal through the noise (with presenter notes)

and so our journey begins in the wee early morning hours of the predawn. You’re in bed, happily dreaming.

Page 7: Velocity NY 2014: Signal through the noise (with presenter notes)

And in your dream you come across a tree that is growing Tracy Chapmans, and you’re like, sweet, I love Tracy Chapman. But you’re

Page 8: Velocity NY 2014: Signal through the noise (with presenter notes)

a good person, and you don’t want to bogart all the Tracy Chapman, so you pick a small one for yourself. And you’re like, ohmygod, you ROCK, miniature Tracy Chapman.

Page 9: Velocity NY 2014: Signal through the noise (with presenter notes)

And miniature Tracy Chapman looks down at her guitar, smiles in that knowing way she does, and begins to sing.

Page 10: Velocity NY 2014: Signal through the noise (with presenter notes)

But instead of sultry blues, nothing but a horrible combination of buzzing and ringing comes out. Then she drops her guitar and starts smacking you in the face. But it’s not your dream face, it’s your real face, because you realize you’re dreaming, so you struggle to open your eyes

Page 11: Velocity NY 2014: Signal through the noise (with presenter notes)

and find that your cat is sitting on your neck, pummeling you repeatedly in the face, and your phone is making this horrible buzzing/ringing noise. Why is your phone going off in the middle of the night? Has something horrible happened? Is someone hurt? Is it the end of the world? So now you have a huge dump of adrenaline and fear, and you clumsily shoo the cat away, knocking your lamp off the table in the process. So, without the benefit of light, you grope blindly for your phone, ripping it from the charger, and barely manage to get it unlocked to stop the noise,

Page 12: Velocity NY 2014: Signal through the noise (with presenter notes)

WAT?

only to be presented with this. It takes you a few moments to understand what you’re seeing. It looks like a web balancer is complaining about some of its hosts being unresponsive. Cue the second adrenaline dump: what if you can’t figure out what’s going on? What if you can’t fix it? What if this is just the first of a deluge of alerts signaling the death of your entire infrastructure?

Page 13: Velocity NY 2014: Signal through the noise (with presenter notes)

WAT?

So now, conscious enough to locate and grab your laptop, you rip it open and check the various graphs you have at your disposal. Everything looks normal on the bandwidth graph, so you begin to ssh into the balancer to see if you can tell which hosts are actually down, when you get…

Page 14: Velocity NY 2014: Signal through the noise (with presenter notes)

AAAGHHHHH!!!

a recovery notification from your monitoring system. Yeah, sorry. False alarm. Are you kidding me?! 4:30 in the morning. Your heart is beating heavily in your chest. You are wired, angry, and wide awake, and your morning alarm will go off in 2.5 hours. You don’t have time to calm down and get any meaningful sleep, and your productivity will be measurably impaired for the rest of the day.

Page 15: Velocity NY 2014: Signal through the noise (with presenter notes)

ALERTS AREN’T FREE

So the first point I’d like to make is: alerts are actually really expensive. Not only do they hurt people and disrupt people’s lives, they’re a huge burden on productivity.

Page 16: Velocity NY 2014: Signal through the noise (with presenter notes)

Business Projects

IT Projects

Changes

Unplanned Work

If you’ve studied the Gene Kim school of DevOps, you know there are four types of work, and that one among them, unplanned work, is the most maligned.

Page 17: Velocity NY 2014: Signal through the noise (with presenter notes)

Unplanned Work

(eeew Comic Sans)

Unplanned work is basically toxic. It disrupts every other kind of work and causes good people undue grief. If it were a font, it would be Comic Sans. And so if you could take

Page 18: Velocity NY 2014: Signal through the noise (with presenter notes)

Unplanned Work

So if you could take unplanned work, and load it into a bullet

Page 19: Velocity NY 2014: Signal through the noise (with presenter notes)

Unplanned Work

and then load that bullet into a cannon, and shoot it at happy, and otherwise productive people

Page 20: Velocity NY 2014: Signal through the noise (with presenter notes)

Alerting

That’s basically what we’re doing with alerting. We’re packaging up unplanned work and launching it at people like a bunch of angsty Romans with a trebuchet.

Page 21: Velocity NY 2014: Signal through the noise (with presenter notes)

Tax the Ammunition

And we do it without giving it a second thought. Our tools make it cheap and easy to fire alerts, so maybe we should make the bullets more expensive.

Page 22: Velocity NY 2014: Signal through the noise (with presenter notes)

And this turns out to be a pretty good idea: we can reduce the amount of unplanned work by increasing the cost of individual alerts. All we need to do is require that every alert include a run-book URL.
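To make that policy concrete, here’s a minimal sketch of an alert-creation gate that refuses any alert lacking a run-book link. Everything in it (`AlertDefinition`, `create_alert`, the example metric and URL) is hypothetical, not any particular monitoring tool’s API:

```python
# Hypothetical sketch: enforce the "every alert ships with a run book" policy
# at alert-creation time.
from dataclasses import dataclass

@dataclass
class AlertDefinition:
    name: str
    metric: str
    threshold: float
    runbook_url: str = ""

class MissingRunbookError(ValueError):
    pass

def create_alert(defn: AlertDefinition) -> AlertDefinition:
    # Make the bullet more expensive: refuse alerts without a run-book link.
    if not defn.runbook_url.startswith(("http://", "https://")):
        raise MissingRunbookError(
            f"alert {defn.name!r} has no run-book URL; write one first"
        )
    return defn

create_alert(AlertDefinition(
    name="api.queue_failures",
    metric="api.queue_failures.count",
    threshold=100,
    runbook_url="https://github.com/example/runbooks/blob/master/api-queue.md",
))
```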

Page 23: Velocity NY 2014: Signal through the noise (with presenter notes)

I can personally attest that this works very well. As part of the process of building software at Librato, we write run books in Markdown, commit them to a GitHub repo, and link to them from our alerts. They include firefighting and design info, and links to things like

Page 24: Velocity NY 2014: Signal through the noise (with presenter notes)

visualizations that point back to the metrics that caused the notification

Page 25: Velocity NY 2014: Signal through the noise (with presenter notes)

THE CONTENT OF YOUR ALERTS MATTERS

So to start: with some policy enforcement, we can reduce the number of alerts we send. If you’re looking for a quick fix that can have a meaningful impact on alert quality, this is a good start, but really that’s just the tip of the iceberg.

Page 26: Velocity NY 2014: Signal through the noise (with presenter notes)

What did he just say?

•Notifications are expensive, they hurt people and productivity

•Make people work harder to send them by requiring run books

•Run books add context to alerts. Other types of context are awesome too

•Like graphs

I hope you don’t mind, but I’ve taken the liberty of interspersing a few recap slides throughout this presentation for the benefit of the people who are going to see this on SlideShare and be all “was that a Tracy Chapman tree?”

Page 27: Velocity NY 2014: Signal through the noise (with presenter notes)

WHY do we Monitor?

So anyhow, run books are a good start, but how do we fix this? Well, let’s consider our motivation. What are we trying to accomplish that’s important enough to invade someone’s dreams and interrupt their workflow? Why do we monitor?

Page 28: Velocity NY 2014: Signal through the noise (with presenter notes)

And I think for most of us, it’s really simple. We have this thing we care about, just like any other engineering discipline where the thing

Page 29: Velocity NY 2014: Signal through the noise (with presenter notes)

Might be a complicated machine, like a satellite, or

Page 30: Velocity NY 2014: Signal through the noise (with presenter notes)

or maybe it’s organic like a human heart

Page 31: Velocity NY 2014: Signal through the noise (with presenter notes)

but whatever our thing is, it has to interact with the real world, and therefore we can’t fully control what happens to it.

Page 32: Velocity NY 2014: Signal through the noise (with presenter notes)

Telemetry Data

Command Signal

so we use engineering as a means of getting a steady stream of feedback from the thing we care about, so that we can be sure that it’s operating within the boundaries we think are healthy.

Page 33: Velocity NY 2014: Signal through the noise (with presenter notes)

and obviously, because systems are different, these characteristics will vary, and maybe we’ll use different techniques to collect that data.

Page 34: Velocity NY 2014: Signal through the noise (with presenter notes)

1. Identify Operational Limitations

Y < 160 bpm

X < 7m km/h

but in every case our underlying strategy is the same. First we consider the operational limitations of the thing we care about: that it should beat no more than 160 times a minute, or that its velocity should not exceed 7 million km/h.

Page 35: Velocity NY 2014: Signal through the noise (with presenter notes)

1. Identify Operational Limitations

2. Monitor Those Limitations

Y < 160 bpm

X < 7m km/h

and then we engineer the system to provide us the feedback we need to detect when those limits are reached.
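As an aside, this two-step strategy is easy to sketch in code. Below is a toy monitor, with `read_heart_rate` standing in for whatever telemetry your system actually emits; the 160 bpm limit is the one from the slide, and everything else is invented for illustration:

```python
import random
import time

LIMIT_BPM = 160  # step 1: the operational limitation we identified

def read_heart_rate() -> float:
    # Stand-in for a real telemetry source (hypothetical).
    return random.gauss(120, 25)

def monitor(samples: int = 10) -> None:
    # Step 2: watch the feedback stream for the limit being reached.
    for _ in range(samples):
        bpm = read_heart_rate()
        if bpm >= LIMIT_BPM:
            print(f"ALERT: heart rate {bpm:.0f} bpm exceeds limit {LIMIT_BPM}")
        time.sleep(0.1)

monitor()
```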

Page 36: Velocity NY 2014: Signal through the noise (with presenter notes)

So taking a look at this alert we got at 4 in the morning,

Page 37: Velocity NY 2014: Signal through the noise (with presenter notes)

A Balancer ?!

the first thing we’ve done wrong is misidentify the thing we care about. In web operations, grown-ups care about websites, not arbitrary individual balancers, yet somehow this balancer has become the focus of our monitoring.

Page 38: Velocity NY 2014: Signal through the noise (with presenter notes)

Balancer

>66% Host Availability

which brings me to problem 2, which is that we’ve chosen a silly metric for the thing we don’t care about

Page 39: Velocity NY 2014: Signal through the noise (with presenter notes)

Balancer

>66% Host Availability

We’re saying that if 34% of the eleventybillion ephemeral server instances behind this balancer go down, all productivity must cease. I mean, I suspect that this metric has been chosen for us,

Page 40: Velocity NY 2014: Signal through the noise (with presenter notes)

% IO per instance

because even if I’m running a balancer-as-a-service company and balancing is the thing I care about, host availability is a derpy metric, because

Page 41: Velocity NY 2014: Signal through the noise (with presenter notes)

%hosts alive VS % IO per instance

(Hint: one of these things measures balancing)

it doesn’t tell me how good a job I’m doing at balancing things. I’d rather know the ratio of I/O per back-end instance, because it tells me about the thing I care about.

Page 42: Velocity NY 2014: Signal through the noise (with presenter notes)

%hosts alive: does not measure balancing

% IO per instance: measures balancing

66 VS .2

%IO per host is a metric I can use in a threshold. I can say, if %IO per host goes above .2, then shit’s probably going sideways. So this was a horrible choice of both the thing we care about, and a metric for that thing.
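For the curious, the preferred metric is cheap to compute. A hedged sketch, with invented counter values, of how you might derive each back end’s share of I/O and alert on the .2 threshold from the slide:

```python
def io_share_per_host(io_counts: dict[str, int]) -> dict[str, float]:
    # Each host's fraction of total I/O; perfectly balanced back ends
    # would all sit near 1/N.
    total = sum(io_counts.values()) or 1
    return {host: count / total for host, count in io_counts.items()}

# Hypothetical request counters scraped from the balancer.
observed = {f"web-{n:02d}": 1000 for n in range(1, 10)}
observed["web-10"] = 4000  # one hot back end

THRESHOLD = 0.2  # the ".2" from the slide

for host, share in io_share_per_host(observed).items():
    if share > THRESHOLD:
        print(f"ALERT: {host} is handling {share:.0%} of I/O; balancing is off")
```

Unlike %hosts alive, this number actually moves when balancing degrades, which is the thing we care about.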

Page 43: Velocity NY 2014: Signal through the noise (with presenter notes)

Then, to top it off, instead of engineering the thing we don’t care about to provide feedback for our silly metric, we’ve tasked an external computer to check every few minutes or so and then

Page 44: Velocity NY 2014: Signal through the noise (with presenter notes)

and punch us in the face whenever our absurd metric crosses a threshold. As far as I can tell, this approach is unique to our field.

Page 45: Velocity NY 2014: Signal through the noise (with presenter notes)

If we were aerospace engineers, this would be like crafting a sketchy instrument panel that only updated every 5 minutes and paying some teenager to watch it for us.

Page 46: Velocity NY 2014: Signal through the noise (with presenter notes)

but aerospace engineers are doing something very different from what many of us do in IT today. They’re

Page 47: Velocity NY 2014: Signal through the noise (with presenter notes)

they’re reasoning about the limitations of the things they care about, and then engineering those things to provide telemetry feedback to their operators. Are we beginning to see a theme emerge?

Page 48: Velocity NY 2014: Signal through the noise (with presenter notes)

IT Monitoring != Feedback

So the fundamental observation I’m making here is this: the thing we do that we call monitoring

Page 49: Velocity NY 2014: Signal through the noise (with presenter notes)

IT Monitoring != Feedback

is not the same as what other engineers in other disciplines are doing when they talk about monitoring

Page 50: Velocity NY 2014: Signal through the noise (with presenter notes)

!= some silly balancer

And that’s the real problem, because this notion we have that monitoring is something distinct from our engineering pursuits, that building the things we care about goes to this team and monitoring things in general goes to that team, causes derp and stupid feedback loops.

Page 51: Velocity NY 2014: Signal through the noise (with presenter notes)

WE CAN REDUCE ALERTS BY IMPROVING OUR TELEMETRY SIGNAL

So I think, if we reintegrate monitoring with our day-to-day engineering activities, and we carefully choose metrics that give us the feedback we need, our alerting will improve along with our input signal.

Page 52: Velocity NY 2014: Signal through the noise (with presenter notes)

What did he just say?

•Monitoring isn't a thing. It’s just part of the engineering process

•We’re treating it like a thing that only some types of engineers might want to do, and that’s giving us broken feedback

•Aerospace engineers are rad, they don’t do that.

•Fix your monitoring and your alerts will follow

And I’ll pause there to give you a moment to reflect on what you’ve just heard, and remind you that it’s OK if you can’t read these recap slides, because they’re here for the SlideShare people who couldn’t afford to come here.

Page 53: Velocity NY 2014: Signal through the noise (with presenter notes)

If you think I’m dreaming with this whole reintegrate-monitoring thing, let me show you what a bad day looks like inside Librato. A couple months ago, this happened.

Page 54: Velocity NY 2014: Signal through the noise (with presenter notes)

We’re a chatops shop, so we got these two notifications from a chatbot in Campfire. The first one says our API is throwing an exception because it’s not able to queue a data point, and the second says that one of our RDS instances has too many persistent connections, so it’s telling new clients to go away.

Page 55: Velocity NY 2014: Signal through the noise (with presenter notes)

So as engineers when we see a problem like this we naturally envision the pieces of the thing we care about

Page 56: Velocity NY 2014: Signal through the noise (with presenter notes)

or at least the pieces of it that are relevant to the problem that we’re considering

Page 57: Velocity NY 2014: Signal through the noise (with presenter notes)

in this case we have an incoming blob of JSON that’s gotten lodged in the API,

Page 58: Velocity NY 2014: Signal through the noise (with presenter notes)

and we have an RDS that’s unresponsive. And we can’t help but ask ourselves: do these problems correlate? Does the API talk to RDS for anything?

Page 59: Velocity NY 2014: Signal through the noise (with presenter notes)

And it turns out, yes it does. Very early on, the API interacts with the RDS to do some name-to-UID translation, and because the RDS is not responding, the API is unable to queue this data point. And this intuitive leap that we just made,

Page 60: Velocity NY 2014: Signal through the noise (with presenter notes)

this imagining of the pieces and their interactions in the context of the problem, is pretty much what happens in the head of every engineer who knows this system and sees this problem. And our collective mental model for this problem inevitably implies a few follow-up questions.

Page 61: Velocity NY 2014: Signal through the noise (with presenter notes)

Is this problem affecting end-users?

Page 62: Velocity NY 2014: Signal through the noise (with presenter notes)

Is it backing up the SLB?

Page 63: Velocity NY 2014: Signal through the noise (with presenter notes)

or blocking the unicorns?

Page 64: Velocity NY 2014: Signal through the noise (with presenter notes)

Are the RDS read replicas also timing out? Because maybe we can point the API at a replica to keep this ship from sinking.

Page 65: Velocity NY 2014: Signal through the noise (with presenter notes)

Own YOUR problem

But at Librato, rather than reaching for ssh terminals to answer these questions, the next thing we see in chat

Page 66: Velocity NY 2014: Signal through the noise (with presenter notes)

Own YOUR problem

is that our Ops chief has put some graphs up for us in the war room.

Page 67: Velocity NY 2014: Signal through the noise (with presenter notes)

Some Graph in the War Room

Heading over there, we see various engineers have already begun to fill in these gaps for us, because…

Page 68: Velocity NY 2014: Signal through the noise (with presenter notes)

by default we share a common vision of the problem, so we also share the questions, and so we can answer them collaboratively by sharing telemetry data

Page 69: Velocity NY 2014: Signal through the noise (with presenter notes)

Yes, this is affecting end users, because we see a huge drop in HTTP 200s.

Page 70: Velocity NY 2014: Signal through the noise (with presenter notes)

Yes, this is affecting the SLB, because we see a huge latency spike.

Page 71: Velocity NY 2014: Signal through the noise (with presenter notes)

Yes, this is derping the unicorns, because we see the same latency in the API.

Page 72: Velocity NY 2014: Signal through the noise (with presenter notes)

And read replicas won’t save us, because this problem is affecting their ability to replicate.

Page 73: Velocity NY 2014: Signal through the noise (with presenter notes)

Some Graph in the War Room

So as the problem persists our engineers ask more questions,

Page 74: Velocity NY 2014: Signal through the noise (with presenter notes)

Some Graph in the War Room

each of which is answered by metric data, which begs additional questions, which get answered by metric data, in a feedback loop, until we reach a point where some meaningful action can take place. In this case, that action wound up being rolling back a deployment that introduced a derpy API query.

Page 75: Velocity NY 2014: Signal through the noise (with presenter notes)

WHAT YOU MONITOR MATTERS

At Librato, this is what problem solving looks like every single time. We get notifications of actual trouble, and we rely on the same telemetry data that triggered those notifications to solve our problem. We can do this because we know what we care about, and we’ve engineered those things to provide us feedback.

Page 76: Velocity NY 2014: Signal through the noise (with presenter notes)

And it’s so simple to pull off. When we build something like a worker

Page 77: Velocity NY 2014: Signal through the noise (with presenter notes)

that’s going to consume some input from another service

Page 78: Velocity NY 2014: Signal through the noise (with presenter notes)

and queue the rest, we inevitably wind up codifying a set of assumptions: operational boundaries that keep this worker happy.

Page 79: Velocity NY 2014: Signal through the noise (with presenter notes)

a } < x

C

And these might be things like: process A requires that service C responds with a 99th-percentile latency below X.

Page 80: Velocity NY 2014: Signal through the noise (with presenter notes)

} < x

b

kxa

Or that process B requires that the queue never exceeds K elements.

Page 81: Velocity NY 2014: Signal through the noise (with presenter notes)

[diagram: each component reports its x < k limit metrics upstream]

Whenever our process depends on assumptions like these, when the thing we care about is threatened by these limits, we build in instrumentation to measure and report them to a common upstream metrics repository. We don’t assign a team to commit to a tool that’s going to dictate or even influence our metric choices. Instead,

Page 82: Velocity NY 2014: Signal through the noise (with presenter notes)

EVERYBODY OWNS MONITORING

we empower and expect everyone to measure the things we care about, in the best possible way, and we make it easy to store the results together in the same place. Monitoring is literally the responsibility of everyone who builds a thing we care about, be it infrastructure or applications.
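To show how small “build the instrumentation in” can be, here’s a sketch of a worker reporting the two limits from the earlier slides (service-C latency, queue depth) to a shared upstream store. The transport is the plain StatsD UDP wire format; the endpoint, metric names, and functions are all illustrative, not Librato’s actual pipeline:

```python
import socket
import time

# Hypothetical shared metrics endpoint (StatsD-style UDP listener).
STATSD = ("127.0.0.1", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(name: str, value: float, kind: str) -> None:
    # StatsD wire format: <name>:<value>|<type>
    sock.sendto(f"{name}:{value}|{kind}".encode(), STATSD)

def call_service_c() -> None:
    start = time.monotonic()
    # ... the actual request to service C would go here ...
    elapsed_ms = (time.monotonic() - start) * 1000
    emit("worker.service_c.latency_ms", elapsed_ms, "ms")  # feeds the p99 < X check

def report_queue_depth(queue_len: int) -> None:
    emit("worker.queue.depth", queue_len, "g")  # feeds the "never exceeds K" check

call_service_c()
report_queue_depth(42)
```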

Page 83: Velocity NY 2014: Signal through the noise (with presenter notes)

As my CTO Joe is fond of saying, we have a show-me-the-graph culture. Whenever we make a hypothesis about production behavior, our metrics naturally become part of the conversation. No one points a finger at anything without having the data to back it up. In this way…

Page 84: Velocity NY 2014: Signal through the noise (with presenter notes)

Data guides our intuition, because we simply have no choice: it’s there, everybody knows how to use it, and there’s just no fooling it. Telemetry feedback is anathema to speculation; it confirms our assertions, discredits them utterly, or identifies metrics that we should be tracking in the things we care about.

Page 85: Velocity NY 2014: Signal through the noise (with presenter notes)

And I don’t think any of us would trade it for the world, because it’s like this ever-present rock we can stand on when we’re struggling to understand the behavior of the things we care about in the wild. Whether we’re dealing with issues, debugging regressions, or shipping features, it tells us, without a doubt, what’s happening to the things we care about.

Page 86: Velocity NY 2014: Signal through the noise (with presenter notes)

So our metrics are dear to us. We value and nurture the metrics we choose. We share them with each other in the form of prized insight. When we add new ones we do so deliberately and only because they teach us about the things we care about

Page 87: Velocity NY 2014: Signal through the noise (with presenter notes)

So for us, monitoring isn’t a separate activity anymore. It’s just part of correctly building the things we care about.

Page 88: Velocity NY 2014: Signal through the noise (with presenter notes)

We improve it, and iterate on it every day, alongside the services and infrastructure we build. Everyone does it, and we all benefit.


Page 90: Velocity NY 2014: Signal through the noise (with presenter notes)

Of course, we still measure things like CPU utilization, but our notifications trigger from the things we care about, when they hit limits that threaten them: things like service latency, queue sizes,

Page 91: Velocity NY 2014: Signal through the noise (with presenter notes)

and like in this example, the rate of data I/O between services

Page 92: Velocity NY 2014: Signal through the noise (with presenter notes)

Our notifications are delivered into group chat, where they do not jarringly interrupt anyone’s workflow, and where we can talk about them in a group context. This is a huge win because it eliminates redundant effort. If they aren’t critical, they’ll stay there until someone gets to them.

Page 93: Velocity NY 2014: Signal through the noise (with presenter notes)

If they’re critical, they’re escalated using a service that understands who is supposed to be notified per an agreed-upon on-call schedule…

Page 94: Velocity NY 2014: Signal through the noise (with presenter notes)

and we require them to be acknowledged. But…

Page 95: Velocity NY 2014: Signal through the noise (with presenter notes)

if they’re important enough to escalate, they include run-book URLs and links to graphs of the metric that triggered them.
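Pulling those last few slides together, the routing policy might look something like this sketch; `post_to_chat` and `page_oncall` are placeholders for your chat bot and your escalation service, not a real vendor API:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    critical: bool
    runbook_url: str = ""
    graph_url: str = ""

def post_to_chat(alert: Alert) -> None:
    # Non-interrupting by design: it waits in the room until someone gets to it.
    print(f"[chat] {alert.name}")

def page_oncall(alert: Alert) -> None:
    # Stand-in for a service that knows the on-call schedule and requires an ack.
    print(f"[pager] escalating {alert.name}; acknowledgement required")

def route(alert: Alert) -> None:
    post_to_chat(alert)  # every alert is visible in group chat
    if alert.critical:
        # Escalation demands context: a run book plus the triggering graph.
        if not (alert.runbook_url and alert.graph_url):
            raise ValueError(f"critical alert {alert.name!r} is missing links")
        page_oncall(alert)

route(Alert("rds.connections.maxed", critical=True,
            runbook_url="https://example.com/runbooks/rds.md",
            graph_url="https://example.com/graphs/rds-connections"))
```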

Page 96: Velocity NY 2014: Signal through the noise (with presenter notes)

What did he just say?

• Choose metrics that tell you about the things you care about.

•Alert when the things you care about hit limits you understand

•All alerts < critical go to chatrooms, ticket systems or dashboards

•Critical alerts use an automated escalation service that enforces on-call policy

•Escalated alerts require acknowledgement

•Escalated alerts require run-book URLs and/or links to graphs of the metric

And I’ll recap that, because it’s important stuff that even the people who unwisely blew through their conference budget, or who happen to live in Europe, deserve to see. But seriously, I do want to talk a little bit about the last bullet here:

Page 97: Velocity NY 2014: Signal through the noise (with presenter notes)

ALERT ON WHAT YOU SEE

we’re able to include a link to a live graph of the metric that triggered each notification, because we’re using the same input signal for both alerting and visualization.
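A toy illustration of that “one signal” idea, with an in-memory dict standing in for the shared metrics service: the alert check and the chart read the same stored series, so a notification can always link to the exact data behind it. All names and values here are invented:

```python
from collections import defaultdict

# One store, shared by graphs AND alert checks (a stand-in for a metrics service).
store: defaultdict[str, list[float]] = defaultdict(list)

def record(metric: str, value: float) -> None:
    store[metric].append(value)

def check_alert(metric: str, limit: float) -> bool:
    return store[metric][-1] > limit  # alerting reads the stored signal...

def render_sparkline(metric: str) -> str:
    return " ".join(f"{v:.0f}" for v in store[metric])  # ...and so does the graph

for v in (110.0, 130.0, 190.0):
    record("slb.latency_ms", v)

if check_alert("slb.latency_ms", 150):
    print("ALERT slb.latency_ms > 150 | graph:", render_sparkline("slb.latency_ms"))
```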

Page 98: Velocity NY 2014: Signal through the noise (with presenter notes)

Because if you use disparate signals for this, one system that polls and notifies and another that collects and graphs, you invite alerts that don’t correlate with your metric data. This is absolutely the best way to undermine the credibility of all of your monitoring tools.

Page 99: Velocity NY 2014: Signal through the noise (with presenter notes)

We wouldn’t design an EKG this way. We want one signal that’s as reliable as possible. Building a reliable monitoring system is hard, but it’s easier than making two monitoring systems agree with each other in every case. In the real world, you just can’t make disparate signals reliable; you either need to replace them, or combine them.

Page 100: Velocity NY 2014: Signal through the noise (with presenter notes)

So this situation, where my data doesn’t agree with my notification, should never happen. Your data should trigger your notifications.

Page 101: Velocity NY 2014: Signal through the noise (with presenter notes)

EVERYONE OWNS ALERTS (and dashboards)

Finally, one thing I couldn’t show you from our chatroom history is that our alerts are created and maintained by the people who receive them. So not only do we rely on everyone to choose and collect metrics about the things we care about,

Page 102: Velocity NY 2014: Signal through the noise (with presenter notes)

we’ve also empowered them to take ownership of the way their feedback is presented and used to trigger alerts. This makes sense, because the people who built the thing we care about are the best qualified to interpret the feedback from it.

Page 103: Velocity NY 2014: Signal through the noise (with presenter notes)

Every time we see red in a dashboard, or get paged in the middle of the night, those alerts should refer to data the recipient is familiar with. Every alert I get should document behavior that violates my notion of ‘healthy’ for the thing we care about.

Page 104: Velocity NY 2014: Signal through the noise (with presenter notes)

It should trigger in me a specific notion of how the thing I care about is threatened. If it doesn’t, if it just tells me some overly specific information about a disk somewhere that may or may not threaten the thing I care about, I should either delete the alert or fix it.

Page 105: Velocity NY 2014: Signal through the noise (with presenter notes)

So once you have a good telemetry stream, absolutely put the cannon in the hands of the alerting victims, because they will help you fix it. They’re incentivized to make it fire less, because they’re in its crosshairs.

Page 106: Velocity NY 2014: Signal through the noise (with presenter notes)

The Ultimate Recap

• Enforce a notification policy that requires context

• Make monitoring an engineering process

• Use the same signal for all metrics introspection and notification

• Encourage everyone to rely on telemetry data (graphs or it didn’t happen!)

• Everyone who collects a metric gets keys to dashboard and alert design

And we’ve reached the ultimate recap slide. I hope you’ve enjoyed our time together, and I’ll leave you with the following bits of advice: use policy to make alerts expensive, make monitoring part of every engineering process, don’t graph everything (choose metrics that teach you about the things you care about), create a show-me-the-data culture, alert on what you draw, and give the cannon to its targets.

Page 107: Velocity NY 2014: Signal through the noise (with presenter notes)

Questions? Office Hours: 1:15pm

and now for my absolute favorite part of public speaking, the question and answer portion. I invite you to come up to a microphone, where your ridicule and angry diatribe will be amplified for everyone to hear.