Operational Insight June 15, 2015 Roy Rapoport @royrapoport / linkedin.com/in/royrapoport / [email protected]

Operational Insight: Concepts and Examples

Page 1: Operational Insight: Concepts and Examples

Operational Insight

June 15, 2015

Roy Rapoport

@royrapoport / linkedin.com/in/royrapoport / [email protected]

Page 2: Operational Insight: Concepts and Examples

Oh, The Places We’ll Go!

Today, I want to propose a general framework for how to think about operational insight products and features. I’m hoping that this framework is applicable to anyone who performs operations in production. After I propose thinking about operational insight this way, I’ll demonstrate some applications of it within our own operational environments at Netflix.

Page 3: Operational Insight: Concepts and Examples

The template we were supposed to use had me start with a slide with the speaker bio, but I want to start with something more relevant and interesting to you: The Korean War, and specifically dogfights during the war.

Page 4: Operational Insight: Concepts and Examples

John Boyd

John Boyd was an Air Force pilot at the time. He studied dogfights and concluded that every fighter pilot went through the same four steps:

Page 11: Operational Insight: Concepts and Examples

OODA

Observe

Orient

Decide

Act

“This approach favors agility over raw power in dealing with human opponents in any endeavor” - Wikipedia

Observe your environment, orient (figure out what it means), decide what to do, execute that decision, and go back to observing. The pilot who did that faster than their opponent got to go home. OODA’s been used as a general framework for dealing with conflict against other humans, but I’d like to suggest it has much broader applicability.

Page 12: Operational Insight: Concepts and Examples

This Is What We Do

Because even when not dealing with human opponents, anyone dealing with any aspect of operations — dealing with availability events, making decisions about promoting software in production, or … well, making decisions in general — does this all. the. time.

Page 13: Operational Insight: Concepts and Examples

For example, this pair of graphs represents the two KPIs that tell us whether we have a serious, high-level problem. The top one is the rate of calls into our customer service group; the second one is the rate at which people are actually streaming. Both cover the last seven days. When these dip …

Page 14: Operational Insight: Concepts and Examples

Like here, for example.

Page 15: Operational Insight: Concepts and Examples

We know we have a problem. We don’t exactly know what’s causing it, or what we’ll do to fix it. We’ll need to understand more about the problem to come to a decision, and then execute on that decision — OODA.

Page 19: Operational Insight: Concepts and Examples

OODA KPI

Speed Effort Reliability

So if we do OODA all the time, how can we think about doing OODA better or worse? There are three facets we can look at: How fast we execute the loop, how much effort it takes to execute the loop, and how reliably we execute the loop (make the right decision, execute it well).

Page 20: Operational Insight: Concepts and Examples

Winning

Speed Effort Reliability

So if we were trying to optimize the loop, we’d try to raise speed, lower effort, and raise reliability. Speed and reliability are directly relevant to the consistency of your product’s delivery; effort is more relevant to whether your people are doing high-value work, and whether they’re likely to continue to be happy working for you.

Page 29: Operational Insight: Concepts and Examples

Implications … for Observation (aka measurement, telemetry, metrics)

• Make It Easy
• Make It Scalable
• Make it pluggable
• (Eventually) Ruthlessly Cull

“What decision will this help me make?”

Page 30: Operational Insight: Concepts and Examples

A Joke

I’d like to tell a very, very long joke. It started at Velocity 2011, when I heard someone at a presentation say, “Monitor all the things, because you never know what you might find useful one of these days.”

Page 31: Operational Insight: Concepts and Examples

This is a graph representing about 380K datapoints, collected once every five minutes since June 2011. It’s a bit mysterious, I know.

Page 32: Operational Insight: Concepts and Examples

It may help you to see the lower and upper bounds of this graph are 48 to 52.

Page 33: Operational Insight: Concepts and Examples

% of servers in major region with an even IP address

This graph represents the percent of our cloud instances in a given production region that have an even IP address. We can, and should (and I hope we do), laugh about this graph, but I’d bet you your monitoring system is chock full of similarly useless data; I know mine is. It increases the cost of the system, but it also literally makes your job harder (and your customers’ jobs, if you’re responsible for the telemetry system), because there’s much, much more chaff to wade through.

Page 39: Operational Insight: Concepts and Examples

Implications … for Orientation (aka graphing, visualization)

• First-class product
• Different decisions require different viz
• Low cognitive load better than

• High refresh rates
• Deep data density

Page 40: Operational Insight: Concepts and Examples

Better Like This …

Page 41: Operational Insight: Concepts and Examples

Or Better Like That …

Page 47: Operational Insight: Concepts and Examples

Implications … for Decisions (aka alerting, real-time analytics, etc)

• You already have (some of) this
• Incremental improvement
• Sky’s the limit

• For benefits
• For cost

Alerts are a basic, primitive decision. Build on that.
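
To make "build on that" concrete, here is a minimal sketch of an alert as a decision, followed by one incremental improvement. The function names, the week-over-week comparison, and the 20% default are assumptions chosen for illustration; nothing here is taken from the talk's actual alerting system.

```python
def threshold_alert(value, floor):
    """The primitive decision: fire when the metric falls below a fixed floor."""
    return value < floor


def week_over_week_alert(value, same_time_last_week, max_drop=0.2):
    """One incremental step up: judge the metric against its own normal level.

    Fires when the value has dropped by more than max_drop (a fraction)
    compared with the same time last week. Both the comparison window and
    the 20% default are assumptions for illustration.
    """
    if same_time_last_week <= 0:
        return False  # no usable reference point, so make no decision
    drop = (same_time_last_week - value) / same_time_last_week
    return drop > max_drop
```

For a metric with a strong daily cycle, a fixed floor that is sensible at peak is too high at the trough; comparing against the same time last week is one way to let the same decision hold across the day.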

Page 54: Operational Insight: Concepts and Examples

Implications … for Action

1. Humans beat bureaucracy
2. Machines beat humans
3. Repeatability beats one-offs
4. Start with humans
5. If IFTTT, deprecate humans

Repeatable machine processes TROUNCE one-off human bureaucracy

If you’re thinking of creating a runbook, AUTOMATE IT.

Page 55: Operational Insight: Concepts and Examples

Decision: Do I Have Enough Instances?

So let’s talk about a basic capacity quandary: Do I have enough instances in my cluster?

Page 56: Operational Insight: Concepts and Examples

I showed this graph earlier. Our work volume is highly diurnal. We could, if we wanted, make our clusters big enough to support peak workload and just accept the waste when the workload decreases. Instead, we use Amazon’s auto-scaling group feature to automatically scale the clusters up and down in response to demand. So instead of trying to give users better telemetry on utilization, and making it easier for them to see whether they need to increase capacity, we just automate that decision (allowing them, of course, to override it whenever they want to).
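
The shape of that automated decision can be sketched in a few lines. This is an illustration only: it does not call any Amazon API, and the proportional rule, the 60% target, the size limits, and the name `desired_instances` are all assumptions rather than the policy described in the talk.

```python
import math


def desired_instances(current, avg_cpu_percent, target_cpu_percent=60,
                      minimum=2, maximum=100, override=None):
    """Pick a cluster size that brings average CPU back toward the target.

    current: number of instances running now
    avg_cpu_percent: average CPU utilization across the cluster (0-100)
    override: a size chosen by a human, which always wins
    """
    if override is not None:
        return override
    # Proportional rule: if utilization is 1.5x the target, run 1.5x the instances.
    wanted = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    # Never scale below the floor or above the ceiling.
    return max(minimum, min(maximum, wanted))
```

The `override` parameter mirrors the point in the notes: the machine makes the routine decision, and a person can still step in whenever they want to.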

Page 58: Operational Insight: Concepts and Examples

Decision: Is My Canary Good?

We use a deployment pattern called canary, where we compare the new version of the software to the baseline, also in production, and seek to answer a very simple question: Is our canary at least as good as our baseline system?
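
One simple way to turn "at least as good as the baseline" into a yes/no answer is sketched below. It is a toy comparison under stated assumptions, not the analysis Netflix runs: every metric is assumed to be lower-is-better, the 10% tolerance and 0.9 passing score are invented defaults, and the function names are hypothetical.

```python
from statistics import mean


def canary_score(baseline, canary, tolerance=0.10):
    """Return the fraction of metrics on which the canary is acceptable.

    baseline and canary map a metric name to a list of samples, where
    lower values are better (error counts, latency, CPU). A metric passes
    when the canary's mean is no more than `tolerance` above the baseline's.
    """
    passed = 0
    for name, base_samples in baseline.items():
        base = mean(base_samples)
        cand = mean(canary[name])
        if cand <= base * (1 + tolerance):
            passed += 1
    return passed / len(baseline)


def canary_is_good(baseline, canary, min_score=0.9):
    """The decision itself: promote only if enough metrics pass."""
    return canary_score(baseline, canary) >= min_score
```

Reducing the comparison to a single score is what lets the same question be asked the same way on every deployment.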

Page 62: Operational Insight: Concepts and Examples

Been there. Done that. Manually. Artisanally.

• Started in the Data Center
• Manual, dashboard-driven

Page 63: Operational Insight: Concepts and Examples

Been there. Done that. Manually.

Graph labels: CPU, Requests, Errors

Page 69: Operational Insight: Concepts and Examples

Been there. Done that. Manually.

• Context vs Precision
• No …
• Repeatability
• Trending
• Manual effort is manual

Page 74: Operational Insight: Concepts and Examples

So Now What?

• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters

Page 75: Operational Insight: Concepts and Examples

Diagram (built up across pages 75 to 82). Labeled components: Version Control System, Build & Deployment System, Automated Canary Analysis, Pretty Pictures, Customers.

Server groups shown as the diagram builds:

• Page 75: 1000 servers @ 1.0.1
• Page 76: 1000 servers @ 1.0.1, 1 server @ 1.0.2
• Page 77: 1000 servers @ 1.0.1, 10 servers @ 1.0.2
• Pages 78 and 79: 1000 servers @ 1.0.1, 1000 servers @ 1.0.2
• Page 80: 1000 servers @ 1.0.2
• Pages 81 and 82: 1000 servers @ 1.0.1, 1000 servers @ 1.0.2

Page 85: Operational Insight: Concepts and Examples

Just The Stats: 4-Week View

6309 canary analysis cycles
16% canaries failed

Page 86: Operational Insight: Concepts and Examples

Decision: Do I Have an Outlier?

Page 87: Operational Insight: Concepts and Examples

Outlier Detection

In an environment where you have a bunch of potentially-undifferentiated resources that should all behave approximately similarly, it becomes easy — and necessary, in a sufficiently large ecosystem — to notice outliers. If your cost for culling the outliers is low, you can also do it automatically. If not, you can at least alert that One Of These Things Is No Longer Like The Others.
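
A minimal sketch of that idea follows, using a median-based test (the modified z-score) to flag servers whose latest value sits far from the rest. The choice of technique, the 3.5 cutoff, and the name `find_outliers` are assumptions for illustration; the talk does not say which method its analytics system uses.

```python
from statistics import median


def find_outliers(values_by_server, threshold=3.5):
    """Return the names of servers whose value is far from the group's median.

    values_by_server maps a server name to one number, for example its
    load average or error rate over the last minute.
    """
    values = list(values_by_server.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that one bad server cannot skew.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half the servers agree exactly, so any other value stands out.
        return [name for name, v in values_by_server.items() if v != center]
    return [name for name, v in values_by_server.items()
            if 0.6745 * abs(v - center) / mad > threshold]
```

With nine made-up servers where eight report values near 1.0 and one reports 9.0, the function returns only the server at 9.0. Whether to terminate that server automatically or just alert depends on the cost of culling, as the notes above describe.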

Page 88: Operational Insight: Concepts and Examples

Would You Like to Play a Game?

Can I have a volunteer from the audience to run an experiment with me?

Page 89: Operational Insight: Concepts and Examples

Spot the Outlier

So for training, imagine I’m giving you this information about nine servers, named A through I. Each row is a minute’s data for these servers — let’s say it’s load average, or error rates. I’m going to ask you to point out the server — or column — that looks materially different from the others. This should be a relatively easy case, of course. Can you pick the server?

Page 90: Operational Insight: Concepts and Examples

OK. Now, I’m going to time you doing the same with more interesting data.

Didn’t work so well? OK, let’s make it easier to orient and understand the numbers you’re looking at by showing this to you graphed.

Page 92: Operational Insight: Concepts and Examples

It probably is easier, isn’t it? Can you easily point out the outlier?

OK, one last test. At the next slide, I’m going to show you some information (you can assume it’s true) and I want you to tell me which is the outlier, OK?

Page 93: Operational Insight: Concepts and Examples

The Outlier Is “A”

That was … much easier, wasn’t it?

This is what happens when we let computers do this work. We could have spent more time and effort on a more powerful visualization that made it easier for you to notice the outlier. Instead, we built an analytics system that determines outliers automatically: it doesn’t make the work easier for you, it does the work for you.

Page 95: Operational Insight: Concepts and Examples

Just The Stats: 4-Week View

739 Server Terminations

We can use this for anything: pieces of content, or devices, or ISPs. Right now, we’re using it for about ten or so clusters of servers, and in the last four weeks we have automatically identified, and terminated, 739 outliers.

Page 101: Operational Insight: Concepts and Examples

In a Nutshell

Observe
Orient
Decide
Act

Need This First: http://bit.ly/nflx-atlas-2013 and http://metrics20.org
Understand the decision: http://bit.ly/nflx-qcon-aca-2014
Make it easier for humans
Make machines do it

Higher speed
Lower effort
Higher reliability

So really, that’s all there is: Start with observation — you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.

Page 103: Operational Insight: Concepts and Examples

Questions, Attributions, Feedback

@royrapoport / linkedin.com/in/royrapoport / [email protected]