Actionable Metrics: Enabling Decision-Making in Netflix’s Decentralized Environment. #lspe, June 27, 2012. Roy Rapoport, @royrapoport, rsr@netflix.com

LSPE Presentation: Actionable Metrics at Netflix


DESCRIPTION

2012-06-27 presentation to the Large Scale Production Engineering meetup on Netflix's approach to metrics


Page 1: LSPE Presentation: Actionable Metrics at Netflix

Actionable Metrics

Enabling Decision-Making in Netflix’s Decentralized Environment

#lspe June 27, 2012

Roy Rapoport

@royrapoport, rsr@netflix.com

1

Page 2: LSPE Presentation: Actionable Metrics at Netflix

Me

• Been in tech for about 20 years

• Systems engineering, networking, software development, QA, release management

• Time at Netflix: 1094 days (3y-2d)

• (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, ... )

2

Page 3: LSPE Presentation: Actionable Metrics at Netflix

Metrics Humor

3

I want to start with a joke. This may be the world’s longest joke, given that I started telling it about a year ago. I had just attended a presentation at Velocity 2011 where someone said “collect all the metrics you possibly can, because you don’t know what will prove useful.”

Exposing the timeline, see that I’ve got more than a year’s information here.

And that the numbers are actually constrained within a very small range.

And that I’m showing the percent of our instances in our production environment that have even -- that is, divisible by two -- public IP addresses.

Page 5: LSPE Presentation: Actionable Metrics at Netflix

Metrics Humor

% of instances with even public IP addresses

3

Page 6: LSPE Presentation: Actionable Metrics at Netflix

Technology Overview

4

Going into the cloud, we went from heavy multi-purpose stacks to a highly distributed SOA environment.

Tons of different services, dynamic binding and communication (no ESB)

Page 7: LSPE Presentation: Actionable Metrics at Netflix

Technology Overview

• SoA, REST, Mostly Java

4

Page 8: LSPE Presentation: Actionable Metrics at Netflix

Technology Overview

• SoA, REST, Mostly Java

• Simple overall architecture:

4

Page 10: LSPE Presentation: Actionable Metrics at Netflix

Culture Overview

5

Freedom and Responsibility means hiring smart people and assuming they’ll make smart decisions. We want to empower them to do whatever the heck they think they need to be doing to make the business succeed, which also means giving them all the tools THEY say they need to be successful. We bias for innovation speed rather than safety, and we try to create highly aligned, but very loosely coupled, teams.

Distributed ops means no NOC and no staring at dashboards. CORE is cross-functional and builds tools.

Bottom line, it’s about getting out of the way of developers. We give them the tools they need, and the infrastructure pieces to make building whatever they want to build easier, and then stand back and let them create whatever they want -- even and especially when what they want to build is, for example ...

A small moon.

Page 11: LSPE Presentation: Actionable Metrics at Netflix

Culture Overview

• Freedom and Responsibility

5

Page 12: LSPE Presentation: Actionable Metrics at Netflix

Culture Overview

• Freedom and Responsibility

• Distributed Operations

5

Page 13: LSPE Presentation: Actionable Metrics at Netflix

Culture Overview

• Freedom and Responsibility

• Distributed Operations

• Get out of the way of Developers

5

Page 15: LSPE Presentation: Actionable Metrics at Netflix

The Metric Lifecycle

6

Page 16: LSPE Presentation: Actionable Metrics at Netflix

The Metric Lifecycle

• Send

6

Page 17: LSPE Presentation: Actionable Metrics at Netflix

The Metric Lifecycle

• Send

• Look

6

Page 18: LSPE Presentation: Actionable Metrics at Netflix

The Metric Lifecycle

• Send

• Look

• Alert

6

Page 19: LSPE Presentation: Actionable Metrics at Netflix

Systems

• Flexible

• Scalable

• Self-Service

7

Developers own sending metrics

Developers specify what metrics to send

Smart aggregation (critical given churn potential)
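To make the aggregation point concrete, here is a minimal sketch of rolling per-instance samples up to the cluster level, so that instances coming and going do not create or orphan series. The function name, the tuple layout, and the instance ids are all hypothetical; the deck does not show how Netflix's aggregation actually works.

```python
from collections import defaultdict


def aggregate_by_cluster(samples):
    """Sum per-instance values into one value per (cluster, metric, timestamp).

    `samples` is an iterable of
    (timestamp, cluster, instance_id, metric, value) tuples.
    The instance id is deliberately dropped, so instance churn
    does not change which series exist.
    """
    totals = defaultdict(float)
    for timestamp, cluster, _instance_id, metric, value in samples:
        totals[(cluster, metric, timestamp)] += value
    return dict(totals)


# Hypothetical samples: two instances report at t=60, a new one at t=120.
samples = [
    (60, "core_tools", "i-aaa", "core_cag_api", 5),
    (60, "core_tools", "i-bbb", "core_cag_api", 7),
    (120, "core_tools", "i-ccc", "core_cag_api", 9),
]
by_cluster = aggregate_by_cluster(samples)
```

The cluster-level series stays continuous even though none of the instances present at t=60 is still reporting at t=120.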

Page 20: LSPE Presentation: Actionable Metrics at Netflix

Telemetry

Flexible, Scalable, Self-Service

8

On the fly definition of metrics

Very low barrier to entry

Java, Python, Perl (and if someone wanted to, they could -- pretty easily -- create a bash interface)

Page 21: LSPE Presentation: Actionable Metrics at Netflix

Telemetry

Flexible, Scalable, Self-Service

    import netflix.metrics as NM
    [...]
            self.nm = NM.Metrics("core_cag")
    [...]
        def api(self):
            self.nm.counter("api")
            [...]
            app_label = "application_%s" % application
            self.nm.counter(app_label)
    [...]

8

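The `netflix.metrics` module on the slide is internal, so the following is only a hypothetical in-memory stand-in that mimics the two calls shown (`Metrics(namespace)` and `.counter(name)`). It illustrates what "on the fly definition" might mean in practice: the first call with a new name creates the metric, with no registration step. Joining namespace and name with an underscore is an assumption, as are the application names.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for a metrics client."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def counter(self, name, amount=1):
        # No registration: an unseen name simply starts counting from zero.
        self.counts["%s_%s" % (self.namespace, name)] += amount


nm = Metrics("core_cag")
for application in ["search", "search", "billing"]:  # made-up callers
    nm.counter("api")                                # total request volume
    nm.counter("application_%s" % application)       # per-application volume
```

A real client would ship these counts to a collector rather than hold them in memory; the point here is only how little code the caller has to write.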

Page 22: LSPE Presentation: Actionable Metrics at Netflix

Visualization

Flexible, Scalable, Self-Service

9

GUI helper for creating graphs, URL-driven fetching

Three engines: highcharts, dygraphs, RRD

Flexible ‘today vs some other time’ capability
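One way to think about the "today vs some other time" comparison is as pairing each point in the current window with the point a fixed offset earlier. The sketch below assumes a series stored as a plain dict of epoch seconds to values; the function name and data are hypothetical and say nothing about how the actual visualization engines do it.

```python
def compare_with_offset(series, offset_seconds, start, end):
    """Pair each point in [start, end) with the point offset_seconds earlier.

    `series` maps epoch-second timestamps to values. Points with no
    counterpart in the earlier window are skipped.
    """
    pairs = []
    for ts in sorted(series):
        if start <= ts < end and (ts - offset_seconds) in series:
            pairs.append((ts, series[ts], series[ts - offset_seconds]))
    return pairs


DAY = 86400
# Made-up data: two points yesterday, three points today.
series = {0: 100, 60: 110, DAY: 150, DAY + 60: 90, DAY + 120: 95}
pairs = compare_with_offset(series, DAY, DAY, 2 * DAY)
```

Each tuple in `pairs` holds the timestamp, today's value, and the value one day earlier, which is enough to draw two overlaid lines.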

Page 23: LSPE Presentation: Actionable Metrics at Netflix

Alerting

Flexible, Scalable, Self-Service

10

For some patterns -- e.g. traffic -- your thresholds have to be dynamic. We played with Holt-Winters, but found that it was expensive and took too long to calibrate. We’ve found that Double Exponential Smoothing has worked really well for us.
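As a rough illustration of a dynamic threshold, here is a sketch of double exponential smoothing (Holt's linear trend method) producing one-step-ahead forecasts, plus a check that flags points too far from their forecast. The smoothing constants, the 25% tolerance, the seeding from the first two points, and the traffic numbers are all assumptions for the example, not the production configuration.

```python
def double_exponential_forecasts(values, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from double exponential smoothing.

    forecasts[i] is the prediction for values[i] made from values[:i].
    The first two entries are None because two points are needed to
    seed the level and the trend.
    """
    if len(values) < 2:
        return [None] * len(values)
    level = values[1]
    trend = values[1] - values[0]
    forecasts = [None, None]
    for observed in values[2:]:
        forecasts.append(level + trend)
        previous_level = level
        level = alpha * observed + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return forecasts


def out_of_band(values, tolerance=0.25, alpha=0.5, beta=0.5):
    """Indexes where a value deviates from its forecast by more than tolerance."""
    forecasts = double_exponential_forecasts(values, alpha, beta)
    flagged = []
    for i, (observed, predicted) in enumerate(zip(values, forecasts)):
        if predicted is None or predicted == 0:
            continue
        if abs(observed - predicted) / abs(predicted) > tolerance:
            flagged.append(i)
    return flagged


traffic = [100, 110, 120, 130, 140, 60, 160]  # made-up steady ramp with one dip
forecasts = double_exponential_forecasts(traffic)
flagged = out_of_band(traffic)
```

With this data the dip at index 5 is flagged, and so is the recovery at index 6, because the dip drags the forecast down. That kind of after-effect is one reason testing a rule against history is useful before turning it on.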

But once you start doing more interesting alerting configuration, it’s harder to know whether or not you’ve set your thresholds correctly. The ability to map your alert configuration onto historical metric values, to see when your alert WOULD have triggered, makes a huge difference in making your first attempt at rational thresholds likely to succeed.
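Historical testing can be sketched as replaying a rule over stored points and collecting the timestamps at which it would have fired. Everything below (function names, the rule signature, the thresholds, the sample history) is hypothetical and only meant to show the shape of the idea.

```python
def backtest(history, rule):
    """Replay `rule` over history; return timestamps where it would have fired.

    `history` is a list of (timestamp, value) pairs in time order.
    `rule` takes (earlier_values, current_value) and returns True to alert.
    """
    fired = []
    for i, (timestamp, value) in enumerate(history):
        earlier = [v for _, v in history[:i]]
        if rule(earlier, value):
            fired.append(timestamp)
    return fired


def static_rule(_earlier, value, ceiling=500):
    """Static threshold: alert whenever the value exceeds a fixed ceiling."""
    return value > ceiling


def relative_rule(earlier, value, window=3, factor=2.0):
    """Alert when value exceeds `factor` times the mean of the last `window` points."""
    recent = earlier[-window:]
    if len(recent) < window:
        return False
    return value > factor * (sum(recent) / len(recent))


# Made-up history with a ramp into a spike.
history = [(0, 100), (60, 120), (120, 110), (180, 300), (240, 600), (300, 130)]
static_hits = backtest(history, static_rule)
relative_hits = backtest(history, relative_rule)
```

On this history the static rule fires only at the peak, while the relative rule also fires one interval earlier, on the way up. Seeing that difference before deploying either rule is the whole point of the exercise.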

Page 24: LSPE Presentation: Actionable Metrics at Netflix

Alerting

Flexible, Scalable, Self-Service

• Static vs Dynamic Thresholds

10

Page 25: LSPE Presentation: Actionable Metrics at Netflix

Alerting

Flexible, Scalable, Self-Service

• Static vs Dynamic Thresholds

• Historical Testing

10

Page 26: LSPE Presentation: Actionable Metrics at Netflix

For Example ...

What the ...

Last 3 hours’ core_tools.core_cag_api

11

core_tools.core_cag_api is the alert volume through our Central Alerting Gateway (CAG). I went to look for this graph for this presentation and noticed we had dropped our volume significantly over the last 20 minutes (the relatively flat last part of this graph). So I expanded the time range to the last few days ...

Page 27: LSPE Presentation: Actionable Metrics at Netflix

For Example ...

Visualization (Continued)

Last 4 days’ core_tools.core_cag_api

even more questions!

12

Which just raised more questions -- like what happened with the drop in alerts on Monday, at 11AM and 11PM?

So let’s expand the range further back -- for the last two weeks ...

Page 28: LSPE Presentation: Actionable Metrics at Netflix

For Example ...

Visualization (Continued)

Last 10 days’ core_tools.core_cag_api

What caused the spike?

13

OK, so it looks like we had a spike in alerts starting about 10 days ago, and the drops on Monday were just a return to normal volume. But what caused the spike in the first place?

The good news is that since we send metrics not just for alerts but for alerts per application (see the earlier code example), we could see alert volume per application...
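The per-application breakdown can also be done numerically rather than by eye. Here is a small sketch that ranks per-application counters by how much they grew between a baseline window and the current window. The counter names and numbers are invented for the example; only the `application_%s` naming pattern comes from the code slide.

```python
def biggest_movers(baseline, current, top=3):
    """Rank metric names by increase between two {name: count} snapshots."""
    names = set(baseline) | set(current)
    deltas = {
        name: current.get(name, 0) - baseline.get(name, 0) for name in names
    }
    # Largest increase first; ties broken by name for a stable order.
    return sorted(deltas.items(), key=lambda item: (-item[1], item[0]))[:top]


# Made-up alert counts per application for a quiet week and a noisy week.
baseline = {
    "application_search": 40,
    "application_billing": 35,
    "application_player": 50,
}
current = {
    "application_search": 42,
    "application_billing": 400,
    "application_player": 48,
}
movers = biggest_movers(baseline, current, top=1)
```

This only works because the per-application counters were being sent all along; the total alone would show the spike but not its source.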

Page 29: LSPE Presentation: Actionable Metrics at Netflix

For Example ...

Visualization (Continued)

Show alert volume per application

Someone had a rough few days...

14

The purple line is alerts for one of our applications -- which clearly had had a pretty rough few days.

Now that I had the answers, let’s make sure we alert on alert volume ...

Page 30: LSPE Presentation: Actionable Metrics at Netflix

Don’t Like Surprises...

    {
      "alerts": [
        {
          "applyTo": "cluster",
          "condition": {
            "minPercent": 90.0,
            "noise" : .2,
            "maxPercent": 25.0,
            "type": "DoubleExponential"
          },
          // here’s my number
          overrides : {
            'api_key' : '93528d3baa599b727097d73cfdbd5934'
          }
          "metricName": "core_cag_api", // so call me maybe
          "severity": "major"
        }
      ],
      "clusters": [ "core_tools" ]
    }

15

Page 31: LSPE Presentation: Actionable Metrics at Netflix

I Didn’t Mention

• End-to-end testing and alerting

• External availability and performance

• Events

• Open Connect

• Jobs

16

Things I didn’t talk about:

* end-to-end testing and visibility into transactions through our system;

* making sure our site’s available to the world;

* how we’ve promoted events into first-class monitored objects in our environment;

* monitoring our new CDN, the Open Connect platform;

* the jobs we have open, at http://jobs.netflix.com

Page 32: LSPE Presentation: Actionable Metrics at Netflix

I Can Haz Question?

(Photo credit: My wife)

17