Monitoring to the Nth (tier) ...or, State of Distributed Tracing 2016 Dan Kuebrich CTO AppNeta @dkuebric

Monitoring to the Nth tier: The state of distributed tracing in 2016

Page 1: Monitoring to the Nth tier: The state of distributed tracing in 2016

Monitoring to the Nth (tier)...or, State of Distributed Tracing 2016

Dan Kuebrich, CTO, AppNeta, @dkuebric

Page 2: Monitoring to the Nth tier: The state of distributed tracing in 2016

Outline

● What is distributed tracing?

● Who’s doing it, and how?

● Challenges, and future directions?

Page 4: Monitoring to the Nth tier: The state of distributed tracing in 2016

● Frontend web app: PHP

● Text search: lucene-based, via thrift

● Pricing service: erlang, via thrift

● Content provider search: ruby, via thrift

● Spelling corrector: python bindings around xapian, via thrift

● ...

Thrift Shop

Page 5: Monitoring to the Nth tier: The state of distributed tracing in 2016

[Architecture diagram. Components shown: fw1, fw2 (perlbal); app1, app server (Apache/PHP); cache (memcached); search (lucene); API search (ruby); pricing (erlang); spelling (python); db1, db2 (MySQL); APIs; ...]

Page 7: Monitoring to the Nth tier: The state of distributed tracing in 2016

Q: Why do you remember this so well?

A: ops

Page 8: Monitoring to the Nth tier: The state of distributed tracing in 2016

“Close enough” architectural diagram

https://www.flickr.com/photos/clonedmilkmen/3604999084

Page 9: Monitoring to the Nth tier: The state of distributed tracing in 2016

Things we had

● Ganglia

● Nagios

● Thrift

○ Per-service status page

○ Service status page

● Logs

Page 10: Monitoring to the Nth tier: The state of distributed tracing in 2016

1. Hit refresh N times -- how many times were problematic?

2. Are any services outright down?

3. Systematically tail the logs of every service on every machine

4. Check mysql running processes

5. SSH in and poke around

6. Deploy debug logging

7. Pray

Sample debug workflow

Page 11: Monitoring to the Nth tier: The state of distributed tracing in 2016

X-Trace

Page 12: Monitoring to the Nth tier: The state of distributed tracing in 2016

Instrumentation points and request flow

[Diagram. Components shown: load balancer, web servers, applications, database, cache, service, 3rd party API]

Page 14: Monitoring to the Nth tier: The state of distributed tracing in 2016

Spans

Page 15: Monitoring to the Nth tier: The state of distributed tracing in 2016

Spans
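
As a rough illustration of what a span carries, here is a minimal sketch. The field names and the sample values are assumptions made for this example, not the schema of any particular tracing system; the general idea of a trace ID shared by all spans plus a parent pointer per span follows the ID-propagation approach discussed in this talk.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One timed operation within a trace (illustrative field names)."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

def children_of(spans, parent_id):
    """Return the spans whose parent is the given span ID."""
    return [s for s in spans if s.parent_id == parent_id]

# Made-up sample trace: one web request with two downstream calls.
SAMPLE_TRACE = [
    Span("t1", "1", None, "GET /search", 0.0, 120.0),
    Span("t1", "2", "1", "lucene query", 10.0, 90.0),
    Span("t1", "3", "1", "memcached get", 95.0, 100.0),
]
```

Walking parent pointers from the root reconstructs the request tree, which is what a trace view renders.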

Page 16: Monitoring to the Nth tier: The state of distributed tracing in 2016

Great minds…

Distributed tracing based on ID propagation

● Google Dapper (200x? Published paper 2010)

● Twitter Zipkin (Open-sourced 2012)

● Etsy (2014ish)

● Others

Commercial APM -- some distributed tracing

● New Relic

● AppDynamics

● DynaTrace

Page 17: Monitoring to the Nth tier: The state of distributed tracing in 2016

Instrumentation points and request flow

[Diagram. Components shown: load balancer, web servers, applications, database, cache, service, 3rd party API]

Page 18: Monitoring to the Nth tier: The state of distributed tracing in 2016

Challenges: Instrumentation Points

def interesting_method():
    log_entry(...)
    _do_stuff()
    log_exit(...)
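
One way to avoid hand-writing the entry and exit calls in every method is a decorator. This is a minimal sketch, not the approach of any specific tracer: the name traced and the in-memory EVENTS list are hypothetical stand-ins for whatever the tracing system actually collects.

```python
import functools
import time

# Hypothetical event sink; a real tracer would ship these to a collector.
EVENTS = []

def traced(func):
    """Record an entry and an exit event around each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        EVENTS.append(("entry", func.__name__, time.monotonic()))
        try:
            return func(*args, **kwargs)
        finally:
            # The exit event is recorded even if func raises.
            EVENTS.append(("exit", func.__name__, time.monotonic()))
    return wrapper

@traced
def interesting_method():
    return "done"
```

The instrumented function body stays free of tracing calls, which keeps the number of hand-maintained instrumentation points down.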

Page 19: Monitoring to the Nth tier: The state of distributed tracing in 2016

OpenTracing

● Problematic to tie instrumentation to tracing system

● There is no one system that’s perfect for everyone

● So instrumentation that ties you to a system is bad

● Either have it be automatically injected (industry)

● … or obey a common interface so it’s pluggable

● OpenTracing v1 goal: provide the interface for portable instrumentation
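
The "common interface" idea can be sketched as follows. This is not the OpenTracing API; Tracer, NoopTracer, InMemoryTracer and record are hypothetical names chosen only to show instrumentation written against an interface with swappable backends.

```python
from abc import ABC, abstractmethod

class Tracer(ABC):
    """Hypothetical common interface that instrumentation codes against."""

    @abstractmethod
    def record(self, operation, duration_ms):
        """Report one finished operation to the backend."""

class NoopTracer(Tracer):
    """Backend that discards everything, e.g. when tracing is disabled."""

    def record(self, operation, duration_ms):
        pass

class InMemoryTracer(Tracer):
    """Backend that keeps records in a list, e.g. for tests."""

    def __init__(self):
        self.records = []

    def record(self, operation, duration_ms):
        self.records.append((operation, duration_ms))

def handle_request(tracer):
    # Instrumentation depends only on the Tracer interface,
    # so the backend can be swapped without touching this code.
    tracer.record("handle_request", 12.5)
    return "ok"
```

Swapping InMemoryTracer for NoopTracer changes where the data goes without changing handle_request, which is the portability property the slide describes.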

Page 20: Monitoring to the Nth tier: The state of distributed tracing in 2016

Challenges: Trace ID Propagation

def http_rpc_call():
    log_entry(...)
    _do_get(modified_headers, ...)
    log_exit(...)
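
The modified_headers step above might look like the sketch below. The header name X-Example-Trace-Id is an assumption for illustration (each real system defines its own), and inject_trace_id and extract_trace_id are hypothetical helpers. A plain dict is used for headers, so lookups here are case-sensitive, unlike real HTTP headers.

```python
import uuid

# Hypothetical header name; real systems each define their own.
TRACE_HEADER = "X-Example-Trace-Id"

def inject_trace_id(headers, trace_id):
    """Return a copy of the outgoing headers carrying the trace ID."""
    modified_headers = dict(headers)
    modified_headers[TRACE_HEADER] = trace_id
    return modified_headers

def extract_trace_id(headers):
    """Read the trace ID on the receiving side, or start a new trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex
```

The caller injects before the outgoing request and the callee extracts on arrival, so both sides log against the same ID.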

Page 21: Monitoring to the Nth tier: The state of distributed tracing in 2016

def interesting_method(trace_id):
    log_entry(trace_id, ...)
    _do_stuff()
    log_exit(trace_id, ...)

Challenges: Trace ID Propagation

Page 22: Monitoring to the Nth tier: The state of distributed tracing in 2016

Challenges: Extracting Value

Page 23: Monitoring to the Nth tier: The state of distributed tracing in 2016

Distributed tracing “only”

● Follow request flow through application

● Understand end-to-end latency

● Associate backend load with frontend requests

● Provide errors with distributed context

But... as long as you’re in there...

● Latency of queries, RPC calls, in each tier

● Slow code

● Cache hit/miss ratio

● Errors and exceptions

● Custom tagging/categorization of data

● ...

Rich data set
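
To make the "rich data set" point concrete, here is a small sketch that derives per-tier latency and a cache hit ratio from span records. The record layout (tier, start_ms, end_ms, hit) and the sample values are invented for this example.

```python
from collections import defaultdict

def latency_by_tier(spans):
    """Sum span durations (ms) per tier."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["tier"]] += span["end_ms"] - span["start_ms"]
    return dict(totals)

def cache_hit_ratio(spans):
    """Fraction of cache spans tagged as hits, or None if there are none."""
    cache_spans = [s for s in spans if s["tier"] == "cache"]
    if not cache_spans:
        return None
    hits = sum(1 for s in cache_spans if s.get("hit"))
    return hits / len(cache_spans)

# Made-up spans from a single request.
SAMPLE_SPANS = [
    {"tier": "cache", "start_ms": 0, "end_ms": 2, "hit": True},
    {"tier": "cache", "start_ms": 2, "end_ms": 4, "hit": False},
    {"tier": "db", "start_ms": 4, "end_ms": 34},
    {"tier": "search", "start_ms": 34, "end_ms": 84},
]
```

The same records that reconstruct request flow also answer aggregate questions, with no extra instrumentation.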

Page 29: Monitoring to the Nth tier: The state of distributed tracing in 2016

Context propagation: beyond performance

● Baggage

● Deadlines

● Auth/load attribution

● Flow control?
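
As an illustration of carrying more than a trace ID, the sketch below attaches a deadline and baggage to a request context. The dict layout and the helper names are hypothetical. The deadline uses a monotonic clock, which is only meaningful inside one process; across hosts one would more likely send the remaining time budget instead.

```python
import time

def make_context(trace_id, timeout_s, baggage=None):
    """Build a hypothetical request context with an absolute deadline."""
    return {
        "trace_id": trace_id,
        "deadline": time.monotonic() + timeout_s,
        "baggage": dict(baggage or {}),
    }

def remaining_s(context):
    """Seconds left before the deadline, never negative."""
    return max(0.0, context["deadline"] - time.monotonic())

def call_downstream(context):
    # Refuse work the caller has already given up on.
    if remaining_s(context) == 0.0:
        raise TimeoutError("deadline exceeded before downstream call")
    # Baggage travels with the request, e.g. for load attribution.
    return context["baggage"].get("customer", "unknown")
```

A downstream tier that checks the deadline first can skip work whose result nobody is waiting for.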

Page 30: Monitoring to the Nth tier: The state of distributed tracing in 2016

OFFICE HOURS

3pm

MORE INFO

Booth #713 & back of the room

@dkuebric

Thanks!