@nklmish Distributed tracing - get a grasp on your production “the most wanted and missed tool in the microservice world”

Distributed tracing - get a grasp on your production


Page 1: Distributed tracing - get a grasp on your production

@nklmish

Distributed tracing -

get a grasp on your production

“the most wanted and missed tool in the microservice world”

Page 2: Distributed tracing - get a grasp on your production

@nklmish

Agenda

Why latency?

Distributed tracing

Short demo

Zipkin & core concepts

Code walkthrough

Page 3: Distributed tracing - get a grasp on your production

@nklmish

Latency

Page 4: Distributed tracing - get a grasp on your production

@nklmish

Every little bit counts

Page 5: Distributed tracing - get a grasp on your production

@nklmish

With scale, you see

(source: https://gist.github.com/hellerbarde/2843375)

Page 6: Distributed tracing - get a grasp on your production

@nklmish

Latency?

Page 7: Distributed tracing - get a grasp on your production

@nklmish

User waiting

Page 8: Distributed tracing - get a grasp on your production

@nklmish

Remember, slow pages lose users

Page 9: Distributed tracing - get a grasp on your production

@nklmish

Distributed systems - latency analysis

Page 10: Distributed tracing - get a grasp on your production

@nklmish

Story time: How Bob met long-tail latency

Page 11: Distributed tracing - get a grasp on your production

@nklmish

Bob didn’t know he was suffering from long-tail latency

Page 12: Distributed tracing - get a grasp on your production

@nklmish

Bob trying to troubleshoot long-tail latency in a distributed system

Page 13: Distributed tracing - get a grasp on your production

@nklmish

Option 1: Log Analysis

Page 14: Distributed tracing - get a grasp on your production

@nklmish

Lots of files

Page 15: Distributed tracing - get a grasp on your production

@nklmish

Looking in logs

Page 16: Distributed tracing - get a grasp on your production

@nklmish

Not everything is on the critical path.

Page 17: Distributed tracing - get a grasp on your production

@nklmish

Correlating logs is manual work

Page 18: Distributed tracing - get a grasp on your production

@nklmish

It simply doesn’t make sense

Page 19: Distributed tracing - get a grasp on your production

@nklmish

Option 2: What about Metrics?

(source: https://gist.github.com/hellerbarde/2843375)

Page 20: Distributed tracing - get a grasp on your production

@nklmish

Something is wrong

(source: https://gist.github.com/hellerbarde/2843375)

Page 21: Distributed tracing - get a grasp on your production

@nklmish

Can’t tell the cause

(source: https://gist.github.com/hellerbarde/2843375)

?

Page 22: Distributed tracing - get a grasp on your production

@nklmish

Aggregates (avg, stdev) may deceive

(source: https://gist.github.com/hellerbarde/2843375)

Page 23: Distributed tracing - get a grasp on your production

@nklmish

Bob, could we find out how many clients are impacted?

Page 24: Distributed tracing - get a grasp on your production

@nklmish

Bob learns about percentiles
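To make the point about deceptive aggregates concrete, here is a minimal sketch comparing the mean with percentiles. The latency values are invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank method (other percentile definitions interpolate and give slightly different numbers).

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = -(-p * len(ordered) // 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [10] * 98 + [1000] * 2

print("mean  :", statistics.mean(latencies_ms))
print("median:", percentile(latencies_ms, 50))
print("p99   :", percentile(latencies_ms, 99))
```

In this made-up sample the mean sits well below what the slowest requests experience, while the 99th percentile exposes the slow tail directly.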

Page 25: Distributed tracing - get a grasp on your production

@nklmish

Clients impacted by long-tail latency…

N: number of visits (500,000)

D: delay (50 ms)

S: highly active service (suffering from long-tail latency)

Percentile: 99th => 1 out of 100 visits experiences D

Total visits experiencing delay: N ÷ 100 => 5,000

1 visit (in our distributed system): 8 downstream calls => interacting with S (99% fast & 1% slow)

1 visit encountering latency: 1 - (0.99^8) = 1 - 0.923 => 0.077 ≈ 8% (likelihood)

Total visits affected: 8% of N => 40,000

Impacts: a. Lots of visits b. Repeated visits in a day
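The arithmetic on this slide can be reproduced with a few lines. This is a sketch that assumes the 8 downstream calls are independent of each other; `chance_of_slow_visit` is a name introduced here for illustration.

```python
def chance_of_slow_visit(slow_fraction, downstream_calls):
    """Probability that at least one of the downstream calls is slow."""
    return 1 - (1 - slow_fraction) ** downstream_calls


visits = 500_000          # N from the slide
slow_fraction = 0.01      # S is slow for 1% of calls (99th percentile)
calls_per_visit = 8       # each visit makes 8 downstream calls to S

likelihood = chance_of_slow_visit(slow_fraction, calls_per_visit)
affected = visits * likelihood

print(f"likelihood of a slow visit: {likelihood:.3f}")
print(f"visits affected: {affected:,.0f}")
```

The unrounded result is a little under the slide's 40,000, because the slide rounds the likelihood up to 8% before multiplying by N.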

Page 26: Distributed tracing - get a grasp on your production

@nklmish

Boss needs a solution

Page 27: Distributed tracing - get a grasp on your production

@nklmish

But we still don’t know…

Request timeline (When it started & which operation)

Log correlation

How the same operation behaved across different clusters/regions/zones.

How much it deviates from the acceptable value.

Call graph

Page 28: Distributed tracing - get a grasp on your production

@nklmish

Bob was missing Distributed Tracing

Page 29: Distributed tracing - get a grasp on your production

@nklmish

Distributed tracing

Tracks request flow.

Fast reaction (traced data available within minutes)

Dynamically instruments apps.

System insight, critical path, understanding call graphs (which services, which operations, at what time, etc.)

Measuring E2E latency

Call patterns (optimisation) & bug discovery (spotting redundant requests, sync vs async)

Page 30: Distributed tracing - get a grasp on your production

@nklmish

How can we apply this knowledge

Page 31: Distributed tracing - get a grasp on your production

@nklmish

Via Tracing system

Tracing system should:

Trace

Have Low overhead

Be scalable

Work 24/7, 365 days a year (production bugs are difficult to reproduce)

Shouldn’t:

Rely on programmers’ collaboration

Page 32: Distributed tracing - get a grasp on your production

@nklmish

OpenZipkin - OpenSource tracing system

Page 33: Distributed tracing - get a grasp on your production

@nklmish

OpenZipkin

Zipkin is:

Distributed tracing system

Created by Twitter

Based on Dapper.

OpenZipkin:

GitHub organisation

Primary fork of Zipkin

Open source

Pluggable architecture

Page 34: Distributed tracing - get a grasp on your production

@nklmish

Span

Denotes a logical unit of work (timestamped)

Work done is expressed as a human-readable string (operation name)

Created by tracer (instrumenting code)

Slim (KiB or less)

Root span - span without parent id

Page 35: Distributed tracing - get a grasp on your production

@nklmish

Zipkin annotations

Client → Server

cs: client send (HTTP Request: get catalog, span starts)

sr: server received

ss: server send

cr: client received (HTTP Response: catalog, span ends)

Processing time = ss - sr

Response time = cr - cs

Network latency (client to server) = sr - cs

Network latency (server to client) = cr - ss
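The four formulas above can be sketched as a small calculation. `RpcAnnotations` and `timings` are names made up for this example, and the timestamps are invented; this is not Zipkin's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RpcAnnotations:
    """The four core annotation timestamps of one RPC span, in microseconds."""
    cs: int  # client send
    sr: int  # server received
    ss: int  # server send
    cr: int  # client received


def timings(a):
    """Derive the durations listed on the slide from the four timestamps."""
    return {
        "response_time": a.cr - a.cs,
        "processing_time": a.ss - a.sr,
        "network_request": a.sr - a.cs,
        "network_response": a.cr - a.ss,
    }


# Hypothetical timestamps, relative to the moment the client sent the request.
example = RpcAnnotations(cs=0, sr=2_000, ss=47_000, cr=50_000)
print(timings(example))
```

Response time and processing time each use timestamps from a single host, so they do not depend on the two clocks agreeing. The two network legs mix client and server timestamps, so they are only meaningful when the clocks are in sync.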

Page 36: Distributed tracing - get a grasp on your production

@nklmish

It’s all about trace & span

HTTP Request: get catalog

CatalogService: getCatalog() (traceId: 1, parentId: (none), spanId: 1)

PriceService: getPrice() (traceId: 1, parentId: 1, spanId: 2)

ProductService: getProducts() (traceId: 1, parentId: 1, spanId: 3)

Database call (traceId: 1, parentId: 3, spanId: 4)

Data analytic call (traceId: 1, parentId: 3, spanId: 5)

Each line above is a span; all five spans together form the trace.
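The parent ids are all that is needed to rebuild the call tree. A minimal sketch using the five spans from this slide (the tuple layout and the helper names `build_children` and `render` are assumptions made for the example):

```python
from collections import defaultdict

# The spans from the slide as (span_id, parent_id, name); parent_id None marks the root.
spans = [
    (1, None, "CatalogService: getCatalog()"),
    (2, 1, "PriceService: getPrice()"),
    (3, 1, "ProductService: getProducts()"),
    (4, 3, "Database call"),
    (5, 3, "Data analytic call"),
]


def build_children(spans):
    """Index spans by parent id so the tree can be walked from the root."""
    children = defaultdict(list)
    for span_id, parent_id, name in spans:
        children[parent_id].append((span_id, name))
    return children


def render(children, parent_id=None, depth=0):
    """Return the tree as indented lines, depth-first from the given parent."""
    lines = []
    for span_id, name in children.get(parent_id, []):
        lines.append("  " * depth + f"{name} (spanId: {span_id})")
        lines.extend(render(children, span_id, depth + 1))
    return lines


print("\n".join(render(build_children(spans))))
```

Spans can arrive at the tracing system in any order, which is why the sketch indexes by parent id first and only then walks the tree.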

Page 37: Distributed tracing - get a grasp on your production

@nklmish

Trace (E2E latency graph)

A DAG of spans that forms a latency tree.

Page 38: Distributed tracing - get a grasp on your production

@nklmish

Demo

https://github.com/nklmish/java-distributed-tracing-demo

https://github.com/nklmish/go-distributed-tracing-demo

Page 39: Distributed tracing - get a grasp on your production

@nklmish

Demo application - Zipkin visualises dependencies

Page 40: Distributed tracing - get a grasp on your production

@nklmish

Zipkin’s architecture

Service (instrumented) → Transport → Collector → Storage → API → UI

Service (instrumented): collect & convert spans

Transport: Scribe/Kafka

Collector: receive spans; deserialising, sampling & scheduling for storage

Storage (DB): store spans; Cassandra/MySQL/Elasticsearch

API: retrieves data

UI: visualise

Page 41: Distributed tracing - get a grasp on your production

@nklmish

Tags

Tag denotes:

key-value pair

Not timestamped

A span may contain zero or more tags

Page 42: Distributed tracing - get a grasp on your production

@nklmish

Log

Log denotes:

Event name (marks a meaningful moment in the lifetime of a span)

Timestamped

A span may contain zero or more logs
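The difference between tags and logs is easiest to see side by side. The `Span` class below is a hypothetical minimal model written for this example, not the Zipkin or OpenTracing API; the tag value and event names are invented.

```python
from dataclasses import dataclass, field
import time


@dataclass
class Span:
    """Hypothetical minimal span: an operation name plus tags and logs."""
    operation: str
    tags: dict = field(default_factory=dict)   # key-value pairs, no timestamp
    logs: list = field(default_factory=list)   # (timestamp, event name) pairs

    def set_tag(self, key, value):
        self.tags[key] = value

    def log(self, event, timestamp=None):
        # Each log entry records when the event happened.
        when = time.time() if timestamp is None else timestamp
        self.logs.append((when, event))


span = Span("getCatalog")
span.set_tag("http.path", "/catalog")
span.log("cache.miss")
span.log("db.query.start")
print(span.tags, [event for _, event in span.logs])
```

Tags describe the span as a whole and are what you would search on; logs mark individual moments inside it.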

Page 43: Distributed tracing - get a grasp on your production

@nklmish

Annotations

Help explain latency with a timestamp.

Annotations are often codes, e.g. sr, cs.

Page 44: Distributed tracing - get a grasp on your production

@nklmish

Binary Annotations

Tag a span with context, usually to support query or aggregation (e.g. http.path).

Repeatable, and may vary by host.

Page 45: Distributed tracing - get a grasp on your production

@nklmish

Can I have large spans (e.g. MiB)?

They decrease usability & increase the cost of the tracing system

Page 46: Distributed tracing - get a grasp on your production

@nklmish

Beware of clock skew!!!

10:00 10:00

Page 47: Distributed tracing - get a grasp on your production

@nklmish

Beware of clock skew!!!

10:00:01 10:00:22
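A rough illustration of what the two clocks on this slide do to span timings. The numbers are invented to match a 21-second offset, and `looks_skewed` is a hypothetical sanity check, not how any particular tracing system corrects skew.

```python
def looks_skewed(cs, sr, ss, cr):
    """True when the server-side timestamps fall outside the client-side window."""
    return not (cs <= sr <= ss <= cr)


# Hypothetical timestamps in ms since 10:00:00 on each host's own clock.
# The server clock runs 21 s ahead of the client clock.
cs, cr = 1_000, 1_050            # client clock
sr, ss = 22_002, 22_047          # server clock

print("request leg :", sr - cs, "ms")
print("response leg:", cr - ss, "ms")
print("skew suspected:", looks_skewed(cs, sr, ss, cr))
```

The request leg appears to take about 21 seconds and the response leg comes out negative, although the client saw the whole call finish in 50 ms. A check like this only catches skew that is larger than the real network latency; smaller offsets silently distort the numbers.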

Page 48: Distributed tracing - get a grasp on your production

@nklmish

Tracer

Does most of the heavy lifting e.g. span creation, context generation, passing info, data propagation, etc.
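As a sketch of the "passing info" part, the snippet below carries a trace context between services in HTTP headers. The header names are the B3 names used by Zipkin-compatible tracers; everything else (the dict-based span, `start_span`, `outgoing_headers`) is a simplification invented here. Real tracers differ in detail, for example in how client and server spans relate, and real HTTP header lookup is case-insensitive.

```python
import secrets

# B3 header names used by Zipkin-compatible tracers.
TRACE_ID = "X-B3-TraceId"
SPAN_ID = "X-B3-SpanId"
PARENT_ID = "X-B3-ParentSpanId"


def new_id():
    """Random 64-bit id as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def start_span(incoming_headers):
    """Continue the caller's trace if headers are present, else start a new one."""
    trace_id = incoming_headers.get(TRACE_ID)
    if trace_id is None:
        return {"trace_id": new_id(), "span_id": new_id(), "parent_id": None}
    return {
        "trace_id": trace_id,
        "span_id": new_id(),
        "parent_id": incoming_headers.get(SPAN_ID),
    }


def outgoing_headers(span):
    """Headers to attach to a downstream call made while this span is active."""
    headers = {TRACE_ID: span["trace_id"], SPAN_ID: span["span_id"]}
    if span["parent_id"] is not None:
        headers[PARENT_ID] = span["parent_id"]
    return headers


root = start_span({})                        # first service: no incoming context
child = start_span(outgoing_headers(root))   # downstream service
print(child["trace_id"] == root["trace_id"], child["parent_id"] == root["span_id"])
```

The trace id stays the same across every hop, while each hop gets its own span id and remembers its caller as the parent.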

Page 49: Distributed tracing - get a grasp on your production

@nklmish

Sampling

Controls how much to record

High-traffic systems: a fraction of the traffic is enough

Low-traffic systems: adjust based on your needs

Note: Debug spans are always recorded.
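One possible shape for a sampling decision, as a sketch. It assumes integer trace ids that are spread evenly, and it decides from the trace id so that the same trace always gets the same answer; `should_record` and the bucket count are choices made for this example, not Zipkin's implementation.

```python
def should_record(trace_id, rate, debug=False):
    """Decide whether to record a trace.

    trace_id: integer trace id; rate: fraction of traces to keep (0.0 to 1.0).
    """
    if debug:
        return True                      # debug spans are always recorded
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    buckets = 10_000
    # Keep the trace when its id falls into the first rate * buckets buckets.
    return trace_id % buckets < round(rate * buckets)


kept = sum(should_record(trace_id, 0.1) for trace_id in range(100_000))
print("kept", kept, "of 100000 traces")
```

Deciding per trace rather than per span matters: if each service rolled its own dice, most traces would end up with missing spans.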

Page 50: Distributed tracing - get a grasp on your production

@nklmish

OpenTracing

Standardises tracing

Vendor-neutral tracing API

Implementations available in 6 languages

http://opentracing.io/documentation/

Page 51: Distributed tracing - get a grasp on your production

@nklmish

Spring Cloud Sleuth Zipkin

Brings distributed tracing to Spring Cloud

Spring Cloud Starter Zipkin (Zipkin + Sleuth)

Supports

Hystrix

Async

RestTemplate

Feign

Zuul

Spring Integration

http://tiny.cc/scs-doc

Page 52: Distributed tracing - get a grasp on your production

@nklmish

Code Walkthrough

https://github.com/nklmish/java-distributed-tracing-demo

https://github.com/nklmish/go-distributed-tracing-demo

Page 53: Distributed tracing - get a grasp on your production

@nklmish

Who uses tracing

http://tiny.cc/tracing-impl

Page 54: Distributed tracing - get a grasp on your production

@nklmish

Zipkin & Prometheus

Page 55: Distributed tracing - get a grasp on your production

@nklmish

Zipkin for…

Page 56: Distributed tracing - get a grasp on your production

@nklmish

Summary : Latency is never zero, embrace it

Page 57: Distributed tracing - get a grasp on your production

@nklmish

Summary

Distributed systems are hard to reason about; call graphs are complex

Distributed tracing helps to analyse E2E latency & understand call graphs

Instrumentation is tricky (async, thread pools, callbacks, etc.)

OpenZipkin provides:

An open source tracing system

Visualisation of request flow

Spring Cloud Sleuth brings tracing to the Spring world

OpenTracing - aims to standardise tracing

Page 58: Distributed tracing - get a grasp on your production

@nklmish

Thank You

Questions?

http://tiny.cc/tracing

Slides => http://tiny.cc/tracing-slides

Review =>

Source Code