Distributed tracing - get a grasp on your production

@nklmish


“the most wanted and missed tool in the microservice world”


Agenda

Why latency?

Distributed tracing

Short demo

Zipkin & core concepts

Code walkthrough


Latency

Every little bit counts

With scale, you see

(source: https://gist.github.com/hellerbarde/2843375)

Latency?

User waiting

Remember, slow pages lose users

Distributed systems - latency analysis

Story time: How Bob met longtail latency

Bob didn’t know he was suffering from longtail latency

Bob trying to troubleshoot longtail latency in a distributed system

Option 1: Log Analysis

Lots of files

Looking in logs

Not everything is on the critical path.

Correlating logs is manual work

It simply doesn’t make sense

Option 2: What about Metrics?


Something is wrong

Can’t tell the cause

Aggregates (avg, stdev) may deceive
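To make the point about aggregates concrete, here is a minimal sketch with made-up latencies (98 fast requests and 2 very slow ones; the numbers are an assumption, not data from the talk). The mean looks harmless while the 99th percentile exposes the slow tail.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = statistics.mean(latencies_ms)   # pulled up only slightly by the outliers
p50_ms = percentile(latencies_ms, 50)     # the typical request
p99_ms = percentile(latencies_ms, 99)     # the tail that some users actually feel
```

The mean alone would suggest every request takes about 30 ms, which describes neither the fast majority nor the slow tail.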


Bob, could we find out how many clients are impacted?

Bob learns about percentiles

Clients impacted by longtail latency…

N: No. of visits (500,000)

D: Delay (50 ms)

S: Highly active service (suffering from longtail latency)

Percentile: 99th => 1 out of 100 visits experiences D

Total visits experiencing delay: N ÷ 100 => 5,000

1 visit (in our distributed system): 8 downstream calls => interacting with S (99% fast & 1% slow)

1 visit encountering latency: 1 - (0.99^8) = 1 - 0.923 => 0.077 ≈ 8% (likelihood)

Total visits affected: 8% of N => 40,000

Impacts:

a. Lot of visits

b. Repeated visits in a day
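The arithmetic on this slide can be checked with a short sketch. It assumes, as the slide does, that each of the 8 downstream calls is independently slow 1% of the time; the function and variable names are illustrative.

```python
def probability_visit_is_slow(p_slow_call, calls_per_visit):
    """Chance that at least one of the downstream calls in a visit hits the slow tail."""
    return 1 - (1 - p_slow_call) ** calls_per_visit


visits = 500_000        # N from the slide
p_visit_slow = probability_visit_is_slow(p_slow_call=0.01, calls_per_visit=8)
affected_visits = visits * p_visit_slow
```

Unrounded, the likelihood is about 7.7%, or roughly 38,600 visits; the slide rounds to 8% and 40,000.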

Boss needs a solution

But we still don’t know…

Request timeline (When it started & which operation)

Log correlation

How the same operation behaved across different clusters/regions/zones.

How much it deviates from the acceptable value.

Call graph

Bob was missing Distributed Tracing

Distributed tracing

Tracks request flow.

Fast reaction (traced data available within minutes)

Dynamically instruments apps.

System insight, critical path, understanding call graphs (which services, which operations, at what time, etc.)

Measuring E2E latency

Call patterns (optimisation) & bug discovery (spotting redundant requests, sync vs async)

How can we apply this knowledge?

Via Tracing system

Tracing system should:

Trace

Have low overhead

Be scalable

Work 24 × 7 × 365 (production bugs are difficult to reproduce)

Shouldn’t:

Rely on programmers’ collaboration

OpenZipkin - open source tracing system

OpenZipkin

Zipkin is:

Distributed tracing system

Created by Twitter

Based on Dapper.

OpenZipkin:

GitHub organisation

Primary fork of Zipkin

Open source

Pluggable architecture

Span

Denotes logical unit of work done (Timestamped)

Work done is expressed in human readable string (operation name)

Created by tracer (instrumenting code)

Slim (KiB or less)

Root span - span without parent id


Zipkin annotations

Client => Server: HTTP Request: get catalog (span starts)

Server => Client: HTTP Response: catalog (span ends)

cs: client send

sr: server received

ss: server send

cr: client received

(Processing time = ss - sr)

(Response time = cr - cs)

(Network latency, request = sr - cs)

(Network latency, response = cr - ss)
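The four annotations and the timings derived from them can be sketched as follows. The class name and the timestamp values are illustrative assumptions; only the formulas come from the slide.

```python
from dataclasses import dataclass


@dataclass
class CoreAnnotations:
    """Timestamps (here in ms) for the four core annotations of one call."""
    cs: int  # client send
    sr: int  # server received
    ss: int  # server send
    cr: int  # client received

    def processing_time(self):
        return self.ss - self.sr

    def response_time(self):
        return self.cr - self.cs

    def request_network_latency(self):
        return self.sr - self.cs

    def response_network_latency(self):
        return self.cr - self.ss


# Hypothetical call: 15 ms on the wire out, 80 ms of server work, 25 ms back.
call = CoreAnnotations(cs=0, sr=15, ss=95, cr=120)
```

The response time seen by the client is the sum of the other three values, which is what makes the breakdown useful for locating where the time went.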


It’s all about trace & span

HTTP Request: get catalog

CatalogService: getCatalog() (traceId: 1, parentId: none, spanId: 1)

PriceService: getPrice() (traceId: 1, parentId: 1, spanId: 2)

ProductService: getProducts() (traceId: 1, parentId: 1, spanId: 3)

Database call (traceId: 1, parentId: 3, spanId: 4)

Data analytic call (traceId: 1, parentId: 3, spanId: 5)

Each entry is a span; together they form the trace.

Trace (E2E latency graph)

DAG of spans, forms latency tree.
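A minimal sketch of how spans link into a tree through their parent ids, using the catalog example from the earlier slide. The dictionary layout and helper names are assumptions for illustration, not Zipkin's data model.

```python
from collections import defaultdict

# Spans from the catalog example; parentId None marks the root span.
spans = [
    {"traceId": 1, "parentId": None, "spanId": 1, "name": "CatalogService: getCatalog()"},
    {"traceId": 1, "parentId": 1, "spanId": 2, "name": "PriceService: getPrice()"},
    {"traceId": 1, "parentId": 1, "spanId": 3, "name": "ProductService: getProducts()"},
    {"traceId": 1, "parentId": 3, "spanId": 4, "name": "Database call"},
    {"traceId": 1, "parentId": 3, "spanId": 5, "name": "Data analytic call"},
]


def build_tree(spans):
    """Return the root span and a mapping of span id to its child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parentId"] is None:
            root = span
        else:
            children[span["parentId"]].append(span)
    return root, children


def render(span, children, depth=0):
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span["name"]]
    for child in children[span["spanId"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


root, children = build_tree(spans)
tree_lines = render(root, children)
```

Printing `"\n".join(tree_lines)` gives the indented call tree that a tracing UI draws as a timeline.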

Demo

https://github.com/nklmish/java-distributed-tracing-demo

https://github.com/nklmish/go-distributed-tracing-demo

Demo application - Zipkin visualises dependencies


Zipkin’s architecture

Instrumented service => Transport => Collector => Storage => API => UI

Instrumented service: collects & converts spans

Transport: Scribe/Kafka

Collector: receives spans; deserialising, sampling & scheduling for storage

Storage (DB): stores spans in Cassandra/MySQL/Elasticsearch

API: retrieves data

UI: visualizes
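As a rough illustration of what an instrumented service hands to the collector, the sketch below serialises one span. It assumes Zipkin's v2 JSON span format sent over HTTP (the slide itself lists Scribe/Kafka as transports), and the helper name and sample values are hypothetical. No network call is made.

```python
import json


def to_zipkin_v2_json(trace_id, span_id, parent_id, name, service, start_us, duration_us, tags=None):
    """Serialise one span as a JSON list, the shape a v2 collector endpoint accepts."""
    span = {
        "traceId": trace_id,                       # lower-hex string
        "id": span_id,                             # lower-hex string
        "name": name,
        "timestamp": start_us,                     # epoch microseconds
        "duration": duration_us,                   # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},
    }
    if parent_id is not None:                      # root spans carry no parentId
        span["parentId"] = parent_id
    return json.dumps([span])


payload = to_zipkin_v2_json(
    trace_id="463ac35c9f6413ad",
    span_id="a2fb4a1d1a96d312",
    parent_id=None,
    name="get catalog",
    service="catalog-service",
    start_us=1_500_000_000_000_000,
    duration_us=120_000,
    tags={"http.path": "/catalog"},
)
```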


Tags

Tag denotes:

key-value pair

Not timestamped

A span may contain zero or more tags


Log

Log denotes:

Event name (marks a meaningful moment in the lifetime of a span)

Timestamped

A span may contain zero or more logs
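The difference between tags and logs can be sketched with a small in-memory span. This is an illustrative model, not the API of any tracer; the class, method names and sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: int
    span_id: int
    name: str
    parent_id: Optional[int] = None
    tags: dict = field(default_factory=dict)   # key-value pairs, not timestamped
    logs: list = field(default_factory=list)   # (timestamp_us, event name) pairs

    def set_tag(self, key, value):
        self.tags[key] = value

    def log(self, timestamp_us, event):
        self.logs.append((timestamp_us, event))

    @property
    def is_root(self):
        return self.parent_id is None


span = Span(trace_id=1, span_id=1, name="getCatalog")
span.set_tag("http.path", "/catalog")   # context that applies to the whole span
span.log(1_000, "cache.miss")           # a moment within the span's lifetime
```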

Annotations

Help explain latency with a timestamp.

Annotations are often codes, e.g. sr, cs, etc.

Binary Annotations

Tags a span with context, usually to support query or aggregation. (e.g. http.path)

Repeatable, and vary by host.

Can I have large spans (e.g. MiB)?

They decrease usability & increase the cost of the tracing system

Beware of clock skew!!!

10:00 10:00

10:00:01 10:00:22
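A small sketch of why skew matters, assuming the two times on the slide are the client's and the server's clocks read at the same instant (an interpretation; the slide shows only the clock faces). Because cs and sr are stamped by different hosts, the clock offset leaks straight into sr - cs.

```python
def apparent_network_latency(true_latency_s, server_clock_offset_s):
    """Compute sr - cs when the server clock is offset from the client clock."""
    cs = 36_001.0                                       # 10:00:01 on the client clock, seconds since midnight
    sr = cs + true_latency_s + server_clock_offset_s    # stamped with the server's own clock
    return sr - cs


# Server clock 21 s ahead (10:00:22 vs 10:00:01): a 50 ms hop looks like 21 s.
ahead = apparent_network_latency(true_latency_s=0.05, server_clock_offset_s=21)

# Server clock 21 s behind: the request appears to arrive before it was sent.
behind = apparent_network_latency(true_latency_s=0.05, server_clock_offset_s=-21)
```

Durations measured on a single host (such as ss - sr) are unaffected, which is why they are the safer numbers to trust.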


Tracer

Does most of the heavy lifting e.g. span creation, context generation, passing info, data propagation, etc.
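One piece of that heavy lifting, context propagation, can be sketched as below. It assumes the B3 HTTP headers used by Zipkin-compatible tracers (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled); the function names and the dictionary-based context are hypothetical, not a real tracer API.

```python
import secrets


def new_root_context(sampled=True):
    """Start a new trace: fresh trace id and span id, no parent."""
    return {"trace_id": secrets.token_hex(8), "span_id": secrets.token_hex(8),
            "parent_id": None, "sampled": sampled}


def child_of(ctx):
    """Context for a downstream call: same trace, new span id, parent is the current span."""
    return {"trace_id": ctx["trace_id"], "span_id": secrets.token_hex(8),
            "parent_id": ctx["span_id"], "sampled": ctx["sampled"]}


def inject(ctx):
    """Turn a context into outgoing HTTP headers."""
    headers = {"X-B3-TraceId": ctx["trace_id"], "X-B3-SpanId": ctx["span_id"],
               "X-B3-Sampled": "1" if ctx["sampled"] else "0"}
    if ctx["parent_id"] is not None:
        headers["X-B3-ParentSpanId"] = ctx["parent_id"]
    return headers


def extract(headers):
    """Rebuild the context on the receiving service."""
    return {"trace_id": headers["X-B3-TraceId"], "span_id": headers["X-B3-SpanId"],
            "parent_id": headers.get("X-B3-ParentSpanId"),
            "sampled": headers.get("X-B3-Sampled") == "1"}


root = new_root_context()
outgoing = inject(child_of(root))   # headers attached to the downstream request
received = extract(outgoing)        # what the downstream service sees
```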


Sampling

Controls how much to record

High-traffic systems: a fraction of the traffic is enough

Low-traffic systems: adjust based on your needs

Note: Debug spans are always recorded.
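A minimal sketch of probability-based sampling with the debug override mentioned in the note. The class is illustrative and not Zipkin's or Sleuth's sampler; real tracers typically make the decision once per trace and propagate it downstream.

```python
import random


class ProbabilitySampler:
    """Record roughly `rate` of all traces; debug requests are always recorded."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self.rng = rng or random.Random()

    def should_record(self, debug=False):
        if debug:
            return True
        return self.rng.random() < self.rate


# High-traffic service: keep about 10% of traces.
sampler = ProbabilitySampler(rate=0.1, rng=random.Random(42))
recorded = sum(sampler.should_record() for _ in range(10_000))
```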

OpenTracing

Standardise tracing

Vendor neutral tracing API

Implementations available in 6 languages

http://opentracing.io/documentation/

Spring Cloud Sleuth Zipkin

Brings distributed tracing to Spring Cloud

Spring Cloud Starter Zipkin (Zipkin + Sleuth)

Supports

Hystrix

Async

RestTemplate

Feign

Zuul

Spring Integration

…

http://tiny.cc/scs-doc

Code Walkthrough

https://github.com/nklmish/java-distributed-tracing-demo

Who uses tracing

http://tiny.cc/tracing-impl


Zipkin & Prometheus


Zipkin for…

Summary: Latency is never zero, embrace it

Summary

Distributed systems are hard to reason about; call graphs are complex

Distributed tracing helps analyse E2E latency & understand call graphs

Instrumentation is tricky (async, thread pool, callbacks, etc.)

OpenZipkin provides:

Open source tracing system

Visualisation of request flow

Spring Cloud Sleuth brings tracing to the Spring world

OpenTracing - aims to standardise tracing

Thank You

Questions?

Source Code => http://tiny.cc/tracing

Slides => http://tiny.cc/tracing-slides

Review =>