33
© 2019 Bloomberg Finance L.P. All rights reserved. Tracing Real-time Distributed Systems SRECon EMEA 2019 03 Oct 2019 Evgeny Yakimov @TheEvDev SRE @ Bloomberg 1

SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

  • Upload
    others

  • View
    9

  • Download
    2

Embed Size (px)

Citation preview

Page 1: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Tracing Real-time Distributed Systems

SRECon EMEA 201903 Oct 2019

Evgeny Yakimov @TheEvDevSRE @ Bloomberg

1

Page 2: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Overview

● About Bloomberg Who are you?

● Recap of Distributed Tracing What’s this new craze about?

● What tracing systems look like I only read the abstract...

● Tracing at Bloomberg What’s your spin on it?

● What’s “Real-time Data Streaming”? Why are you so special?

● What challenges we faced? Was it really that hard?

● What’s next? Where are you going with this?

2

Page 3: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

About Bloomberg

3

Page 4: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

What is the Bloomberg Terminal?

4

Page 5: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

About Bloomberg

● Around 20,000 employees globally with 5,500+ engineers

● One of the largest private networks in the world

● 100 billion market data “ticks” processed daily

● Focus on innovation and long-term growth

● Philanthropy at the heart of our organisation

Bloomberg is a technology company that aims to facilitate financial decision making by providing transparency to global markets.

5

Page 6: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

Recap of Distributed Tracing

6

Page 7: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

©2018 Google LLC, used with permission. Google and the Google logo are registered trademarks of Google LLC.

Example - A Search Engine

Input Query

Main Search

News Search

Classification

Entity Lookup

Media Lookup

7

Page 8: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

(Possible) Architecture View Trace View

Load Balancer

Classification Service

Entity Info Service

Media Service

Main Search Service News Service

Application Server

8

Page 9: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Trace Concepts - Span Model

Trace

The journey of an input/event

Span

The work done in some component along the trace

Carrier

A protocol (such as RPC/Messaging Layer) which communicates

trace-aware messages

9

Page 10: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

What tracing systems look like

10

Page 11: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Node

Node

Tracing Backend

Distribution

Service

App

Tracing Architecture (One Interpretation)

Tracer

Tracer TracingAgent

TracingAgent

Storage

Analytics

Alerting Monitoring

Tracing Tools

SDK

SDK

11

RPC Server Framework

RPC Client Library

Page 12: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Using Trace

● Triage/Debugging

● Analysis

● Visualization

● Metrics?

● ...

12

Page 13: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

Tracing at Bloomberg

13

Page 14: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Tracing Model & Implementation

● Mostly following OpenTracing specification

● Custom library implementations in targeted languages,

primarily C++ due to widespread usage

● Own agents and distribution

○ Built on-top of existing telemetry infrastructure (metrics, logs)

● Integration with open source tooling

○ JaegerUI reads from our back-end

14

Page 15: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Our Journey

● Started with learning from Dapper

● Tried targeting isolated sub-systems

● Need top-down complete coverage

● Minimize work for app teams

● Build tracing into app frameworks

● Add tracing support for middleware

What’s Next for Tracing?

● Instrument more middleware

● More app frameworks/languages

● Improve our storage

● More flexible head-based sampling

● Tail-based sampling and analysis

● Increase adoption by training

15

Page 16: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

Real-time Data Stream Applications

16

Page 17: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

A Trading System

I’d like to buy some stocks!

17

Page 18: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Tracing Today Client-side

Architecture - Simple View

Data Processing PipelineMarket Data

Connectivity

Application

Trading System

Order Data

Front-end Publisher

FIX

Analytics

Global

(100 Billion/Day)

(100 Million/Day)

(10 Million/Day)

(100 Million/Day)

(100 Million/Day)

18

Page 19: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

System Characteristics

Real-time Streaming

Most traffic uses event-based streaming protocols (usually no polling)

Session Based Subscriptions

Stateful sessions/active connections are common, initial load, caching

Fan-Out/Fan-in Distribution Patterns

For complex intersecting dataflows

Flow Control

Commonly utilise: throttling, batching, chunking, conflation

19

Page 20: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Front-end Publisher

How we traced it?

Trading System

Front-end Publisher

● We used the Span Model

Messaging Library

Messaging Library

Msg(payload, traceContext)

createSpan()

createSpan()

● Embed tracing into our communication libraries● Implement Carrier

● Create Spans

● Create Traces in application code

● Context propagation in libraries and application code

subscribe()● Manually instrument untraced middleware

20

● Relationships are usually: FollowsFrom

Page 21: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

What does it look like?

21

Request - Response

Message/Streaming (async)

Page 22: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

The interesting bit

22

Async event handling

Fan-out message flow

Intra-process latency (nodes)

Inter-process latency (edge)

Sub-system latency

Page 23: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

What we do with it?

● Issue analysis (debugging, triaging)

● Metrics extraction (and events)

● Basis for (sub)system Service Level Indicators

● Advanced analysis

○ Identifying dependencies

○ Interactive Jupyter notebook powered workflows

○ Join with other telemetry data (logs)

23

Page 24: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

Challenges

24

Page 25: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Volume!

● A span is large! (let's say 1KB)

● We generate 500 Million spans a day

● Storage for 30 days 15 Billion spans

Using cloud databases you would be looking at approx $20,000/month

● A lot of data to analyse

● Why not sample?

{ "finish": "2018-03-12T13:49:47.558+00:00", "intTags.uuid": 123456, "operation": "paintSubscriptionData", "origin": { "cluster": "tsprodcluster", "fqdn": "node1234", "language": "c++", "lib": "dtr", "libVersion": "dtr-1.2.3", "parentCluster": "tscluster", "process": "FrontEndPublisherOrders.tsk", "processId": 54321, "schemaVersion": "dtmsg-1.1.1", "threadId": 1234567890 }, "references": [ { "referenceType": "followsFrom", "spanId": "87654321", "traceId": "abcdef12-87654321-abcdef12-87654321" } ], "sampled": false, "spanId": "12131415", "start": "2018-03-12T13:49:47.538+00:00", "stringTags.modelId": "1", "traceId": "abcdef12-87654321-abcdef12-87654321"}

25

Page 26: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

#1 Message Fan-Out (broadcast)

• Inefficient Distribution

• Late stage filtering (up to 80% discard)

• Redundancy (hot/warm replicas)

• Results in Noisy traces!

Solution: Cancel the Span Collection

26

def handle_message(message): span = scope_manager.get_current_span() if not is_relevant(message): span.cancel() _ return # ...

Page 27: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

#2 Splitting Messages

Solution: Create new spans

“Dispatch” Spans

Part 1

Part 2

Part 3

• Multi-part messages

• Can take different paths

Chunk 1

Chunk 2

Chunk 3

• Large messages (initial paint)

• Handled independently (timing)

27

Service

Service1

Service2

Service3

t1

t2

t4

Publish

OnMsg1

OnMsg2

OnMsg1

Part1 Part2

Page 28: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

Service 1 Service 2

#3 Message Conflation

Solution: Use “conflation” spans

• Multiple upstream sources

• High rate of messages

• Often only last value relevant

• Data is often flattened/conflated

• Published on interval “flush()”

28

HandleA

Flush

PublishA

PublishB

HandleB

• The “distant relationship”

• Tooling support? Trace View

Page 29: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

#4 Increasing Granularity

• Spans are Expensive

• Tags/Attributes are not

Solution: Span-like Tag Semantics

• TimeSpans - Linear Span

• CheckPoints - Time Reading

from traceutil import timespan, checkpoint _

def handle_message(message): span = scope_manager.get_current_span() with timespan("DoingSomething", span): _ do_something() do_something_else() checkpoint("DoneSomethingElse", span) _ # ...

{ "operationName" : "publishMessage", ... "tags" : [

{ "key" : "convertMessage_finish", "value" : "20AUG2019_09:30:08.710575+0000"},{ "key" : "convertMessage_start", "value" : "20AUG2019_09:30:08.710537+0000"},{ "key" : "publishMessageInterleaver_checkpoint", "value" : "20AUG2019_09:30:08.710400+0000"}

]}

29

Page 30: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

#5 - Sampling

• Head-based (trace creation time)

Service A Service B Service C

Span 1:1 Span 1:2 Span 1:3

Span 2:1 Span 2:2 Span 2:3

Span 3:1 Span 3:2 Span 3:3

Sampling Decision Sampling Decision Sampling Decision

Sampling Decision

Tracing Backend

• Unitary (specific components)

• Tail-based (once trace is known)

30

Sampling Decision

Page 31: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

What’s Next?

31

Page 32: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

What’s Next?

● Trace More!

● Tracing the Client-Side (heavily async)

● Near real-time trace processing

○ Derive Metrics (and events)

○ Alerting

32

Page 33: SRE @ Bloomberg Tracing Real-time · © 2019 Bloomberg Finance L.P. All rights reserved. What is the Bloomberg Terminal? 4

© 2019 Bloomberg Finance L.P. All rights reserved.

33

We are hiring!https://www.bloomberg.com/careers

Thank you for listening!

Questions?