
Skua: Extending Distributed Tracing Vertically into the Linux Kernel

Harshal Sheth and Andrew Sun

DevConf 2018



Distributed Systems

• Complex applications are no longer monolithic

• Modular/agile development

• Continuous deployment

• Independent scaling

• Increasingly seen in large companies

• Hard to debug

Twitter, 2013

Example Distributed System

[Animated diagram, built up over several slides: a request arrives at the Frontend and travels along request pathways to Web Results, Images, PageRank, and VisualRank. Legend: request; application within a distributed system; request pathway. When something goes wrong — #*$@%! — question marks appear all along the pathways: it is unclear which application is at fault.]

Distributed Tracing

• Monitoring and troubleshooting distributed systems

• Discovering latency issues

• Graphing service dependencies

• Root-cause analysis of backend issues

• Tracing a specific request through the entire system


What does distributed tracing miss?

• There's more to performance than meets the eye of existing distributed tracing tools

• Contention between applications

• Kernel bugs

• Security patches (e.g. Meltdown/Spectre)

• Can we gain visibility into these issues via the kernel?

[Diagram: a User Application, containing a Distributed Tracing Client, running on top of the Kernel.]

Our goal: extend distributed tracing into the kernel

Our Approach

Jaeger (distributed tracing framework from Uber) + LTTng (Linux kernel trace toolkit) = Skua

Traces and Spans

[Animated timeline diagram, built up over several slides: spans for Frontend, Web Results, Images, PageRank, and VisualRank laid out over time.]

Each span carries a context: trace ID, parent ID, span ID, and a sampled flag (🎲 — the sampling decision is made once, at the root).

• Frontend Request — Trace ID: 1337, Parent ID: (none), Span ID: 7893

• Web Results — Trace ID: 1337, Parent ID: 7893, Span ID: 3460

• Images — Trace ID: 1337, Parent ID: 7893, Span ID: 8652

• PageRank — Trace ID: 1337, Parent ID: 3460, Span ID: 1231

• VisualRank — Trace ID: 1337, Parent ID: 8652, Span ID: 3460

Completed spans are reported to a Span Aggregator.

Jaeger


https://eng.uber.com/distributed-tracing/
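The context fields on the Traces and Spans slides can be sketched in a few lines of C++. This is an illustrative model, not Jaeger's actual API: a child span keeps the parent's trace ID and sampling decision, records the parent's span ID as its own parent ID, and receives a fresh span ID.

```cpp
#include <cstdint>

// Minimal span context, mirroring the fields on the slide:
// trace ID, parent ID, span ID, and the sampled flag.
struct SpanContext {
    uint64_t trace_id;
    uint64_t parent_id;  // 0 means "no parent" (root span)
    uint64_t span_id;
    bool sampled;
};

// Start a root span: the sampling decision (the dice roll) is made
// here and is inherited by every descendant span.
inline SpanContext start_root(uint64_t trace_id, uint64_t span_id, bool sampled) {
    return SpanContext{trace_id, 0, span_id, sampled};
}

// Start a child span: same trace ID, parent ID = parent's span ID,
// fresh span ID, inherited sampling decision.
inline SpanContext start_child(const SpanContext& parent, uint64_t new_span_id) {
    return SpanContext{parent.trace_id, parent.span_id, new_span_id, parent.sampled};
}
```

With the slide's numbers: the Frontend root span is (trace 1337, span 7893), and the Web Results child becomes (trace 1337, parent 7893, span 3460).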

Skua Design

[Diagram, built up in numbered steps: the User Application, with an embedded Jaeger Client, reports spans to the Jaeger Framework and makes syscalls into the Linux Kernel. Inside the kernel, a Kernel Module using procfs receives the trace context (associated with the task_struct); the LTTng Kernel Modules record kernel events; and an LTTng Adapter forwards the kernel data to the Jaeger Framework.]

Skua Details

• Jaeger C++ client sends its context into the kernel

• Treats the Linux kernel as the next level of the span hierarchy

• Each syscall is considered a span

• Tracepoint events become span logs

• LTTng kernel modules tag each span and log with the context information

• Custom adapter sends kernel data into Jaeger

[Diagram labels: 80 LOC, 25 LOC, 250 LOC — approximate sizes of the components.]
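One way the modified client might hand its context to the kernel module is sketched below. Jaeger's text propagation format is `trace-id:span-id:parent-span-id:flags`, hex-encoded; the procfs file path here is a hypothetical stand-in, since the slides don't name the actual file Skua's kernel module exposes.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Serialize a trace context in the style of Jaeger's text propagation
// format: "trace-id:span-id:parent-span-id:flags", all hex-encoded.
std::string serialize_context(uint64_t trace_id, uint64_t span_id,
                              uint64_t parent_id, unsigned flags) {
    char buf[80];
    std::snprintf(buf, sizeof(buf), "%" PRIx64 ":%" PRIx64 ":%" PRIx64 ":%x",
                  trace_id, span_id, parent_id, flags);
    return std::string(buf);
}

// Hand the serialized context to the kernel module via procfs.
// The path is hypothetical; Skua's module may expose a different file.
bool send_context_to_kernel(const std::string& ctx,
                            const char* path = "/proc/skua/context") {
    std::FILE* f = std::fopen(path, "w");
    if (!f) return false;  // module not loaded; tracing degrades gracefully
    bool ok = std::fputs(ctx.c_str(), f) >= 0;
    std::fclose(f);
    return ok;
}
```

Propagating the context this way costs one extra syscall per span, which is consistent with the overhead discussed in the benchmarks.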

Evaluation


Correctness: Setup

• Small C++ program

• Spawns a few threads

• Makes 10 different syscalls

• Verifies that Skua is correctly recording syscalls
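A minimal version of such a tester might look like the following. This is a sketch, not the actual Skua test program: each thread issues a handful of the syscalls listed in the backup slides and counts successes, so the total can be checked against the spans that show up in Jaeger.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#include <thread>
#include <vector>

// Issue a handful of identifiable syscalls from one thread; return how
// many succeeded (8 on a typical Linux system).
int issue_syscalls() {
    int ok = 0;
    if (getpid() > 0) ++ok;
    if (getppid() >= 0) ++ok;
    struct timeval tv;
    if (gettimeofday(&tv, nullptr) == 0) ++ok;  // may be vDSO, not a real syscall
    struct timespec ts{0, 1000000};             // 1 ms
    if (nanosleep(&ts, nullptr) == 0) ++ok;
    int fd = open("/dev/null", O_WRONLY);
    if (fd >= 0) {
        ++ok;
        struct stat st;
        if (fstat(fd, &st) == 0) ++ok;
        if (write(fd, "x", 1) == 1) ++ok;
        if (close(fd) == 0) ++ok;
    }
    return ok;
}

// Spawn a few threads, each making the syscalls above; the spans recorded
// in Jaeger can then be compared against the expected total.
int run_tester(int nthreads) {
    std::vector<std::thread> threads;
    std::vector<int> counts(nthreads, 0);
    for (int i = 0; i < nthreads; ++i)
        threads.emplace_back([&counts, i] { counts[i] = issue_syscalls(); });
    for (auto& t : threads) t.join();
    int total = 0;
    for (int c : counts) total += c;
    return total;
}
```

Note the gettimeofday comment: as the Results slide observes, vDSO calls never enter the kernel, so they cannot appear in the kernel-side trace.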

Results

• Syscalls recorded in Jaeger as spans

• Misses a few syscalls

• vDSO — gettimeofday

• LTTng instrumentation

• Tracepoint events recorded properly as logs

Performance Benchmarks

• Tests run on a KVM virtual machine assigned 24 vCPUs (2 × Intel Xeon X5670), 16 GB RAM, Linux kernel 4.15.14 with Skua modifications

• Traced 0.1% of requests

Benchmark Scenario | Program Instrumentation | Kernel Tracing via Modified LTTng
No Tracing | None | No
Unmodified Jaeger | Original Jaeger client | No
Jaeger + procfs | Jaeger client modified to send trace context into kernel | No
LTTng | None | Yes, but output filtered using LTTng filters
Skua | Jaeger client modified to send trace context into kernel | Yes, output sent to adapter

Tiny HTTP Server

• Created a small C++ Web server using uWebSockets

• Used benchmarking tool autocannon

• Sent 1 million GET requests using 10 connections

• Evaluated throughput and latency under different tracing scenarios


Web Server Throughput

Benchmark Scenario | Requests/Sec | Change vs. No Tracing
No Tracing | 11628 | baseline
Unmodified Jaeger | 11629 | 0.0%
Jaeger + procfs | 11111 | -4.0%
LTTng | 10310 | -11.3%
Skua | 10205 | -12.2%

Web Server Latency

Benchmark Scenario | Latency (ms) | Change vs. No Tracing
No Tracing | 0.33 | baseline
Unmodified Jaeger | 0.36 | +0.03
Jaeger + procfs | 0.45 | +0.12
LTTng | 0.41 | +0.08
Skua | 0.53 | +0.20

Fortunes Benchmark

• Code borrowed from TechEmpower's Web Frameworks benchmark

• Retrieves a list of fortunes from a database and renders an HTML page

• Uses Spring Boot, Kotlin, OpenJDK 10, PostgreSQL 10.4, OpenTracing integration

• "Real-world" Web application

• Similar benchmarking process, but autocannon run twice for JIT warmup and with 100 connections

Fortunes Throughput

Benchmark Scenario | Requests/Sec | Change vs. No Tracing
No Tracing | 5352 | baseline
Unmodified Jaeger | 5352 | 0.0%
Jaeger + procfs | 5352 | 0.0%
LTTng | 5184 | -3.1%
Skua | 5026 | -6.1%

Fortunes Latency

Benchmark Scenario | Latency (ms) | Change vs. No Tracing
No Tracing | 18.05 | baseline
Unmodified Jaeger | 18.04 | -0.01
Jaeger + procfs | 18.04 | -0.01
LTTng | 18.63 | +0.58
Skua | 19.23 | +1.18

Performance Overheads

• Unmodified Jaeger has a negligible impact on performance

• LTTng causes a moderate decrease in throughput and a small increase in latency

• This could be improved by enabling only a subset of the available instrumentation points

• Our modifications to Jaeger cause additional latency (depending on the scenario)

• Performing syscalls to propagate the trace context is expensive

• Ingestion of kernel events is more work

• In the tiny HTTP benchmark, Web server transactions took under 1 ms each, causing the latency impacts to appear large by comparison
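The overhead percentages on the benchmark charts can be recomputed from the raw throughput numbers with a small helper (an illustrative one-liner, not part of Skua):

```cpp
#include <cmath>

// Percent decrease in throughput relative to the "No Tracing" baseline,
// rounded to one decimal place as in the charts.
double overhead_pct(double baseline, double measured) {
    return std::round((baseline - measured) / baseline * 1000.0) / 10.0;
}
```

For example, LTTng's web-server throughput of 10310 req/s against the 11628 req/s baseline gives the chart's 11.3% decrease, and Skua's Fortunes throughput of 5026 req/s against 5352 req/s gives 6.1%.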

Future Work

• Improve performance

• Simplify installation process

• Adaptive sampling reconfiguration

• Attempt tracing Ceph with Skua


Logging What Matters: Just-In-Time Instrumentation And Tracing (Lily Sturmann, Emre Ates) — Friday, Aug. 17, 4:30pm

Tracing Ceph using Jaeger-Blkkin (Mania Abdi) — Saturday, Aug. 18, 12:00pm

Conclusions

• Can use distributed tracing to monitor and debug complex distributed systems

• Current open source distributed tracing frameworks miss kernel information

• Skua integrates LTTng kernel data with Jaeger tracing

• Skua has some impact on throughput and latency


https://github.com/docc-lab/skua

Acknowledgements

• Raja Sambasivan - Mentor


Backup Slides


Correctness Tester Syscalls

• getpid

• getppid

• gettid

• gettimeofday

• nanosleep

• open

• close

• write

• fstat

• futex


Existing Tracing Frameworks

• Dapper (2010) — Google

• Zipkin (2012) — Twitter

• Canopy (2017) — Facebook

• Jaeger (2017) — open sourced by Uber
