Skua: Extending Distributed Tracing Vertically into the Linux Kernel
Harshal Sheth and Andrew Sun
DevConf 2018
Distributed Systems
• Complex applications are no longer monolithic
• Modular/agile development
• Continuous deployment
• Independent scaling
• Increasingly seen in large companies
• Hard to debug
Twitter, 2013
Example Distributed System
[Figure: a request enters at the Frontend and travels along request pathways to the Web Results, Images, PageRank, and VisualRank applications. Legend: request; application within a distributed system; request pathway.]
Distributed Tracing
• Monitoring and troubleshooting distributed systems
• Discovering latency issues
• Graphing service dependencies
• Root-cause analysis of backend issues
• Tracing a specific request through the entire system
What does distributed tracing miss?
• There’s more to performance than meets the eye of existing distributed tracing tools
• Contention between applications
• Kernel bugs
• Security patches (e.g. Meltdown/Spectre)
• Can we gain visibility regarding these issues via the kernel?
Our goal: extend distributed tracing into the kernel
[Figure: User Application with its Distributed Tracing Client, above the Kernel.]
Our Approach
Jaeger (distributed tracing framework from Uber) + LTTng (Linux kernel trace toolkit) = Skua
Traces and Spans
[Figure: timeline of the spans created by Frontend, Web Results, PageRank, Images, and VisualRank for one request, with a Span Aggregator.]
Context: trace ID, parent ID, span ID, sampled
• Frontend Request: Trace ID 1337, Parent ID (none), Span ID 7893
• Web Results: Trace ID 1337, Parent ID 7893, Span ID 3460
• PageRank: Trace ID 1337, Parent ID 3460, Span ID 1231
• Images: Trace ID 1337, Parent ID 7893, Span ID 8652
• VisualRank: Trace ID 1337, Parent ID 8652, Span ID 3460
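The parent/child relationships on this slide can be sketched in a few lines of Python. This is only an illustration of how a context (trace ID, parent ID, span ID, sampled) might be handed from caller to callee; the `SpanContext` class and the `start_root` / `start_child` helpers are hypothetical names, not part of Jaeger's API.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: int             # shared by every span of one request
    span_id: int              # unique to this span
    parent_id: Optional[int]  # None for the root span
    sampled: bool             # sampling decision, inherited from the root


def new_id() -> int:
    # Four-digit IDs only to mirror the slide; real tracers use much wider random IDs.
    return random.randint(1000, 9999)


def start_root(sampled: bool = True) -> SpanContext:
    # The first service to see the request creates the trace.
    return SpanContext(trace_id=new_id(), span_id=new_id(), parent_id=None, sampled=sampled)


def start_child(parent: SpanContext) -> SpanContext:
    # A callee keeps the trace ID and sampled flag and records its caller's span ID as parent.
    return SpanContext(parent.trace_id, new_id(), parent.span_id, parent.sampled)


frontend = start_root()
web_results = start_child(frontend)
page_rank = start_child(web_results)
images = start_child(frontend)
```

Here `page_rank.parent_id` equals `web_results.span_id`, matching the PageRank entry whose Parent ID is the Web Results Span ID.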
Jaeger
https://eng.uber.com/distributed-tracing/
Skua Design
[Figure: the User Application and its Jaeger Client run above the Linux Kernel and issue syscalls. Inside the kernel are a Kernel Module using procfs, task_struct, and the LTTng Kernel Modules. Kernel Events flow to an LTTng Adapter, which feeds the Jaeger Framework. Steps in the diagram are numbered 1 to 6.]
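As a rough illustration of the "Kernel Module using procfs" path in the design, the sketch below encodes a trace context and writes it to a file with a single write. Everything specific here is an assumption: the colon-separated hex encoding is modeled on Jaeger's `uber-trace-id` header layout rather than taken from Skua, the `encode_context` and `send_context` helpers are hypothetical, and a temporary file stands in for whatever procfs entry the kernel module actually exposes.

```python
import os
import tempfile


def encode_context(trace_id: int, span_id: int, parent_id: int, sampled: bool) -> str:
    # Colon-separated hex fields: trace ID, span ID, parent ID, flags (1 = sampled).
    flags = 1 if sampled else 0
    return f"{trace_id:x}:{span_id:x}:{parent_id:x}:{flags:x}"


def send_context(path: str, encoded: str) -> None:
    # One open/write/close sequence per context change. With a real procfs entry,
    # the kernel module's write handler would receive these bytes.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, encoded.encode("ascii"))
    finally:
        os.close(fd)


# Stand-in for the procfs entry so the sketch runs on any machine.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

send_context(demo_path, encode_context(1337, 7893, 0, True))
with open(demo_path) as f:
    written = f.read()
os.unlink(demo_path)
```

Each context update costs extra syscalls in this scheme, which is one reason such propagation can show up in latency measurements.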
Skua Details
• Jaeger C++ client sends its context into the kernel
• Treats the Linux kernel as the next level of the span hierarchy
  • Each syscall is considered a span
  • Tracepoint events become span logs
• LTTng kernel modules tag each span and log with the context information
• Custom adapter sends kernel data into Jaeger
80 LOC
25 LOC
250 LOC
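The mapping described above (each syscall becomes a span, other tracepoint events become logs on that span) could look roughly like the following. This is a sketch under stated assumptions, not Skua's adapter: the event dictionaries, the `KernelSpan` class, and `events_to_spans` are hypothetical, and only the `syscall_entry_*` / `syscall_exit_*` naming follows LTTng's kernel event names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class KernelSpan:
    name: str       # syscall name, e.g. "write"
    trace_id: int   # copied from the context tagged onto the event
    parent_id: int  # user-space span that was active when the syscall began
    start: int
    end: Optional[int] = None
    logs: List[Tuple[int, str]] = field(default_factory=list)


def events_to_spans(events: List[dict]) -> List[KernelSpan]:
    open_spans: Dict[int, KernelSpan] = {}  # at most one in-flight syscall per thread ID
    finished: List[KernelSpan] = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        tid, name = ev["tid"], ev["name"]
        if name.startswith("syscall_entry_"):
            syscall = name[len("syscall_entry_"):]
            open_spans[tid] = KernelSpan(syscall, ev["trace_id"], ev["span_id"], ev["ts"])
        elif name.startswith("syscall_exit_"):
            span = open_spans.pop(tid, None)
            if span is not None:
                span.end = ev["ts"]
                finished.append(span)
        elif tid in open_spans:
            # Any other tracepoint seen during the syscall becomes a log entry.
            open_spans[tid].logs.append((ev["ts"], name))
    return finished


ctx = {"trace_id": 1337, "span_id": 7893}
events = [
    {"ts": 10, "tid": 1, "name": "syscall_entry_write", **ctx},
    {"ts": 12, "tid": 1, "name": "sched_switch", **ctx},
    {"ts": 15, "tid": 1, "name": "syscall_exit_write", **ctx},
]
spans = events_to_spans(events)
```

The resulting span is named `write`, has the user-space span 7893 as its parent, and carries one log entry for the `sched_switch` event.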
Evaluation
Correctness
Setup
• Small C++ program
• Spawns a few threads
• Makes 10 different syscalls
• Verifies that Skua is correctly recording syscalls
Results
• Syscalls recorded in Jaeger as spans
• Misses a few syscalls
  • vDSO (e.g. gettimeofday)
  • LTTng instrumentation
• Tracepoint events recorded properly as logs
Performance Benchmarks
• Tests ran on a KVM virtual machine with 24 vCPUs (2 × Intel Xeon X5670), 16 GB RAM, and Linux kernel 4.15.14 with Skua modifications
• Traced 0.1% of requests

Benchmark Scenario | Program Instrumentation | Kernel Tracing via Modified LTTng
No Tracing | None | No
Unmodified Jaeger | Original Jaeger client | No
Jaeger + procfs | Jaeger client modified to send trace context into kernel | No
LTTng | None | Yes, but output filtered using LTTng filters
Skua | Jaeger client modified to send trace context into kernel | Yes, output sent to adapter
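Tracing 0.1% of requests amounts to a probabilistic sampling decision, typically made once for the root span and then carried along in the context's sampled flag. The sketch below shows the idea only; `should_sample` and `SAMPLE_RATE` are hypothetical names, and this is not the sampler code used in the benchmarks.

```python
import random

SAMPLE_RATE = 0.001  # 0.1% of requests, the rate used in these benchmarks


def should_sample(rng: random.Random, rate: float = SAMPLE_RATE) -> bool:
    # rng.random() is uniform on [0, 1), so this is True with probability `rate`.
    return rng.random() < rate


rng = random.Random(42)  # fixed seed so the demonstration is repeatable
decisions = [should_sample(rng) for _ in range(100_000)]
sampled_fraction = sum(decisions) / len(decisions)
```

Over many requests `sampled_fraction` settles near 0.001, so only about one request in a thousand pays the cost of recording spans and kernel events.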
Tiny HTTP Server
• Created a small C++ Web server using uWebSockets
• Used benchmarking tool autocannon
• Sent 1 million GET requests using 10 connections
• Evaluated throughput and latency under different tracing scenarios
Web Server Throughput (requests/sec; decrease relative to No Tracing in parentheses)
• No Tracing: 11628
• Unmodified Jaeger: 11629 (0.0%)
• Jaeger + procfs: 11111 (4.0%)
• LTTng: 10310 (11.3%)
• Skua: 10205 (12.2%)
Web Server Latency (ms; change relative to No Tracing in parentheses)
• No Tracing: 0.33
• Unmodified Jaeger: 0.36 (+0.03)
• Jaeger + procfs: 0.45 (+0.12)
• LTTng: 0.41 (+0.08)
• Skua: 0.53 (+0.20)
Fortunes Benchmark
• Code borrowed from TechEmpower’s Web Frameworks benchmark
• Retrieves list of fortunes from database and renders HTML page
• Uses Spring Boot, Kotlin, OpenJDK 10, PostgreSQL 10.4, OpenTracing integration
• “Real-world” Web application
• Similar benchmarking process, but with autocannon run twice for JIT warmup and using 100 connections
Fortunes Throughput (requests/sec; decrease relative to No Tracing in parentheses)
• No Tracing: 5352
• Unmodified Jaeger: 5352 (0.0%)
• Jaeger + procfs: 5352 (0.0%)
• LTTng: 5184 (3.1%)
• Skua: 5026 (6.1%)
Fortunes Latency (ms; change relative to No Tracing in parentheses)
• No Tracing: 18.05
• Unmodified Jaeger: 18.04 (-0.01)
• Jaeger + procfs: 18.04 (-0.01)
• LTTng: 18.63 (+0.58)
• Skua: 19.23 (+1.18)
Performance Overheads
• Unmodified Jaeger has a negligible impact on performance
• LTTng causes a moderate decrease in throughput and a small increase in latency
  • This could be improved by enabling a subset of available instrumentation points
• Our modifications to Jaeger cause additional latency (depending on scenario)
  • Performing syscalls to propagate the trace context is expensive
  • Ingestion of kernel events is more work
• In the tiny HTTP benchmark, Web server transactions took under 1ms each, causing the latency impacts to appear large by comparison
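The relative overheads quoted on the charts can be recomputed from the raw bar values. The sketch below does this for the Fortunes benchmark, assuming the bar labels map to scenarios as No Tracing = 5352 req/s and 18.05 ms, LTTng = 5184 req/s and 18.63 ms, Skua = 5026 req/s and 19.23 ms; the helper names are illustrative.

```python
def throughput_drop_pct(baseline: float, measured: float) -> float:
    # Percentage of baseline throughput lost under a tracing scenario.
    return (baseline - measured) / baseline * 100.0


def latency_delta_ms(baseline: float, measured: float) -> float:
    # Absolute latency added by a tracing scenario, in milliseconds.
    return measured - baseline


# Fortunes benchmark values, under the mapping assumed above.
throughput = {"No Tracing": 5352, "LTTng": 5184, "Skua": 5026}
latency = {"No Tracing": 18.05, "LTTng": 18.63, "Skua": 19.23}

for scenario in ("LTTng", "Skua"):
    drop = throughput_drop_pct(throughput["No Tracing"], throughput[scenario])
    delta = latency_delta_ms(latency["No Tracing"], latency[scenario])
    print(f"{scenario}: {drop:.1f}% lower throughput, {delta:+.2f} ms latency")
```

Reporting throughput as a percentage and latency as an absolute difference explains why sub-millisecond transactions in the tiny HTTP benchmark make the same added cost look proportionally larger.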
Future Work
• Improve performance
• Simplify installation process
• Adaptive sampling reconfiguration
• Attempt tracing Ceph with Skua
Logging What Matters: Just-In-Time Instrumentation And Tracing. Lily Sturmann, Emre Ates. Friday, Aug. 17, 4:30pm
Tracing Ceph using Jaeger-Blkkin. Mania Abdi. Saturday, Aug. 18, 12:00pm
Conclusions
• Can use distributed tracing to monitor and debug complex distributed systems
• Current open source distributed tracing frameworks miss kernel information
• Skua integrates LTTng kernel data with Jaeger tracing
• Skua has some impact on throughput and latency
https://github.com/docc-lab/skua
Acknowledgements
• Raja Sambasivan - Mentor
Backup Slides
Correctness Tester Syscalls
• getpid
• getppid
• gettid
• gettimeofday
• nanosleep
• open
• close
• write
• fstat
• futex
Existing Tracing Frameworks
• Dapper (2010) - Google
• Zipkin (2012) - Twitter
• Canopy (2017) - Facebook
• Jaeger (2017) - open sourced by Uber