Upload
gary-bridges
View
212
Download
0
Embed Size (px)
Citation preview
Performance Debugging for
Distributed Systems of Black Boxes
Marcos K. AguileraJeffrey C. MogulJanet L. Wiener
HP Labs Patrick Reynolds, Duke
Athicha Muthitacharoen, MIT
WISP 200411 November 2004
page 220 October 2003 Project5 - SOSP
Example multi-tier system
client
web server
client
web serverweb server
database server
database server
application server
application server
authentication server
page 320 October 2003 Project5 - SOSP
Motivation
• Complex distributed systems are built from black box components
• These systems may have performance problems
• High or erratic latency
• Locating the causes of these problems is hard
• We can’t always examine or modify system components
• We need tools to infer where bottlenecks are
• Choose which black boxes to open
page 420 October 2003 Project5 - SOSP
Contributions of our work
• Tools to highlight which black boxes have problems
• Require only passive information, such as packet traces
• Infer where most of time is spent from traces
• Person can then use more invasive tools to examine those boxes
• Reduce time and cost to debug complex systems
• Improve quality of delivered systems
page 520 October 2003 Project5 - SOSP
Example causal path
client
web server
client
web serverweb server
database server
database server
application server
application server
authentication server
100ms
page 620 October 2003 Project5 - SOSP
Goals of our tools
• Find high-impact causal paths through a distributed systemCausal path: series of nodes that sent/received
messages– Each message is caused by receipt of previous
message– Some causal paths occur many times
High impact:– Occurs frequently– Contributes significantly to overall latency
• Without modifications or semantic knowledge• Report per-node latencies on causal paths
page 720 October 2003 Project5 - SOSP
Overview of our approach
• Obtain traces of messages between components– Ethernet packets, middleware messages, etc.– Collect traces as non-invasively as possible
• Analyze traces using algorithms
• Visualize results and highlight high-impact paths
• Requires very little information: [timestamp, source, destination]
page 820 October 2003 Project5 - SOSP
Outline
• Problem statement & goals• Overview of our approach• Algorithm• Experimental results• Related work• Conclusions
page 920 October 2003 Project5 - SOSP
The convolution algorithm: input
Time From To
0.01 A B
0.02 A B
0.04 B D
0.05 C F
...
page 1020 October 2003 Project5 - SOSP
The convolution algorithm: output
A
C DB
E FE FFE
G G G G G G
.15.10 0 0
.10.10
0 0 0 0 00
page 1120 October 2003 Project5 - SOSP
Basic idea
• Creates a “time signal” for messages from each node
• Given time signals S1(t)=(AB) and S2(t)=(BX)
Computes convolution of S2(t) and S1(–t) = S1 * S2
(can be computed quickly using fast fourier transforms)
S1(t)=(AB msgs)
1 2 3 4 5 6 7 time
page 1220 October 2003 Project5 - SOSP
S1(t)=(AB msgs)
S2(t)=(BX msgs)
S1 * S2=conv(S2(t), S1(-t))
• Spikes suggest causality between ASpikes suggest causality between AB and BB and BX msgsX msgs• Time shift of a spike indicates its characteristic delayTime shift of a spike indicates its characteristic delay
page 1320 October 2003 Project5 - SOSP
Details: first step
• Choose starting node A• Use trace to add edges from it
Time From To 0.01 A B 0.02 A B 0.04 A C 0.05 A B
A
B C
page 1420 October 2003 Project5 - SOSP
Continuing
Time From To … B D … B E … B F … B G
A
B C
??
(AB)*(BE) (AB)*(BD)
d
page 1520 October 2003 Project5 - SOSP
How
Time From To t1 A B t2 A B t3 A B t4 A B
Time From To
… t1+d B D … t2+d B D … t3+d B D t3+d B E … t4+d B D
page 1620 October 2003 Project5 - SOSP
Heuristic to find spikes
threshold 1: n1 stddev over meanthreshold 2: n2 stddev over mean
n1 = 2n2 = 1.5
page 1720 October 2003 Project5 - SOSP
Recursing to continue
• Observations: 1. (BD) are not all msgs from B to D (only those caused by A)
2. Stop recursion when too few messages left or no more spikes found
A
B
D
d
??
page 1820 October 2003 Project5 - SOSP
Outline
• Problem statement & goals• Overview of our approach• Algorithm• Experimental results• Conclusions
page 1920 October 2003 Project5 - SOSP
Results: email service delays
• Jeff logged all email headers for two months • Parsed 80K Received headers in 12K messagesReceived: from cceexg11.americas.cpqcorp.net ...by wera.hpl.hp.com ... ; Fri, 4 Apr 2003 15:35:54 -0800
– Yields (timestamp, sender, receiver) trace records• Used Convolution Algorithm to
– Reconstruct message paths– Find typical delays
• Note: this is NOT the most direct way to use email headers– We made the problem harder so as to test our algorithm
page 2020 October 2003 Project5 - SOSP
Email trace: output
60
39 6737
40 3840 383840
41 41 41 41 41 41
4890,15
7380,10
4600
7660 523
0,10
7680,10
4780
5940
4390
6260
5120
6350
page 2120 October 2003 Project5 - SOSP
Results: Petstore
• Sun’s demo application for J2EE
• Stanford’s PinPoint project provided us with traces– One trace has a node that is
artificially slowed down
page 2220 October 2003 Project5 - SOSP
Future work
•Automate trace gathering and conversion•Sliding-window versions of algorithms
– Find phased behavior– Reduce memory usage of nesting algorithm– Improve speed of convolution algorithm
•Validate usefulness on more complicated systems
•What are limits of our approach?
page 2320 October 2003 Project5 - SOSP
Conclusion
• Looking for bottlenecks in black box systems
• Use signal processing techniques to find causal pathsin the network and its delays
• For more information– http://www.hpl.hp.com/research/project5/
• Contact us if you have multi-hop message traces!