Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul...

Preview:

Citation preview

Performance Debugging for

Distributed Systems of Black Boxes

Marcos K. AguileraJeffrey C. MogulJanet L. Wiener

HP Labs Patrick Reynolds, Duke

Athicha Muthitacharoen, MIT

WISP 200411 November 2004

page 220 October 2003 Project5 - SOSP

Example multi-tier system

client

web server

client

web serverweb server

database server

database server

application server

application server

authentication server

page 320 October 2003 Project5 - SOSP

Motivation

• Complex distributed systems are built from black box components

• These systems may have performance problems

• High or erratic latency

• Locating the causes of these problems is hard

• We can’t always examine or modify system components

• We need tools to infer where bottlenecks are

• Choose which black boxes to open

page 420 October 2003 Project5 - SOSP

Contributions of our work

• Tools to highlight which black boxes have problems

• Require only passive information, such as packet traces

• Infer where most of time is spent from traces

• Person can then use more invasive tools to examine those boxes

• Reduce time and cost to debug complex systems

• Improve quality of delivered systems

page 520 October 2003 Project5 - SOSP

Example causal path

client

web server

client

web serverweb server

database server

database server

application server

application server

authentication server

100ms

page 620 October 2003 Project5 - SOSP

Goals of our tools

• Find high-impact causal paths through a distributed systemCausal path: series of nodes that sent/received

messages– Each message is caused by receipt of previous

message– Some causal paths occur many times

High impact:– Occurs frequently– Contributes significantly to overall latency

• Without modifications or semantic knowledge• Report per-node latencies on causal paths

page 720 October 2003 Project5 - SOSP

Overview of our approach

• Obtain traces of messages between components– Ethernet packets, middleware messages, etc.– Collect traces as non-invasively as possible

• Analyze traces using algorithms

• Visualize results and highlight high-impact paths

• Requires very little information: [timestamp, source, destination]

page 820 October 2003 Project5 - SOSP

Outline

• Problem statement & goals• Overview of our approach• Algorithm• Experimental results• Related work• Conclusions

page 920 October 2003 Project5 - SOSP

The convolution algorithm: input

Time From To

0.01 A B

0.02 A B

0.04 B D

0.05 C F

...

page 1020 October 2003 Project5 - SOSP

The convolution algorithm: output

A

C DB

E FE FFE

G G G G G G

.15.10 0 0

.10.10

0 0 0 0 00

page 1120 October 2003 Project5 - SOSP

Basic idea

• Creates a “time signal” for messages from each node

• Given time signals S1(t)=(AB) and S2(t)=(BX)

Computes convolution of S2(t) and S1(–t) = S1 * S2

(can be computed quickly using fast fourier transforms)

S1(t)=(AB msgs)

1 2 3 4 5 6 7 time

page 1220 October 2003 Project5 - SOSP

S1(t)=(AB msgs)

S2(t)=(BX msgs)

S1 * S2=conv(S2(t), S1(-t))

• Spikes suggest causality between ASpikes suggest causality between AB and BB and BX msgsX msgs• Time shift of a spike indicates its characteristic delayTime shift of a spike indicates its characteristic delay

page 1320 October 2003 Project5 - SOSP

Details: first step

• Choose starting node A• Use trace to add edges from it

Time From To 0.01 A B 0.02 A B 0.04 A C 0.05 A B

A

B C

page 1420 October 2003 Project5 - SOSP

Continuing

Time From To … B D … B E … B F … B G

A

B C

??

(AB)*(BE) (AB)*(BD)

d

page 1520 October 2003 Project5 - SOSP

How

Time From To t1 A B t2 A B t3 A B t4 A B

Time From To

… t1+d B D … t2+d B D … t3+d B D t3+d B E … t4+d B D

page 1620 October 2003 Project5 - SOSP

Heuristic to find spikes

threshold 1: n1 stddev over meanthreshold 2: n2 stddev over mean

n1 = 2n2 = 1.5

page 1720 October 2003 Project5 - SOSP

Recursing to continue

• Observations: 1. (BD) are not all msgs from B to D (only those caused by A)

2. Stop recursion when too few messages left or no more spikes found

A

B

D

d

??

page 1820 October 2003 Project5 - SOSP

Outline

• Problem statement & goals• Overview of our approach• Algorithm• Experimental results• Conclusions

page 1920 October 2003 Project5 - SOSP

Results: email service delays

• Jeff logged all email headers for two months • Parsed 80K Received headers in 12K messagesReceived: from cceexg11.americas.cpqcorp.net ...by wera.hpl.hp.com ... ; Fri, 4 Apr 2003 15:35:54 -0800

– Yields (timestamp, sender, receiver) trace records• Used Convolution Algorithm to

– Reconstruct message paths– Find typical delays

• Note: this is NOT the most direct way to use email headers– We made the problem harder so as to test our algorithm

page 2020 October 2003 Project5 - SOSP

Email trace: output

60

39 6737

40 3840 383840

41 41 41 41 41 41

4890,15

7380,10

4600

7660 523

0,10

7680,10

4780

5940

4390

6260

5120

6350

page 2120 October 2003 Project5 - SOSP

Results: Petstore

• Sun’s demo application for J2EE

• Stanford’s PinPoint project provided us with traces– One trace has a node that is

artificially slowed down

page 2220 October 2003 Project5 - SOSP

Future work

•Automate trace gathering and conversion•Sliding-window versions of algorithms

– Find phased behavior– Reduce memory usage of nesting algorithm– Improve speed of convolution algorithm

•Validate usefulness on more complicated systems

•What are limits of our approach?

page 2320 October 2003 Project5 - SOSP

Conclusion

• Looking for bottlenecks in black box systems

• Use signal processing techniques to find causal pathsin the network and its delays

• For more information– http://www.hpl.hp.com/research/project5/

• Contact us if you have multi-hop message traces!