23
Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha Muthitacharoen, MIT WISP 2004 11 November 2004

Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

Embed Size (px)

Citation preview

Page 1: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

Performance Debugging for

Distributed Systems of Black Boxes

Marcos K. AguileraJeffrey C. MogulJanet L. Wiener

HP Labs Patrick Reynolds, Duke

Athicha Muthitacharoen, MIT

WISP 200411 November 2004

Page 2: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 220 October 2003 Project5 - SOSP

Example multi-tier system

client

web server

client

web serverweb server

database server

database server

application server

application server

authentication server

Page 3: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 320 October 2003 Project5 - SOSP

Motivation

• Complex distributed systems are built from black box components

• These systems may have performance problems

• High or erratic latency

• Locating the causes of these problems is hard

• We can’t always examine or modify system components

• We need tools to infer where bottlenecks are

• Choose which black boxes to open

Page 4: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 420 October 2003 Project5 - SOSP

Contributions of our work

• Tools to highlight which black boxes have problems

• Require only passive information, such as packet traces

• Infer where most of time is spent from traces

• Person can then use more invasive tools to examine those boxes

• Reduce time and cost to debug complex systems

• Improve quality of delivered systems

Page 5: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 520 October 2003 Project5 - SOSP

Example causal path

client

web server

client

web serverweb server

database server

database server

application server

application server

authentication server

100ms

Page 6: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 620 October 2003 Project5 - SOSP

Goals of our tools

• Find high-impact causal paths through a distributed systemCausal path: series of nodes that sent/received

messages– Each message is caused by receipt of previous

message– Some causal paths occur many times

High impact:– Occurs frequently– Contributes significantly to overall latency

• Without modifications or semantic knowledge• Report per-node latencies on causal paths

Page 7: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 720 October 2003 Project5 - SOSP

Overview of our approach

• Obtain traces of messages between components– Ethernet packets, middleware messages, etc.– Collect traces as non-invasively as possible

• Analyze traces using algorithms

• Visualize results and highlight high-impact paths

• Requires very little information: [timestamp, source, destination]

Page 8: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 820 October 2003 Project5 - SOSP

Outline

• Problem statement & goals• Overview of our approach• Algorithm• Experimental results• Related work• Conclusions

Page 9: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 920 October 2003 Project5 - SOSP

The convolution algorithm: input

Time From To

0.01 A B

0.02 A B

0.04 B D

0.05 C F

...

Page 10: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1020 October 2003 Project5 - SOSP

The convolution algorithm: output

A

C DB

E FE FFE

G G G G G G

.15.10 0 0

.10.10

0 0 0 0 00

Page 11: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1120 October 2003 Project5 - SOSP

Basic idea

• Creates a “time signal” for messages from each node

• Given time signals S1(t)=(AB) and S2(t)=(BX)

Computes convolution of S2(t) and S1(–t) = S1 * S2

(can be computed quickly using fast fourier transforms)

S1(t)=(AB msgs)

1 2 3 4 5 6 7 time

Page 12: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1220 October 2003 Project5 - SOSP

S1(t)=(AB msgs)

S2(t)=(BX msgs)

S1 * S2=conv(S2(t), S1(-t))

• Spikes suggest causality between ASpikes suggest causality between AB and BB and BX msgsX msgs• Time shift of a spike indicates its characteristic delayTime shift of a spike indicates its characteristic delay

Page 13: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1320 October 2003 Project5 - SOSP

Details: first step

• Choose starting node A• Use trace to add edges from it

Time From To 0.01 A B 0.02 A B 0.04 A C 0.05 A B

A

B C

Page 14: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1420 October 2003 Project5 - SOSP

Continuing

Time From To … B D … B E … B F … B G

A

B C

??

(AB)*(BE) (AB)*(BD)

d

Page 15: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1520 October 2003 Project5 - SOSP

How

Time From To t1 A B t2 A B t3 A B t4 A B

Time From To

… t1+d B D … t2+d B D … t3+d B D t3+d B E … t4+d B D

Page 16: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1620 October 2003 Project5 - SOSP

Heuristic to find spikes

threshold 1: n1 stddev over meanthreshold 2: n2 stddev over mean

n1 = 2n2 = 1.5

Page 17: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1720 October 2003 Project5 - SOSP

Recursing to continue

• Observations: 1. (BD) are not all msgs from B to D (only those caused by A)

2. Stop recursion when too few messages left or no more spikes found

A

B

D

d

??

Page 18: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1820 October 2003 Project5 - SOSP

Outline

• Problem statement & goals• Overview of our approach• Algorithm• Experimental results• Conclusions

Page 19: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 1920 October 2003 Project5 - SOSP

Results: email service delays

• Jeff logged all email headers for two months • Parsed 80K Received headers in 12K messagesReceived: from cceexg11.americas.cpqcorp.net ...by wera.hpl.hp.com ... ; Fri, 4 Apr 2003 15:35:54 -0800

– Yields (timestamp, sender, receiver) trace records• Used Convolution Algorithm to

– Reconstruct message paths– Find typical delays

• Note: this is NOT the most direct way to use email headers– We made the problem harder so as to test our algorithm

Page 20: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 2020 October 2003 Project5 - SOSP

Email trace: output

60

39 6737

40 3840 383840

41 41 41 41 41 41

4890,15

7380,10

4600

7660 523

0,10

7680,10

4780

5940

4390

6260

5120

6350

Page 21: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 2120 October 2003 Project5 - SOSP

Results: Petstore

• Sun’s demo application for J2EE

• Stanford’s PinPoint project provided us with traces– One trace has a node that is

artificially slowed down

Page 22: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 2220 October 2003 Project5 - SOSP

Future work

•Automate trace gathering and conversion•Sliding-window versions of algorithms

– Find phased behavior– Reduce memory usage of nesting algorithm– Improve speed of convolution algorithm

•Validate usefulness on more complicated systems

•What are limits of our approach?

Page 23: Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha

page 2320 October 2003 Project5 - SOSP

Conclusion

• Looking for bottlenecks in black box systems

• Use signal processing techniques to find causal pathsin the network and its delays

• For more information– http://www.hpl.hp.com/research/project5/

• Contact us if you have multi-hop message traces!