30
Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah http://issg.cs.duke.edu/pip/ [email protected] Chip Killian Amin Vahdat Patrick Reynolds

Pip Detecting the Unexpected in Distributed Systems

Embed Size (px)

DESCRIPTION

Pip Detecting the Unexpected in Distributed Systems. Janet Wiener Jeff Mogul Mehul Shah. Patrick Reynolds. Chip Killian Amin Vahdat. http://issg.cs.duke.edu/pip/ [email protected]. Motivation. Distributed systems exhibit complex behaviors Some behaviors are unexpected - PowerPoint PPT Presentation

Citation preview

Page 1: Pip Detecting the Unexpected in Distributed Systems

PipDetecting the Unexpected in

Distributed SystemsJanet

WienerJeff Mogul

Mehul Shah

http://issg.cs.duke.edu/pip/

[email protected]

Chip KillianAmin

Vahdat

Patrick Reynolds

Page 2: Pip Detecting the Unexpected in Distributed Systems

page 2Pip - November 2005

Motivation

• Distributed systems exhibit complex behaviors• Some behaviors are unexpected

– Structural bugs• Placement or timing of processing and communication

– Performance problems• Throughput bottlenecks

• Over- or under-consumption of resources

• Unexpected interdependencies

• Parallel, inter-node behavior is hard to capture with serial, single-node tools– Not captured by traditional debuggers, profilers– Not captured by unstructured log files

Page 3: Pip Detecting the Unexpected in Distributed Systems

page 3Pip - November 2005

Motivation

Three target audiences:• Primary programmer

– Debugging or optimizing his/her own system

• Secondary programmer– Inheriting a project or joining a programming team– Learning how the system behaves

• Operator– Monitoring running system for unexpected

behavior– Performing regression tests after a change

Page 4: Pip Detecting the Unexpected in Distributed Systems

page 4Pip - November 2005

Motivation

• Programmers wish to examine and check system-wide behaviors– Causal paths– Components of end-to-end

delay– Attribution of resource

consumption

• Unexpected behavior might indicate a bug

Web server

App server

Database

500ms

2000page faults

Page 5: Pip Detecting the Unexpected in Distributed Systems

page 5Pip - November 2005

Pip overview

Pip:1. Captures events from a running

system2. Reconstructs behavior from events3. Checks behavior against expectations4. Displays unexpected behavior

• Both structure and resource violations

Goal: help programmers locate and explain bugs

Behaviormodel

Application

Application

Expectations

Pip checker

Unexpectedstructure

Resourceviolations

Pip explorer: visualization GUI

Page 6: Pip Detecting the Unexpected in Distributed Systems

page 6Pip - November 2005

Outline

• Expressing expected behavior• Building a model of actual behavior• Exploring application behavior• Results

– FAB– RanSub– SplitStream

Page 7: Pip Detecting the Unexpected in Distributed Systems

page 7Pip - November 2005

Describing application behavior

• Application behavior consists of paths– All events, on any node, related to one high-level

operation– Definition of a path is programmer defined– Path is often causal, related to a user request

WWW

App server

DB

Parse HTTP

Query

Send response

Run application

time

Page 8: Pip Detecting the Unexpected in Distributed Systems

page 8Pip - November 2005

Describing application behavior

• Within paths are tasks, messages, and notices– Tasks: processing with start and end points– Messages: send and receive events for any

communication• Includes network, synchronization (lock/unlock), and

timers

– Notices: time-stamped strings; essentially log entries

WWW

App server

DB

Parse HTTP

Query

Send response

Run application

time

“Request = /cgi/…” “2096 bytes in response”

“done with request 12”

Page 9: Pip Detecting the Unexpected in Distributed Systems

page 9Pip - November 2005

Expectations: Recognizers

• Application behavior consists of paths• Each recognizer matches paths

– A path can match more than one recognizer• A recognizer can be a validator, an invalidator, or

neither• Any path matching zero validators or at least one

invalidator is unexpected behavior: bug?validator CGIRequest task(“Parse HTTP”) limit(CPU_TIME, 100ms); notice(m/Request URL: .*/); send(AppServer); recv(AppServer);invalidator DatabaseError notice(m/Database error: .*/);

Page 10: Pip Detecting the Unexpected in Distributed Systems

page 10Pip - November 2005

Expectations: Recognizers language

• repeat: matches a ≤ n ≤ b copies of a block

• xor: matches any one of several blocks

• call: include another recognizer (macro)• future: block matches now or later

– done: force named block to match

repeat between 1 and 3 { … }

xor {branch: …branch: …

}

future F1 { … }…done(F1);

Page 11: Pip Detecting the Unexpected in Distributed Systems

page 11Pip - November 2005

Expectations: Aggregate expectations

• Recognizers categorize paths into sets• Aggregates make assertions about sets of paths

– Count, unique count, resource constraints– Simple math and set operators

assert(instances(CGIRequest) > 4);assert(max(CPU_TIME, CGIRequest) < 500ms);assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));

Page 12: Pip Detecting the Unexpected in Distributed Systems

page 12Pip - November 2005

Outline

• Expressing expected behavior• Building a model of actual behavior• Exploring application behavior• Results

Page 13: Pip Detecting the Unexpected in Distributed Systems

page 13Pip - November 2005

Building a behavior model

Sources of events:• Annotations in source code

– Programmer inserts statements manually• Annotations in middleware

– Middleware inserts annotations automatically– Faster and less error-prone

• Passive tracing or interposition– Easier, but less information

• Or any combination of the above

Model consists of paths constructed from events recorded by the running application

Page 14: Pip Detecting the Unexpected in Distributed Systems

page 14Pip - November 2005

Annotations

• Set path ID• Start/end task• Send/receive message• Notice

WWW

App server

DB

Parse HTTP

Query

Send response

Run application

time

“Request = /cgi/…” “2096 bytes in response”

“done with request 12”

Page 15: Pip Detecting the Unexpected in Distributed Systems

page 15Pip - November 2005

Automating expectations and annotations

• Expectations can be generated from behavior model– Create a recognizer for each actual

path– Eliminate repetition– Strike a balance between over- and

under-specification

• Annotations can be generated by middleware

• Automatic annotations in Mace, Sandstorm, J2EE, FAB– Several of our test systems use

Mace annotations

Behaviormodel

Application

Application

Expectations

Pip checker

Annotations

Unexpectedbehavior

Page 16: Pip Detecting the Unexpected in Distributed Systems

page 16Pip - November 2005

Checking expectations

Traces

Categorized paths

Reconciliation

Events database

Paths

Path construction

Expectation checking

Application

Application

For each path P For each recognizer R Does R match P?Check each aggregate

Expectations

Match start/end task, send/receive message

Organize events into causal paths

Page 17: Pip Detecting the Unexpected in Distributed Systems

page 17Pip - November 2005

Exploring behavior

• Expectations checker generates lists of valid and invalid paths

• Explore both sets– Why did invalid paths occur?– Is any unexpected behavior misclassified as

valid?• Insufficiently constrained expectations

• Pip may be unable to express all expectations

• Two ways to explore behavior– SQL queries over tables

• Paths, threads, tasks, messages, notices

– Visualization

Page 18: Pip Detecting the Unexpected in Distributed Systems

page 18Pip - November 2005

Timing and resource properties for one taskCausal view of path

Visualization: causal paths

Caused tasks, messages, and notices on that thread

Page 19: Pip Detecting the Unexpected in Distributed Systems

page 19Pip - November 2005

Visualization: communication graph

• Graph view of all host-to-host network traffic

Page 20: Pip Detecting the Unexpected in Distributed Systems

page 20Pip - November 2005

Visualization: performance graphs

• Plot per-task or per-path resource metrics– Cumulative distribution (CDF), probability density (PDF), or vs. time

• Click on a point to see its value and the task/path represented

Time (s)

Dela

y (

ms)

Page 21: Pip Detecting the Unexpected in Distributed Systems

page 21Pip - November 2005

Pip vs. printf

• Both record interesting events to check off-line– Pip imposes structure and automates checking– Generalizes ad hoc approaches

Pip printfNesting, causal order UnstructuredTime, path, and thread No contextCPU and I/O data No resource informationAutomatic verification using declarative language

Verification with ad hoc grep or expect scripts

SQL queries “Queries” using Perl scripts

Automatic generation for some middleware

Manual placement

Page 22: Pip Detecting the Unexpected in Distributed Systems

page 22Pip - November 2005

Results

• We have applied Pip to several distributed systems:– FAB: distributed block store– SplitStream: DHT-based multicast protocol– RanSub: tree-based protocol used to build higher-

level systems– Others: Bullet, SWORD, Oracle of Bacon

• We have found unexpected behavior in each system

• We have fixed bugs in some systems… and used Pip to verify that the behavior was fixed

Page 23: Pip Detecting the Unexpected in Distributed Systems

page 23Pip - November 2005

Results: SplitStream (DHT-based multicast protocol)

13 bugs found, 12 fixed– 11 found using expectations, 2 found using GUI

• Structural bug: some nodes have up to 25 children when they should have at most 18– This bug was fixed and later reoccurred– Root cause #1: variable shadowing– Root cause #2: failed to register a callback

• How discovered: first in the explorer GUI, confirmed with automated checking

Page 24: Pip Detecting the Unexpected in Distributed Systems

page 24Pip - November 2005

Results: FAB (distributed block store)

1 bug (so far), fixed– Four protocols checked: read, write, Paxos,

membership

• Performance bug: nodes seeking quorum call self and peers in arbitrary order– Should call self last, to overlap computation– For cached blocks, should call self second-to-last

Page 25: Pip Detecting the Unexpected in Distributed Systems

page 25Pip - November 2005

Results: RanSub (tree-based protocol)

2 bugs found, 1 fixed• Structural bug: during first round of communication, parent nodes send summary messages before hearing from all children– Root cause: uninitialized state variables

• Performance bug: linear increase in end-to-end delay for the first ~2 minutes– Suspected root cause: data structure listing all

discovered nodes

Page 26: Pip Detecting the Unexpected in Distributed Systems

page 26Pip - November 2005

Future work

•Further automation of annotations, tracing– Explore tradeoffs between black-box,

annotated behavior models

•Extensible annotations– Application-specific schema for notices

•Composable expectations for large systems

Page 27: Pip Detecting the Unexpected in Distributed Systems

page 27Pip - November 2005

Related work

• Expectations-based systems– PSpec [Perl, 1993]– Meta-level compilation [Engler, 2000]– Paradyn [Miller, 1995]

• Causal paths– Pinpoint [Chen, 2002]– Magpie [Barham, 2004]– Project5 [Aguilera, 2003]

• Model checking– MaceMC [Killian, 2006]– VeriSoft [Godefroid, 2005]

Page 28: Pip Detecting the Unexpected in Distributed Systems

page 28Pip - November 2005

Conclusions

• Finding unexpected behavior can help us find bugs– Both structure and performance bugs

• Expectations serve as a high-level external specification– Summary of inter-component behavior and timing– Regression test for structure and performance

• Some bugs not exposed by expectations can be found through exploring: queries and visualization

Page 29: Pip Detecting the Unexpected in Distributed Systems

Extra slides

Page 30: Pip Detecting the Unexpected in Distributed Systems

page 30Pip - November 2005

Resource metrics

• Real time• User time, system time

– CPU time = user + system– Busy time = CPU time / real time

• Major and minor page faults (paging and allocation)

• Voluntary and involuntary context switches• Message size and latency• Number of messages sent• Causal depth of path• Number of threads, hosts in path