Triage: Diagnosing Production Run Failures at the User’s Site
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou
Department of Computer Science
University of Illinois, Urbana-Champaign
Joseph Tucek CS-UIUC Page 2
Despite all of our effort, production runs still fail
  What do we do about these failures?
What is (currently) done about end-user failures?
  Dumps leave much manual effort to diagnose
  We still need to reproduce the bug
    This is hard, if not impossible, to do
Why on-site diagnosis of production run failures?
  Production run bugs are valuable
    Not caught in testing
    Potentially environment specific
    Causing real damage to end users
  We can’t diagnose production failures off-site
    Reproduction is hard
    The programmer doesn’t have the end-user environment
    Privacy concerns limit even the reports we do get
  We must diagnose at the end-user’s site
What do we mean by diagnosis?
  Diagnosis traces back to the underlying fault
  Core dumps tell you about the failure
  Bug detection tells you about some errors
  Existing diagnosis tools are offline
  [Figure: a trigger activates a fault (the root cause, e.g. a buggy line of code); the fault produces an error (incorrect state, e.g. a smashed stack); the error leads to a failure (service interruption).]
What do we need to perform diagnosis? (1)
  We need information about the failure
    What is the fault, the error, the propagation tree?
  Off-site:
    Repeatedly inspect the bug (e.g. with a debugger)
    We run analysis tools targeted at the failure, or at suspected failures
  Off-site techniques don’t work on-site
    Reproducing the bug is non-trivial
    We don’t know what specific failures will occur
    Existing analysis tools are too expensive
What do we need to perform diagnosis? (2)
  We need guidance as to what to do next
    What analysis should we perform, what is likely to work well, and what variables are interesting?
  Off-site:
    The programmer decides, based on past knowledge
  On-site, there is no programmer
    Any decisions as to action must be made automatically
What do we need to perform diagnosis? (3)
  We need to try “what-ifs” with the execution
    If we change this input, what happens?
    What if we skip this function?
  Off-site:
    Programmers run many input variations
    Even with differing code
  This is difficult on-site
    Most replay focuses on minimizing variance
    We can’t understand what the results mean
What does Triage contribute?
  Enables on-site diagnosis
    Uses systems techniques to make offline analysis tools feasible on-site
    Addresses the three previous challenges
  Allows a new technique, delta analysis
  Human study
    Real programmers and real bugs
    Shows large savings in time-to-fix
Overview
  Introduction
  Addressing the three challenges
  Diagnosis process & design
  Experimental results
    Human study
    Overhead
  Related work
  Conclusions
Getting information about the failure
  Checkpoint/re-execution can capture the bug
    The environment, input, memory state, etc.
    Everything we need to reproduce the bug
  Benefits:
    We can relive the failure over and over
    Dynamically plug in analysis tools “on-demand”
      Makes the expensive cheap
    Normal-run overhead is low too
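As a rough illustration of "reliving the failure over and over", the sketch below uses POSIX fork() as a stand-in for an in-memory checkpoint: each replay runs in a copy of the process, so a crash in the copy leaves the original state intact. This is an assumption made for illustration, not Triage's actual checkpointing subsystem; replay_in_copy and the three sample steps are hypothetical names.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sample steps: one succeeds, one reports an error code,
   and one dies from a signal, standing in for a crashing code region. */
static int passing_step(void)  { return 0; }
static int failing_step(void)  { return 3; }
static int crashing_step(void) { raise(SIGKILL); return 0; }

/* Run `step` in a forked copy of the current process.
   Returns the copy's exit code (0..255), or -1 if fork/waitpid failed
   or the copy terminated abnormally (e.g. was killed by a signal). */
int replay_in_copy(int (*step)(void))
{
    assert(step != NULL);

    pid_t pid = fork();            /* "checkpoint": duplicate process state */
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(step() & 0xff);      /* child: re-execute, then report result */

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the parent never runs the step itself, replay_in_copy(crashing_step) can be called repeatedly, which is the property the slide relies on for plugging in analysis tools on demand.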
Guidance about what to do next
  A human-like diagnosis protocol can guide the diagnosis process
  Repeated replay lets us diagnose incrementally
  Based on past results, we can pick the next step
    E.g. if the bug doesn’t always repeat, we should look for races

  Stage  Goal
  1      Failure/error type & location
  2      Failure-triggering conditions
  3      Fault-related code & variables
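The rule "if the bug doesn't always repeat, look for races" can be sketched as a small decision function. Only that one rule comes from the slide; the struct, the enum names, and the choice of what to do for a deterministic failure are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of what plain (deterministic) replay has shown. */
struct replay_summary {
    int replays;   /* number of plain re-executions attempted */
    int failures;  /* how many of those re-executions failed again */
};

/* Hypothetical next steps the control unit could choose between. */
enum next_step {
    STEP_NEED_MORE_REPLAYS,    /* nothing observed yet */
    STEP_RACE_DETECTION,       /* e.g. happens-before, lockset, atomicity */
    STEP_DETERMINISTIC_CHECKS  /* e.g. bounds checking, assertion checking */
};

/* Pick the next analysis from past replay results. */
enum next_step choose_next_step(const struct replay_summary *s)
{
    assert(s != NULL);
    if (s->replays == 0)
        return STEP_NEED_MORE_REPLAYS;
    if (s->failures < s->replays)   /* the bug does not always repeat */
        return STEP_RACE_DETECTION;
    return STEP_DETERMINISTIC_CHECKS;
}
```

A fuller protocol would feed the result of each chosen step back into the summary, which is how repeated replay allows the diagnosis to proceed incrementally.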
Trying “what-ifs” with the execution
  Flexible re-execution lets us play with what-ifs
  Three types of re-execution
    Plain – deterministic
    Loose – allow some variance
    Wild – introduce (potentially large) variations
  Delta analysis extracts how the executions differ
Main idea of Triage
  How to get information about the failure?
    Capture the bug with checkpoint/re-execution
    Relive the bug with various diagnostic techniques
  How to decide what to do?
    Use a human-like protocol to select analysis
    Incrementally increase our understanding of the bug
  How to try out “what-if” scenarios?
    Flexible re-execution allows varied executions
    Delta analysis points out what makes them different
Triage Architecture
  [Figure: three components: the Checkpointing Subsystem, the Analysis Tools (e.g. backward slicing, bug detection), and the Control Unit (Protocol).]
Triage vs. Rx
  Both are in memory
  Both support variations in execution
  Triage has no output commit
  Triage has no need for safety
    Can even skip code
  Triage considers why the failure occurs
    Tries to analyze the failure
Failure analysis & delta generation (stage 1 and 2)
  Failure analysis tools (overhead factor):
    Bounds checking (1.1x)
    Assertion checking (1x)
    Happens-before (12x)
    Atomicity detection (60x)
    Static core analysis (1x)
    Taint analysis (2x)
    Dynamic slicing (1000x)
    Symbolic exec. (1000x)
    Lockset analysis (20x)
  Delta generation variations:
    Rearrange allocation
    Drop inputs
    Mutate inputs
    Pad buffers
    Change file state
    Drop code
    Reschedule threads
    Change libraries
    Reorder messages
  The differences caused by variations are useful as well
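Two of the variations listed above, "Drop inputs" and "Mutate inputs", can be sketched on a byte string. This is a minimal illustration that assumes an input is a NUL-terminated string and each byte is one input element; drop_input and mutate_input are hypothetical helpers, not part of Triage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `input` into `out` without the element at index `skip`.
   `out` must have room for at least strlen(input) bytes. */
void drop_input(const char *input, size_t skip, char *out)
{
    size_t len = strlen(input);
    assert(skip < len);
    memcpy(out, input, skip);
    /* Copies the tail after the skipped byte plus the terminator. */
    memcpy(out + skip, input + skip + 1, len - skip);
}

/* Copy `input` into `out` with the element at `pos` overwritten.
   `out` must have room for at least strlen(input) + 1 bytes. */
void mutate_input(const char *input, size_t pos, char replacement, char *out)
{
    size_t len = strlen(input);
    assert(pos < len);
    memcpy(out, input, len + 1);
    out[pos] = replacement;
}
```

Replaying the failing region once per variant and recording which variants still fail would give delta analysis the set of runs it compares.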
Delta analysis
  [Figure: basic blocks executed by two runs. Run 1: A B C D E F G. Run 2: A B C X E G Y.]
  Compute the basic block vector:
    Run 1:      {A:1 B:1 C:1 D:1 X:0 E:1 F:1 G:1 Y:0}
    Run 2:      {A:1 B:1 C:1 D:0 X:1 E:1 F:0 G:1 Y:1}
    Difference: {A:0 B:0 C:0 D:1 X:1 E:0 F:1 G:0 Y:1}
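The basic block vectors can be computed with a few lines of C. This is a minimal sketch that assumes each basic block has a one-character label and a run is recorded as a string of labels; block_vector and vector_diff are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Block labels, in the same order as the vectors on the slide. */
static const char BLOCKS[] = "ABCDXEFGY";
#define NUM_BLOCKS (sizeof BLOCKS - 1)

/* Set vec[i] to 1 if block BLOCKS[i] was executed in `trace`, else 0. */
void block_vector(const char *trace, int vec[NUM_BLOCKS])
{
    assert(trace != NULL);
    for (size_t i = 0; i < NUM_BLOCKS; i++)
        vec[i] = strchr(trace, BLOCKS[i]) != NULL;
}

/* Element-wise difference of two vectors; returns how many blocks differ. */
size_t vector_diff(const int a[NUM_BLOCKS], const int b[NUM_BLOCKS],
                   int out[NUM_BLOCKS])
{
    size_t count = 0;
    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        out[i] = a[i] != b[i];
        count += (size_t)out[i];
    }
    return count;
}
```

For the traces "ABCDEFG" and "ABCXEGY" the difference vector marks D, X, F, and Y. The count returned by vector_diff is one plausible similarity measure for ranking pairs of runs, although the slides do not say which measure Triage uses.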
Delta analysis
  From delta generation’s many runs, Triage finds the “most similar”
    Compare the basic block vectors
  Triage will diff the two closest runs
    The minimum edit distance, aka shortest edit script
A B C D E F G
- ^ V
A B C X E G Y
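The minimum edit distance between the two traces above can be computed with the classic dynamic-programming recurrence. This sketch assumes Levenshtein distance, where substitutions are allowed; a diff-style edit script that permits only insertions and deletions would give a different count. MAX_TRACE and edit_distance are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TRACE 256  /* illustrative upper bound on trace length */

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance between two block traces, using one DP row. */
size_t edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    assert(lb < MAX_TRACE);

    size_t row[MAX_TRACE];
    for (size_t j = 0; j <= lb; j++)
        row[j] = j;                     /* distance from the empty prefix */

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];           /* D[i-1][j-1] */
        row[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t up = row[j];         /* D[i-1][j] */
            size_t cost = a[i - 1] == b[j - 1] ? 0 : 1;
            row[j] = min3(up + 1,       /* delete a[i-1] */
                          row[j - 1] + 1, /* insert b[j-1] */
                          diag + cost); /* match or substitute */
            diag = up;
        }
    }
    return row[lb];
}
```

For "ABCDEFG" and "ABCXEGY" the distance is 3: replace D with X, delete F, and append Y.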
A bug in TAR

char *get_directory_contents (char *path, dev_t device)
{
  struct accumulator *accumulator;

  /* Recursively scan the given PATH.  */
  {
    char *dirp = savedir (path);
    char const *entry;
    size_t entrylen;
    char *name_buffer;
    size_t name_buffer_size;
    size_t name_length;
    struct directory *directory;
    enum children children;

    if (! dirp)
      savedir_error (path);
    errno = 0;

    name_buffer_size = strlen (path) + NAME_FIELD_SIZE;
    name_buffer = xmalloc (name_buffer_size + 2);
    strcpy (name_buffer, path);
    if (! ISSLASH (path[strlen (path) - 1]))
      strcat (name_buffer, "/");
    name_length = strlen (name_buffer);

    directory = find_directory (path);
    children = directory ? directory->children : CHANGED_CHILDREN;

    accumulator = new_accumulator ();

    if (children != NO_CHILDREN)
      for (entry = dirp;
           (entrylen = strlen (entry)) != 0;
           entry += entrylen + 1)

char *savedir (const char *dir)
{
  DIR *dirp;
  struct dirent *dp;
  char *name_space;
  size_t allocated = NAME_SIZE_DEFAULT;
  size_t used = 0;
  int save_errno;

  dirp = opendir (dir);
  if (dirp == NULL)
    return NULL;

  name_space = xmalloc (allocated);

  errno = 0;
  while ((dp = readdir (dirp)) != NULL)
    {
      char const *entry = dp->d_name;
      if (entry[entry[0] != '.' ? 0 : entry[1] != '.' ? 1 : 2] != '\0')
        {
          size_t entry_size = strlen (entry) + 1;
          if (used + entry_size < used)
            xalloc_die ();
          if (allocated <= used + entry_size)
            {
              do
                {
                  if (2 * allocated < allocated)
                    xalloc_die ();
                  allocated *= 2;
                }
              while (allocated <= used + entry_size);

Slide annotations: Segmentation fault (null pointer dereference); Execution difference
Sample Triage report
  Failure point
    Segfault in lib strlen
    Stack & heap OK
  Bug detection
    Deterministic bug
    Null pointer at incremen.c:207
  Fault propagation
    dirp = opendir (dir);
    if (dirp == NULL) return NULL;
    dirp = savedir (path);
    entry = dirp;
    strlen(entry)
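The propagation chain in the report (a NULL listing reaching strlen) can be reproduced in miniature, together with one possible guard. This is an illustrative sketch, not the actual fix applied to tar: fake_savedir is a hypothetical stand-in for savedir() that avoids touching the file system, and scan_listing is a hypothetical caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for savedir(): returns NUL-separated entry names
   terminated by an empty string, or NULL if the directory cannot be opened. */
static const char *fake_savedir(int can_open)
{
    return can_open ? "a.txt\0b.txt\0" : NULL;
}

/* Walk the listing the way the tar loop does, but refuse a NULL listing
   instead of passing it to strlen(). Returns false if there is no listing;
   otherwise stores the number of entries in *count and returns true. */
bool scan_listing(const char *dirp, size_t *count)
{
    assert(count != NULL);
    if (dirp == NULL)
        return false;   /* guard: the faulty path continues past this point */

    size_t n = 0, entrylen;
    for (const char *entry = dirp;
         (entrylen = strlen(entry)) != 0;
         entry += entrylen + 1)
        n++;
    *count = n;
    return true;
}
```

Without the guard, the unopenable-directory case would call strlen(NULL), which matches the segfault inside strlen that the report lists as the failure point.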
Results – Human Study
  We tested Triage with a human study
    15 programmers drawn from faculty, research programmers, and graduate students
      No undergraduates!
    Measured time to repair bugs, with/without Triage
    Everybody got core dumps, sample inputs, instructions on how to replicate, and access to many debugging tools
      Including Valgrind
  3 simple toy bugs, & 2 real bugs
    The TAR bug you just saw
    A copy-paste error in BC
Time to fix a bug
  We hope that the report is easy to check
  We cut out the reproduction step
    This is quite unfair to Triage
  Also, we put a time limit
    Over time is counted as max time
  [Figure: without Triage: reproduce, find failure, find error, find fault, fix it. With Triage: check Triage report, fix it.]
Results – Human study
  For the real bugs, Triage strongly helps (47%)
  Better than 99.99% confidence that time-to-fix with Triage < without
Results – Other Bugs

  Bug        Δ Generation    Δ Analysis  Dynamic Slicing
  Apache     Input element   12%         8 instructions
  Apache     Input element   69%         3 instructions
  CVS        --              --          4 functions
  MySQL      interleaving    --          --
  Squid      1 character     71%         6 instructions
  BC         array padding   98%         3 instructions
  Linux-ext  --              --          6 instructions
  MAN        --              --          9 functions
  NCOMP      --              --          5 instructions
  TAR        file perms      68%         6 instructions
Results – Normal Run Overhead
  Identical to checkpoint system (Rx) overhead
  Under 5%
Results – Diagnosis Overhead
  CPU bound is the worst case
    Still reasonable because we’re only redoing 200ms
  Delta analysis is somewhat costly
    Should be run in the background
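A back-of-the-envelope estimate shows why redoing only 200 ms keeps even expensive tools tolerable. It assumes, purely for illustration, that the overhead factors quoted on the failure analysis slide apply uniformly to the replayed window; these are not measured results.

```latex
t_{\text{analysis}} \approx s \cdot t_{\text{replay}}, \qquad t_{\text{replay}} = 200\ \text{ms}

\text{Bounds checking: } 1.1 \times 0.2\ \text{s} = 0.22\ \text{s}
\text{Atomicity detection: } 60 \times 0.2\ \text{s} = 12\ \text{s}
\text{Dynamic slicing: } 1000 \times 0.2\ \text{s} = 200\ \text{s}
```

Under this assumption the 1000x tools finish in a few minutes, whereas the same factor applied to hours of normal execution would be impractical.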
Related work
  Checkpointing & re-execution
    Zap [Osman, OSDI’02], TTVM [King, USENIX’05]
  Bug detection & diagnosis
    Valgrind [Nethercote], CCured [Necula, POPL’02], Purify [Hastings, USENIX’92]
    Eraser [Savage, TOCS’97], [Netzer, PPoPP’91]
    Backward slicing [Weiser, CACM’82]
    Innumerable others
  Execution variation
    Input variation
      Delta debugging [Zeller, FSE’02], Fuzzing [B. So]
    Environment variation
      Rx [Qin, SOSP’05], DieHard [Berger, PLDI’06]
Conclusions & Future Work
  On-site diagnosis can be made feasible
    Checkpoint can effectively capture the failure
    Expensive off-line analysis can be done on-site
    Privacy issues are minimized
  Also useful for in-house testing
    Reduces the manual portion of analysis
  Future work
    Automatic bug hot fixes
    Visualization of delta analysis
Thank you
Questions?
Special thanks to Hewlett-Packard for student scholarship support.
This work was supported by NSF, DoE, and Intel.