04/10/23 Data Mining: Principles and Algorithms
Data Mining: Concepts and Techniques
— Chapter 11 —
— Software Bug Mining —
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgement: Chao Liu
Outline
Automated Debugging and Failure Triage
SOBER: Statistical Model-Based Fault Localization
Fault Localization-Based Failure Triage
Copy and Paste Bug Mining
Conclusions & Future Research
Software Bugs Are Costly
Software is “full of bugs”: Windows 2000, with 35 million lines of code, had 63,000 known bugs at the time of release, about 2 per 1,000 lines.

Software failures are costly:
- The Ariane 5 explosion was due to “errors in the software of the inertial reference system” (Ariane-5 flight 501 inquiry board report, http://ravel.esrin.esa.it/docs/esa-x-1819eng.pdf)
- A study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually (http://www.nist.gov/director/prog-ofc/report02-3.pdf)

Testing and debugging are laborious and expensive: “50% of my company employees are testers, and the rest spends 50% of their time testing!” (Bill Gates, 1995)
Automated Failure Reporting
End-users act as beta testers, providing valuable information about failure occurrences in reality: 24.5 million reports/day in Redmond if all users send them (John Dvorak, PC Magazine).

Widely adopted because of its usefulness: Microsoft Windows, Linux Gentoo, Mozilla applications, ...

Any application can implement this functionality.
After Failures Are Collected: Failure Triage

Failure triage:
- Failure prioritization: which are the most severe bugs?
- Failure assignment: which developers should debug a given set of failures?

Automated debugging:
- Where is the likely bug location?
A Glimpse on Software Bugs
Crashing bugs:
- Symptoms: segmentation faults
- Reasons: memory access violations
- Tools: Valgrind, CCured

Noncrashing bugs:
- Symptoms: unexpected outputs
- Reasons: logic or semantic errors, e.g. if ((m >= 0)) vs. if ((m >= 0) && (m != lastm)); < vs. <=, > vs. >=; j = i vs. j = i + 1
- Tools: no sound tools exist
Semantic Bugs Dominate

Bug distribution [Li et al., ICSE’07] (264 bugs in Mozilla and 98 bugs in Apache manually checked; 29,000 bugs in Bugzilla automatically checked):

- Semantic bugs: 78%. Application specific; only a few are detectable; mostly require annotations or specifications
- Memory-related bugs: 16%. Many are detectable
- Concurrency bugs: 3%
- Others: 3%

Courtesy of Zhenmin Li
Hacking Semantic Bugs is HARD
Major challenge: no crashes, hence no failure signatures and no debugging hints.

Major methods:
- Statistical debugging of semantic bugs [Liu et al., FSE’05, TSE’06]
- Triaging noncrashing failures through statistical debugging [Liu et al., FSE’06]
Outline
Automated Debugging and Failure Triage
SOBER: Statistical Model-Based Fault Localization
Fault Localization-Based Failure Triage
Copy and Paste Bug Mining
Conclusions & Future Research
A Running Example
Buggy version (the subclause (lastm != m) is missing from the first condition):

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if (m >= 0) {
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            i = i + 1;
        } else
            i = m;
    }
}

Correct version:

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0) && (lastm != m)) {
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            i = i + 1;
        } else
            i = m;
    }
}
130 of 5542 test cases fail, no crashes
Predicate                     # of true   # of false
(lin[i] != ENDSTR) == true        5           1
Ret_amatch < 0                    5           1
Ret_amatch == 0                   1           5
Ret_amatch > 0                    1           5
(m >= 0) == true                  4           2
(m == i) == true                  2           4
(m >= -1) == true                 1           5
Predicate evaluation as tossing a coin
Profile Executions as Vectors
Each execution is profiled as a vector of (true, false) counts, one pair per predicate:

(5, 1) (1, 5) (1, 5) (4, 2) (1, 5) (2, 4) (1, 5)          passing execution
(19, 1) (1, 19) (1, 19) (18, 2) (1, 19) (2, 18) (1, 19)   passing execution
(9, 1) (1, 9) (1, 9) (8, 2) (8, 2) (2, 8) (1, 9)          failing execution

Extreme case: a predicate is always false in passing runs and always true in failing runs.

Generalized case: different true probabilities in passing and failing executions.
Estimated Head Probability

Evaluation bias: the head probability estimated from a single execution. Specifically,

    X = n_t / (n_t + n_f)

where n_t and n_f are the numbers of true and false evaluations of the predicate in that execution. The evaluation bias is defined for each predicate and each execution.
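The estimate above can be written as a small C helper. This is a minimal sketch: the function name is illustrative, and the sentinel for never-evaluated predicates is an assumption, since the slides do not say how that case is handled.

```c
#include <assert.h>

/* Evaluation bias of one predicate in one execution:
 * X = n_t / (n_t + n_f), the fraction of evaluations that were true.
 * Returns -1.0 if the predicate was never evaluated in the run
 * (an assumed convention; callers should skip such predicates). */
double evaluation_bias(int n_true, int n_false)
{
    int n = n_true + n_false;
    if (n == 0)
        return -1.0;
    return (double)n_true / n;
}
```

For example, counts of (4, 2) give a bias of 2/3.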
Divergence in Head Probability

- Multiple evaluation biases come from multiple executions
- Model the evaluation biases as generated from per-class distributions: the failing-run biases F = (f_1, f_2, ..., f_m) are samples from a model f(X | theta_f), and the passing-run biases P = (p_1, p_2, ..., p_n) are samples from f(X | theta_p)

(Figure: histograms of head probability on [0, 1] for passing runs and failing runs)
Major Challenges

- No closed form of either model
- No sufficient number of failing executions to estimate f(X | theta_f)

(Figure: histograms of head probability on [0, 1] for the two classes)
SOBER in Summary

SOBER takes the instrumented source code and a test suite, and outputs a ranking of the instrumented predicates by their estimated relevance to the bug (e.g., Pred2 > Pred6 > Pred1 > Pred3).
Previous State of the Art [Liblit et al., 2005]

Correlation analysis:
  Context(P) = Prob(fail | P ever evaluated)
  Failure(P) = Prob(fail | P ever evaluated as true)
  Increase(P) = Failure(P) - Context(P): how much more likely the program fails when P is ever evaluated true
Liblit05 in Illustration

(Figure: failing (+) and passing (O) executions; P is evaluated in 10 runs, 4 of which fail, and evaluates true in 7 runs, 3 of which fail.)

Context(P) = Prob(fail | P ever evaluated) = 4/10 = 2/5
Failure(P) = Prob(fail | P ever evaluated as true) = 3/7
Increase(P) = Failure(P) - Context(P) = 3/7 - 2/5 = 1/35
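The three scores can be computed directly from per-run observations. Below is a minimal C sketch with an illustrative struct layout; it is not Liblit's actual implementation.

```c
#include <assert.h>

/* Per-run observation of one predicate P: was P evaluated at all,
 * was it ever true, and did the run fail? */
struct run { int evaluated, ever_true, failed; };

/* Increase(P) = Failure(P) - Context(P), as defined above. */
double increase(const struct run *runs, int n)
{
    int eval = 0, eval_fail = 0, ever = 0, ever_fail = 0;
    for (int i = 0; i < n; i++) {
        if (runs[i].evaluated) { eval++; if (runs[i].failed) eval_fail++; }
        if (runs[i].ever_true) { ever++; if (runs[i].failed) ever_fail++; }
    }
    double context = eval ? (double)eval_fail / eval : 0.0;
    double failure = ever ? (double)ever_fail / ever : 0.0;
    return failure - context;
}
```

With the numbers above (P evaluated in 10 runs, 4 failing; ever true in 7 runs, 3 failing), the score is 3/7 - 2/5 = 1/35.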
SOBER in Illustration

(Figure: the same failing (+) and passing (O) executions, now summarized by two histograms of evaluation bias on [0, 1], one for failing runs and one for passing runs.)
Difference between SOBER and Liblit05

Methodology:
- Liblit05: correlation analysis
- SOBER: model-based approach

Utilized information:
- Liblit05: is the predicate ever true in a run?
- SOBER: what percentage of its evaluations is true?

void subline(char *lin, char *pat, char *sub)
{
1    int i, lastm, m;
2    lastm = -1;
3    i = 0;
4    while ((lin[i] != ENDSTR)) {
5        m = amatch(lin, i, pat, 0);
6        if (m >= 0) {
7            putsub(lin, i, m, sub);
8            lastm = m;
9        }
10   }
11 }

On this example:
- Liblit05: Line 6 is ever true in most passing and failing executions
- SOBER: Line 6 is prone to be true in failing executions and prone to be false in passing executions
(Figure: control-flow graph of subline, with nodes lastm = -1; i = 0; while (lin[i] != ENDSTR); m = amatch(lin, i, pat, 0); if (m >= 0); lastm = m; if (m == -1 || m == i); i = i + 1; i = m.)
T-Score: Metric of Debugging Quality
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if ((m >= 0)){
lastm = m;
}
if ((m == -1) || (m == i)) {
i = i + 1;
} else
i = m;
}
}
T-score = 70%: the developer must examine about 70% of the code, starting from the blamed location, before reaching the real bug. T-score thus measures how close the blamed location is to the real bug location.
(Figure: the same control-flow graph of subline, with a different node blamed.)
A Better Debugging Result
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if ((m >= 0)){
lastm = m;
}
if ((m == -1) || (m == i)) {
i = i + 1;
} else
i = m;
}
}
T-score = 40%
Evaluation 1: Siemens Program Suite

- Siemens program suite: 130 buggy versions of 7 small (<700 LOC) programs
- Metric: what percentage of bugs can be located when no more than a given percentage of the code is examined
- A T-score <= 20% is considered meaningful
Evaluation 2: Reasonably Large Programs

Program (LOC)        Bug    Type                                           Failing runs  T-Score
Flex 2.4.7 (8,834)   Bug 1  Misuse of >= for >                             163/525       0.5%
                     Bug 2  Misuse of = for ==                             356/525       1.6%
                     Bug 3  Mis-assigned value: true for false             69/525        7.6%
                     Bug 4  Mis-parenthesized ((a||b)&&c) as (a||(b&&c))   22/525        15.4%
                     Bug 5  Off-by-one                                     92/525        45.6%
Grep 2.2 (11,826)    Bug 1  Off-by-one                                     48/470        0.6%
                     Bug 2  Subclause missing                              88/470        0.2%
Gzip 1.2 (6,184)     Bug 1  Subclause missing                              65/217        0.5%
                     Bug 2  Subclause missing                              17/217        2.9%

Software-artifact Infrastructure Repository (SIR): http://sir.unl.edu
A Glimpse of Bugs in Flex-2.4.7

Bug 1: Misuse of >= for >
#ifndef BUG_1
  if ( performance_report > 0 )
#else
  if ( performance_report >= 0 )
#endif

Bug 2: Mis-parenthesization
#ifndef BUG_2
  if ( (fulltbl || fullspd) && reject )
#else
  if ( fulltbl || (fullspd && reject) )
#endif

Bug 3: Misuse of != for ==
#ifndef BUG_3
  if (s - nextchar == my_strlen (p->name))
#else
  if (s - nextchar != my_strlen (p->name))
#endif

Bug 4: Misuse of = for ==
#ifndef BUG_4
  if ( yymore_really_used == REALLY_USED )
#else
  if ( yymore_really_used = REALLY_NOT_USED )
#endif

Bug 5: Off-by-one
#ifndef BUG_5
  chk[offset] = EOB_POSITION;
#else
  chk[offset - 1] = EOB_POSITION;
#endif
A Close Look: Grep-2.2: Bug 1
static int grep(int fd) { ...541 for( ; ; )542 { ...548 lastnl = bufbeg;549 if (lastout) 550 lastout = bufbeg;
553 beg = bufbeg + save - residue + 1; /* fault 1 */
...574 if (beg != lastout)575 lastout = 0; ... 580 } ... 587 return nlines;588 }
P1470
P1484
• 11,826 lines of C code• 3,136 predicates instrumented• 48 out of 470 cases fail
Grep-2.2: Bug 2

static char **comsubs(char *left, char *right)
{ ...
2264  for (lcp = left; *lcp != '\0'; ++lcp)
2265  {
      ...
2268    while (rcp != NULL)
2269    {
2270      for (i = 1; lcp[i] != '\0' /* && lcp[i] == rcp[i] */; ++i)  /* fault 2 */
2271        continue;
        ...
2275    }
      ...
2280  }
2281  return cpp;
2282 }

- 11,826 lines of C code
- 3,136 predicates instrumented
- 88 out of 470 cases fail
- Highlighted predicate: P1952
No Silver Bullet: Flex Bug 5

132 void genctbl()
{ ...
177   for ( i = 0; i <= lastdfa; ++i )
178   {
179     int anum = dfaacc[i].dfaacc_state;
180     int offset = base[i];
#ifndef BUG_5
        chk[offset] = EOB_POSITION;
#else
        chk[offset - 1] = EOB_POSITION;
#endif
183     chk[offset - 1] = ACTION_POSITION;
184     nxt[offset - 1] = anum;  /* action number */
185   }
      ...
224 }

- No wrong value is left in chk[offset - 1]: line 183 overwrites it
- chk[offset] is not used here but later
- 8,834 lines of C code
- 2,699 predicates instrumented
Experiment Result in Summary

SOBER is effective for bugs demonstrating abnormal control flows (see the results table above).
SOBER Handles Memory Bugs As Well
void more_variables ()
{ ...
127   old_count = v_count;
      ...
137   for (indx = 3; indx < old_count; indx++) ...
141   for (; indx < v_count; indx++) ...
}

void more_arrays ()
{ ...
167   arrays = (bc_var_array **) bc_malloc (a_count * sizeof(bc_var_array *));
      ...
      /* Copy the old arrays. */
      for (indx = 1; indx < old_count; indx++)
        arrays[indx] = old_ary[indx];
176   for (; indx < v_count; indx++)
        arrays[indx] = NULL;
      ...
}

bc 1.06:
- Two memory bugs found with SOBER
- One of them was previously unreported
- The blamed location is NOT the crash site
Outline
Automated Debugging and Failure Triage
SOBER: Statistical Model-Based Fault Localization
Fault Localization-Based Failure Triage
Copy and Paste Bug Mining
Conclusions & Future Research
Major Problems in Failure Triage

Failure prioritization:
- Which failures are likely due to the same bug?
- Which bugs are the most severe? (The worst 1% of bugs account for 50% of failures)

Failure assignment:
- Which developer should debug which set of failures?

Courtesy of Microsoft Corporation
A Solution: Failure Clustering

Failure indexing: identify failures likely due to the same bug.

(Figure: failure reports plotted as points and grouped into clusters, ordered from most severe to least severe; each cluster suggests a likely fault, e.g. “Fault in core.io?” or “Fault in function initialize()?”)
The Central Question: A Distance Measure between Failures

Different measures render different clusterings.

(Figure: the same failure points, clustered one way under a distance defined on the X-axis and another way under a distance defined on the Y-axis.)
How to Define a Distance

- Previous work [Podgurski et al., 2003]: T-Proximity, a distance defined on literal trace similarity of the failing runs fail_1, fail_2, ..., fail_m
- Our approach [Liu et al., 2006]: R-Proximity, a distance defined on the likely bug locations reported by SOBER
Why Our Approach Is Reasonable

- Optimal proximity: defined on the root causes RC_1, RC_2, ..., RC_m of the failing runs fail_1, fail_2, ..., fail_m
- Our approach: defined on the likely causes LC_1, LC_2, ..., LC_m produced by automated fault localization from the failing runs and the passing runs pass_1, pass_2, ..., pass_n
R-Proximity: An Instantiation with SOBER

Likely causes (LCs) are predicate rankings: SOBER takes the failing runs fail_1, ..., fail_m together with the passing runs pass_1, ..., pass_n and produces, for each failing run, a ranking of the instrumented predicates (e.g., Pred2 > Pred6 > Pred1 > Pred3). A distance between rankings is then needed.
Distance between Rankings

- Traditional Kendall’s tau distance: the number of preference disagreements between two rankings; e.g., Dist((P1, P3, P2), (P2, P3, P1)) = 2, since the two rankings disagree on the P1/P2 and P1/P3 orders
- Not all predicates need to be considered equally: predicates are uniformly instrumented, but only fault-relevant predicates count
Predicate Weighting in a Nutshell

- Fault-relevant predicates receive higher weights
- Fault-relevance is implied by the rankings themselves: predicates favored (ranked near the top) by most rankings, e.g. Pred2 in the rankings (Pred2, Pred6, Pred1, Pred3) and (Pred2, Pred1, Pred3, Pred6), receive higher weights
Automated Failure Assignment

- Most-favored predicates indicate the agreed bug location for a group of failures
- Predicate spectrum graph: for each predicate index, plot how often that predicate is ranked near the top across the group’s rankings; e.g., for the rankings (Pred2, Pred6, Pred1, Pred3), repeated twice, and (Pred2, Pred1, Pred3, Pred6), repeated twice, Pred2 dominates the spectrum
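The spectrum can be sketched as a simple counting pass over the group's rankings; the flat row-major layout and the function name are illustrative assumptions.

```c
#include <assert.h>

/* Count how often each predicate id appears among the top-k entries of
 * nruns rankings (each ranking has m entries, stored row-major). */
void spectrum(const int *rankings, int nruns, int m, int k,
              int *count, int npreds)
{
    for (int p = 0; p < npreds; p++) count[p] = 0;
    for (int r = 0; r < nruns; r++)
        for (int i = 0; i < k && i < m; i++)
            count[rankings[r * m + i]]++;
}
```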
Case Study 1: Grep-2.2
470 test cases in total; 136 fail, with no crashes: 48 due to Fault 1 and 88 due to Fault 2.

Fault 1: an off-by-one error in grep.c:

static int grep(int fd)
{ ...
541   for ( ; ; )
542   { ...
548     lastnl = bufbeg;
549     if (lastout)
550       lastout = bufbeg;
551     if (buflim - bufbeg == save)
552       break;
553     beg = bufbeg + save - residue + 1;  /* fault 1 */
554     for (lim = buflim; lim > beg && lim[-1] != '\n'; --lim)
555       ;
        ...
574     if (beg != lastout)
575       lastout = 0;
576     save = residue + lim - beg;
        ...
580   }
      ...
587   return nlines;
588 }

Fault 2: a subclause-missing error in dfa.c:

static char **comsubs(char *left, char *right)
{ ...
2264  for (lcp = left; *lcp != '\0'; ++lcp)
2265  {
2266    len = 0;
2267    rcp = index(right, *lcp);
2268    while (rcp != NULL)
2269    {  /* fault 2 */
2270      for (i = 1; lcp[i] != '\0' /* && lcp[i] == rcp[i] */; ++i)
2271        continue;
2272      if (i > len)
2273        len = i;
2274      rcp = index(rcp + 1, *lcp);
2275    }
2276    if (len == 0)
2277      continue;
2278    if ((cpp = enlist(cpp, lcp, len)) == NULL)
2279      break;
2280  }
2281  return cpp;
2282 }

Highlighted predicates: P1470, P1484, P1952
Failure Proximity Graphs

- Red crosses are failures due to Fault 1; blue circles are failures due to Fault 2
- Behaviors diverge even under the same fault
- Clustering is better under R-Proximity than under T-Proximity

(Figure: failure proximity graphs under T-Proximity and R-Proximity)
Guided Failure Assignment

What predicates are favored in each group?
Assign Failures to Appropriate Developers

- The 21 failing cases in Cluster 1 are assigned to the developers responsible for the function grep (Fault 1)
- The 112 failing cases in Cluster 2 are assigned to the developers responsible for the function comsubs (Fault 2)
Case Study 2: Gzip-1.2.3
217 test cases in total; 82 fail, with no crashes: 65 due to Fault 1 and 17 due to Fault 2.

Fault 1, in deflate() (the subclause prev_length < max_lazy_match is missing):

661 ulg deflate()
{ ...
      /* Fault 1 */
686   if (hash_head != NIL /* && prev_length < max_lazy_match */
687       && strstart - hash_head <= MAX_DIST) { ...
692     match_length = longest_match (hash_head);
      } ...
707   if (prev_length >= MIN_MATCH && match_length <= prev_length) { ...
711     flush = ct_tally(strstart-1-prev_match, prev_length - MIN_MATCH); ...
719     strstart++; ...
732   } else if (match_available) { ...
738     if (ct_tally (0, window[strstart-1])) { ...
741       strstart++; ...
743   } else { ...
750   } ...
}

Fault 2, in deflate_fast() (the subclause strstart - hash_head <= MAX_DIST is missing):

580 local ulg deflate_fast()
{ ...
      /* Fault 2 */
596   if (hash_head != NIL /* && strstart - hash_head <= MAX_DIST */) { ...
601     match_length = longest_match (hash_head);
      } ...
605   if (match_length >= MIN_MATCH) { ...
608     flush = ct_tally(strstart-match_start, match_length - MIN_MATCH); ...
610     lookahead -= match_length; ...
615     if (match_length <= max_insert_length) { ...
626       strstart++; ...
635     }
636   } else { ...
639     flush = ct_tally (0, window[strstart]); ...
642   } ...
653   return FLUSH_BLOCK(1); /* eof */
654 }
Failure Proximity Graphs

- Red crosses are failures due to Fault 1; blue circles are failures due to Fault 2
- Nearly perfect clustering under R-Proximity, enabling accurate failure assignment

(Figure: failure proximity graphs under T-Proximity and R-Proximity)
Outline
Automated Debugging and Failure Triage
SOBER: Statistical Model-Based Fault Localization
Fault Localization-Based Failure Triage
Copy and Paste Bug Mining
Conclusions & Future Research
Mining Copy-Paste Bugs

Copy-pasting is common:
- 12% in the Linux file system [Kasper2003]
- 19% in the X Window system [Baker1995]

Copy-pasted code is error prone:
- Among 35 errors in Linux drivers/i2o, 34 are caused by copy-paste [Chou2001]

Simplified example from linux-2.6.6/arch/sparc/prom/memory.c:

void __init prom_meminit(void)
{
  ......
  for (i = 0; i < n; i++) {
    total[i].adr   = list[i].addr;
    total[i].bytes = list[i].size;
    total[i].more  = &total[i+1];
  }
  ......
  for (i = 0; i < n; i++) {
    taken[i].adr   = list[i].addr;
    taken[i].bytes = list[i].size;
    taken[i].more  = &total[i+1];   /* forgot to change! */
  }
}
An Overview of Copy-Paste Bug Detection
1. Parse source code and build a sequence database
2. Mine for basic copy-pasted segments
3. Compose larger copy-pasted segments
4. Prune false positives
Parsing Source Code

- Purpose: build a sequence database
- Idea: map each statement to a number
- Tokenize each component: different operators, constants, and keywords map to different tokens
- Handle identifier renaming: identifiers of the same type map to the same token

Example: "old = 3;" and "new = 3;" both tokenize to (5, 61, 20) and hash to the same value, 16.
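The token-and-hash step can be sketched as follows. The token values mirror the slide's example but are otherwise arbitrary, and the djb2-style hash is an illustrative stand-in for CP-Miner's actual hash function.

```c
#include <assert.h>

/* Illustrative token classes: all identifiers of the same kind share
 * one token, so "old = 3;" and "new = 3;" tokenize identically. */
enum { T_IDENT = 5, T_ASSIGN = 61, T_CONST = 20 };

/* Hash a statement's token sequence down to one number. */
unsigned hash_tokens(const int *toks, int n)
{
    unsigned h = 5381;
    for (int i = 0; i < n; i++)
        h = h * 33 + (unsigned)toks[i];
    return h;
}
```

Since both statements tokenize to (5, 61, 20), they receive the same hash value and become the same element of the sequence database.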
Building the Sequence Database

- The whole program is one long sequence, but the miner needs a sequence database
- Cut the long sequence: the naïve method uses fixed lengths; our method cuts at basic blocks

Example: the two loops above hash to the statement sequences 65 (16, 16, 71) each, giving the final sequence DB: (65)(16, 16, 71) ... (65)(16, 16, 71)
Mining for Basic Copy-Pasted Segments

total[i].adr = list[i].addr; total[i].bytes = list[i].size; total[i].more = &total[i+1];
taken[i].adr = list[i].addr; taken[i].bytes = list[i].size; taken[i].more = &total[i+1];

- Apply a frequent-sequence mining algorithm to the sequence database
- Modification: constrain the maximum gap, so that (16, 16, 71) still matches (16, 16, 10, 71), a frequent subsequence with one inserted statement (gap = 1)
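The gap constraint can be illustrated with a backtracking subsequence test. This is a sketch only: CP-Miner builds the constraint into the frequent-sequence mining itself rather than checking it afterwards, and the names here are illustrative.

```c
#include <assert.h>

/* Does pat[j..m-1] match in seq, with each matched element at most
 * maxgap positions after the previous match (prev)? */
static int match_at(const int *seq, int n, const int *pat, int m,
                    int j, int prev, int maxgap)
{
    if (j == m) return 1;                       /* whole pattern matched */
    int lo = (j == 0) ? 0 : prev + 1;
    int hi = (j == 0) ? n : prev + maxgap + 2;  /* gap = i - prev - 1 */
    if (hi > n) hi = n;
    for (int i = lo; i < hi; i++)
        if (seq[i] == pat[j] && match_at(seq, n, pat, m, j + 1, i, maxgap))
            return 1;
    return 0;
}

/* Does the pattern occur in seq with all gaps <= maxgap? */
int occurs_with_gap(const int *seq, int n, const int *pat, int m, int maxgap)
{
    return match_at(seq, n, pat, m, 0, -1, maxgap);
}
```

With maxgap = 1, the pattern (16, 16, 71) is found in (16, 16, 10, 71); with maxgap = 0 it is not.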
Composing Larger Copy-Pasted Segments

Combine neighboring copy-pasted segments repeatedly: the basic segments (16, 16, 71) (the loop bodies) and 65 (the loop headers) are combined into the larger copy-pasted segments (65)(16, 16, 71):

for (i = 0; i < n; i++) {
  total[i].adr   = list[i].addr;
  total[i].bytes = list[i].size;
  total[i].more  = &total[i+1];
}

for (i = 0; i < n; i++) {
  taken[i].adr   = list[i].addr;
  taken[i].bytes = list[i].size;
  taken[i].more  = &total[i+1];
}
Pruning False Positives

- Unmappable segments: identifier names cannot be mapped to corresponding ones, e.g.
    f (a1);  f (a2);  f (a3);
  vs.
    f1 (b1); f1 (b2); f2 (b3);
  where f maps to both f1 and f2, a conflict
- Tiny segments

For more detail, see Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, in Proc. 6th Symp. Operating Systems Design and Implementation, 2004.
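The unmappable-segment check can be sketched as a consistency test on the identifier mapping between the original segment and its copy; the integer token encoding and the bound of 64 are illustrative assumptions.

```c
#include <assert.h>

/* Every identifier token in the original must map to exactly one
 * identifier token in the copy, and vice versa; a conflict such as
 * f -> f1 and f -> f2 marks the pair as unmappable (a false positive). */
int consistent_mapping(const int *orig, const int *copy, int n)
{
    int fwd[64], bwd[64];
    for (int i = 0; i < 64; i++) fwd[i] = bwd[i] = -1;
    for (int i = 0; i < n; i++) {
        if (fwd[orig[i]] == -1) fwd[orig[i]] = copy[i];
        else if (fwd[orig[i]] != copy[i]) return 0;
        if (bwd[copy[i]] == -1) bwd[copy[i]] = orig[i];
        else if (bwd[copy[i]] != orig[i]) return 0;
    }
    return 1;
}
```

Encoding the example above with f = 1, f1 = 2, f2 = 3, the call sites (1, 1, 1) vs. (2, 2, 3) are reported as unmappable.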
Some Test Results of C-P Bug Detection

Software     LOC     Time      Space    Verified Bugs  Potential Bugs (careless programming)
Linux        4.4 M   20 mins   527 MB   28             21
FreeBSD      3.3 M   20 mins   459 MB   23             8
Apache       224 K   15 secs   30 MB    5              0
PostgreSQL   458 K   38 secs   57 MB    2              0
Outline
Automated Debugging and Failure Triage
SOBER: Statistical Model-Based Fault Localization
Fault Localization-Based Failure Triage
Copy and Paste Bug Mining
Conclusions & Future Research
Conclusions
- Data mining applies to software and computer systems
- Incorrect executions can be identified from program runtime behaviors
- Classification dynamics can give away a “backtrace” for noncrashing bugs without any semantic inputs
- A hypothesis-testing-like approach localizes logic bugs in software; no prior knowledge about the program semantics is assumed
- Lots of other software bug mining methods remain to be explored
Future Research: Mining into Computer Systems
- Huge volume of data from computer systems: persistent state interactions, event logs, network logs, CPU usage, ...
- Mining system data for reliability, performance, manageability, ...
- Challenges in data mining: statistical modeling of computer systems; online operation, scalability, interpretability, ...
References

[DRL+98] David L. Detlefs, K. Rustan M. Leino, Greg Nelson, and James B. Saxe. Extended static checking, 1998.
[EGH+94] David Evans, John Guttag, James Horning, and Yang Meng Tan. LCLint: A tool for using specifications to check code. In Proc. ACM SIGSOFT '94 Symp. Foundations of Software Engineering, pages 87-96, 1994.
[DLS02] Manuvir Das, Sorin Lerner, and Mark Seigle. ESP: Path-sensitive program verification in polynomial time. In Conf. Programming Language Design and Implementation, 2002.
[ECC00] D. R. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th Symp. Operating Systems Design and Implementation, October 2000.
[M93] Ken McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993.
[H97] Gerard J. Holzmann. The model checker SPIN. Software Engineering, 23(5):279-295, 1997.
[DDH+92] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol verification as a hardware design aid. In IEEE Int. Conf. Computer Design: VLSI in Computers and Processors, pages 522-525, 1992.
[MPC+02] M. Musuvathi, D. Y. W. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A pragmatic approach to model checking real code. In Proc. 5th Symp. Operating Systems Design and Implementation, 2002.
References (cont’d)

[G97] P. Godefroid. Model checking for programming languages using VeriSoft. In Proc. 24th ACM Symp. Principles of Programming Languages, 1997.
[BHP+00] G. Brat, K. Havelund, S. Park, and W. Visser. Model checking programs. In IEEE Int'l Conf. Automated Software Engineering (ASE), 2000.
[HJ92] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proc. Winter 1992 USENIX Conference, pages 125-138, San Francisco, California.
Chao Liu, Xifeng Yan, and Jiawei Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining (SDM'06), Bethesda, MD, April 2006.
C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff. SOBER: Statistical model-based bug localization. In Proc. 2005 ACM SIGSOFT Symp. Foundations of Software Engineering (FSE 2005), Lisbon, Portugal, Sept. 2005.
C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for backtrace of noncrashing bugs. In Proc. 2005 SIAM Int. Conf. on Data Mining (SDM'05), Newport Beach, CA, April 2005.
[SN00] Julian Seward and Nick Nethercote. Valgrind, an open-source memory debugger for x86-GNU/Linux. http://valgrind.org/
[LLM+04] Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In Proc. 6th Symp. Operating Systems Design and Implementation, 2004.
[LCS+04] Zhenmin Li, Zhifeng Chen, Sudarshan M. Srinivasan, and Yuanyuan Zhou. C-Miner: Mining block correlations in storage systems. In Proc. 3rd USENIX Conf. File and Storage Technologies, 2004.
Surplus Slides
The remaining slides are leftovers.
Representative Publications
Chao Liu, Long Fei, Xifeng Yan, Jiawei Han and Samuel Midkiff, “Statistical Debugging: A Hypothesis Testing-Based Approach,” IEEE Transactions on Software Engineering, Vol. 32, No. 10, pp. 831-848, Oct. 2006.
Chao Liu and Jiawei Han, “R-Proximity: Failure Proximity Defined via Statistical Debugging,” IEEE Transactions on Software Engineering, Sept. 2006. (under review)
Chao Liu, Zeng Lian and Jiawei Han, "How Bayesians Debug", the 6th IEEE International Conference on Data Mining, pp. 382-393, Hong Kong, China, Dec. 2006.
Chao Liu and Jiawei Han, "Failure Proximity: A Fault Localization-Based Approach", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 286-295, Portland, USA, Nov. 2006.
Chao Liu, "Fault-aware Fingerprinting: Towards Mutualism between Failure Investigation and Statistical Debugging", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, Portland, USA, Nov. 2006.
Chao Liu, Chen Chen, Jiawei Han and Philip S. Yu, "GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis", the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 872-881, Philadelphia, USA, Aug. 2006.
Qiaozhu Mei, Chao Liu, Hang Su and Chengxiang Zhai, "A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs", the 15th International Conference on World Wide Web, pp. 533-542, Edinburgh, Scotland, May, 2006.
Chao Liu, Xifeng Yan and Jiawei Han, "Mining Control Flow Abnormality for Logic Error Isolation", 2006 SIAM International Conference on Data Mining, pp. 106-117, Bethesda, US, April, 2006.
Chao Liu, Xifeng Yan, Long Fei, Jiawei Han and Samuel Midkiff, "SOBER: Statistical Model-Based Bug Localization", the 5th joint meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 286-295, Lisbon, Portugal, Sept. 2005.
William Yurcik and Chao Liu. "A First Step Toward Detecting SSH Identity Theft on HPC Clusters: Discriminating Cluster Masqueraders Based on Command Behavior" the 5th International Symposium on Cluster Computing and the Grid, pp. 111-120, Cardiff, UK, May 2005.
Chao Liu, Xifeng Yan, Hwanjo Yu, Jiawei Han and Philip S. Yu, "Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs", In Proc. 2005 SIAM Int. Conf. on Data Mining, pp. 286-297, Newport Beach, US, April, 2005.
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if (m > 0){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
Example of Noncrashing Bugs
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if (m >= 0){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if ((m >= 0) && (lastm != m) ){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
Debugging Crashes
Crashing Bugs
Bug Localization via Backtrace
Can we circle out the backtrace for noncrashing bugs?
Major challenge
    We do not know where the abnormality happens
Observation
    Classification depends on discriminative features, which can be regarded as a kind of abnormality
    Can we extract a backtrace from classification results?
Outline
Motivation
Related Work
Classification of Program Executions
Extract "Backtrace" from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions
Related Work
Crashing bugs
    Memory access monitoring: Purify [HJ92], Valgrind [SN00], …
Noncrashing bugs
    Static program analysis
    Traditional model checking
    Model checking source code
Static Program Analysis
Methodology
    Examine source code directly
    Enumerate all possible execution paths without running the program
    Check user-specified properties, e.g.,
        free(p) …… (*p)
        lock(res) …… unlock(res)
        receive_ack() …… send_data()
Strengths
    Checks all possible execution paths
Problems
    Shallow semantics
    Properties must be directly mappable to source code structure
Tools
    ESC [DRL+98], LCLint [EGH+94], ESP [DLS02], MC Checker [ECC00], …
Traditional Model Checking
Methodology
    Formally model the system under check in a particular description language
    Exhaustively explore the reachable states to check desired or undesired properties
Strengths
    Models deep semantics
    Naturally fits checking event-driven systems, like protocols
Problems
    Significant manual effort in modeling
    State-space explosion
Tools
    SMV [M93], SPIN [H97], Murphi [DDH+92], …
Model Checking Source Code
Methodology
    Run the real program in a sandbox
    Manipulate event happenings, e.g.,
        message arrivals
        the outcomes of memory allocation
Strengths
    Requires less manual specification
Problems
    Application restrictions, e.g.,
        (still) event-driven programs
        requires a clear mapping between source code and logical events
Tools
    CMC [MPC+02], VeriSoft [G97], Java PathFinder [BHP+00], …
Summary of Related Work
In common, semantic inputs are necessary:
    a program model
    properties to check
Application scenarios:
    shallow semantics
    event-driven systems
Outline
Motivation
Related Work
Classification of Program Executions
Extract "Backtrace" from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions
Example Revisited

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;

    lastm = -1;
    i = 0;
    while (lin[i] != ENDSTR) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0) && (lastm != m)) {   /* condition varies across versions; see below */
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}

The same guard-condition variants as before:
    if (m > 0)                          buggy variant
    if (m >= 0)                         buggy variant: misses the lastm check
    if ((m >= 0) && (lastm != m))       correct

• No memory violations
• Not an event-driven program
• No explicit error properties
Identification of Incorrect Executions
A two-class classification problem
How to abstract program executions?
    Program behavior graphs
    Feature selection: edges + closed frequent subgraphs
Program behavior graph: a function-level abstraction of program behavior

    int main() { ... A(); ... B(); }
    int A() { ... }
    int B() { ... C(); ... }
    int C() { ... }
Values of Classification
A graph classification problem
    Every execution gives one behavior graph
    Two sets of instances: correct and incorrect
Values of classification
    Classification itself does not readily work for bug localization
        The classifier only labels each run as correct or incorrect as a whole
        It does not tell when the abnormality happens
    Successful classification relies on discriminative features
        Can discriminative features be treated as a kind of abnormality?
        When does the abnormality happen? Incremental classification?
Outline
Motivation
Related Work
Classification of Program Executions
Extract "Backtrace" from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions
Incremental Classification
Classification works only when instances of the two classes differ, so classification accuracy can serve as a measure of their difference.
Relate classification dynamics to bug-relevant functions.
Illustration: Precision Boost
[Figure: behavior graphs of one correct execution and one incorrect execution over functions main, A, B, C, D, E, F, G; the incorrect execution additionally reaches a node H]
Bug Relevance
Precision boost
    For each function F: precision boost = exit precision - entrance precision
Intuition
    Differences take place within the execution of F
    The abnormality happens while F is on the stack
The larger the precision boost, the more likely F is part of the backtrace, i.e., a bug-relevant function.
Outline
Related Work
Classification of Program Executions
Extract "Backtrace" from Classification Dynamics
Case Study
Conclusions
Case Study
Subject program
    replace: performs regular expression matching and substitution
    563 lines of C code; 17 functions involved
Execution behaviors
    130 out of 5542 test cases fail to give correct outputs
    No incorrect execution incurs a segmentation fault: this is a logic bug
Can we circle out the backtrace for this bug?

void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;

    lastm = -1;
    i = 0;
    while (lin[i] != ENDSTR) {
        m = amatch(lin, i, pat, 0);
        if (m >= 0) {   /* buggy: should be (m >= 0) && (lastm != m) */
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}
Precision Pairs
Precision Boost Analysis
Objective judgment of bug-relevant functions
    The main function is always bug-relevant
    Stepwise precision boost
    Line-up property
Backtrace for Noncrashing Bugs
Method Summary
Identify incorrect executions from program runtime behaviors.
Classification dynamics can give away a "backtrace" for noncrashing bugs without any semantic inputs.
Data mining can contribute to software engineering and systems research in general.
Outline
Motivation
Related Work
Classification of Program Executions
Extract "Backtrace" from Classification Dynamics
Mining Control Flow Abnormality for Logic Error Isolation
CP-Miner: Mining Copy-Paste Bugs
Conclusions
An Example
Replace program: 563 lines of C code, 20 functions
Symptom: 30 out of 5542 test cases fail to give correct outputs; no crashes
Goal: localize the bug and prioritize manual examination

void dodash(char delim, char *src, int *i, char *dest, int *j, int maxset)
{
    while (…) {
        …
        if (isalnum(isalnum(src[*i+1])) && src[*i-1] <= src[*i+1]) {
            for (k = src[*i-1]+1; k <= src[*i+1]; k++)
                junk = addstr(k, dest, j, maxset);
            *i = *i + 1;
        }
        *i = *i + 1;
    }
}
Difficulty & Expectation
Difficulty
    Statically, even small programs are complex due to dependencies
    Dynamically, execution paths can vary significantly across inputs
    Logic errors have no apparent symptoms
Expectations
    Unrealistic to fully unload developers
    Localize the buggy region
    Prioritize manual examination
Execution Profiling
Full execution trace
    Control flow + value tags
    Too expensive to record at runtime; unwieldy to process
Summarized control flow for conditionals (if, while, for)
    Branch evaluation counts
    Lightweight to take at runtime; easy to process, and effective
Analysis of the Example
Let
    A = isalnum(isalnum(src[*i+1]))
    B = src[*i-1] <= src[*i+1]
An execution is logically correct until (A ∧ ¬B) evaluates true when the evaluation reaches this condition.
If we monitor program conditionals like A, their evaluations shed light on the hidden error and can be exploited for error isolation.

    if (isalnum(isalnum(src[*i+1])) && src[*i-1] <= src[*i+1]) {
        for (k = src[*i-1]+1; k <= src[*i+1]; k++)
            junk = addstr(k, dest, j, maxset);
        *i = *i + 1;
    }
Analysis of Branching Actions
Correct vs. incorrect runs of program P
Across the 5542 test cases, the true evaluation probability of (A ∧ ¬B) is 0.727 in a correct execution and 0.896 in an incorrect execution, on average.
The error location does exhibit detectable abnormal behavior in incorrect executions.

Correct runs:                      Incorrect runs:
           A           ¬A                     A           ¬A
    B      n_AB        n_¬AB           B      n_AB        n_¬AB
    ¬B     n_A¬B = 0   n_¬A¬B          ¬B     n_A¬B ≥ 1   n_¬A¬B
Conditional Test Works for Nonbranching Errors
An off-by-one error can still be detected using conditional tests:

int makepat(char *arg, int start, char delim, char *pat)
{
    …
    if (!junk)
        result = 0;
    else
        result = i + 1;  /* off-by-one error */
                         /* should be: result = i */
    return result;
}
Ranking Based on Boolean Bias
Let input d_i have a desired output o_i; executing P on d_i yields o_i'. P passes test t_i iff o_i' is identical to o_i:
    T_p = {t_i | o_i' = P(d_i) matches o_i}
    T_f = {t_i | o_i' = P(d_i) does not match o_i}
Boolean bias
    Let n_t be the number of times a boolean feature B evaluates true, and n_f the number of times it evaluates false.
    π(B) = (n_t - n_f) / (n_t + n_f)
    π(B) encodes the distribution of B's value: 1 if B always evaluates true, -1 if always false, and in between for all other mixtures.
Evaluation Abnormality
Boolean bias of a branch predicate P: the probability that P is evaluated true within one execution.
Suppose we have n correct and m incorrect executions. For any predicate P we obtain
    an observation sequence for correct runs: S_p = (X'_1, X'_2, …, X'_n)
    an observation sequence for incorrect runs: S_f = (X_1, X_2, …, X_m)
Can we infer whether P is suspicious based on S_p and S_f?
Underlying Populations
Imagine the underlying distributions of boolean bias for correct and incorrect executions are f(X|θ_p) and f(X|θ_f).
S_p and S_f can be viewed as random samples from these underlying populations.
Major heuristic: the larger the divergence between f(X|θ_p) and f(X|θ_f), the more relevant the branch P is to the bug.

[Figure: two probability curves of evaluation bias over [0, 1], one for correct and one for incorrect runs]
Major Challenges
No knowledge of the closed forms of either distribution.
Usually, we do not have enough incorrect executions to estimate f(X|θ_f) reliably.
Our Approach: Hypothesis Testing
Faulty Functions
Motivation
    Bugs are not necessarily on branches
    Function rankings carry higher confidence than branch rankings
Abnormality score for functions
    Calculate the abnormality score for each branch within the function, then aggregate them
Two Evaluation Measures
CombineRank
    Combine the branch scores by summation
    Intuition: a function containing many abnormal branches is likely bug-relevant
UpperRank
    Choose the largest branch score as the representative
    Intuition: a function with one extremely abnormal branch is likely bug-relevant
dodash vs. omatch: which function is likely buggy? And which measure is more effective?
Bug Benchmark
Bug benchmark: the Siemens program suite
    89 faulty variants of 6 subject programs, each of 200-600 LOC
    89 known bugs in total, mainly logic (or semantic) bugs
    Widely used in software engineering research
Results on Program “replace”
Comparison between CombineRank and UpperRank
Buggy function ranked within top-k
Results on Other Programs
More Questions to Be Answered
How to handle multiple errors in one program?
How to detect bugs when only very few failing test cases are available?
Is it really more effective to have more execution traces?
How to integrate program semantics analysis into this statistics-based analysis?