defect prediction



    Introduction to Defect Prediction

    Cmpe 589

    Spring 2008


    Problem 2

    How hard will it be for another organization to maintain this software?

    McCabe Complexity


    Problem Definition

    Software development lifecycle: Requirements, Design, Development, Test (takes ~50% of the overall time).

    Detect and correct defects before delivering software.

    Test strategies: expert judgment, manual code reviews, and oracles/predictors as secondary tools.


    Problem Definition


    Testing


    Defect Prediction

    A 2-class classification problem:
        Non-defective if error = 0
        Defective if error > 0

    Two things needed:
        Raw data: source code
        Software metrics -> static code attributes
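    The labeling rule above can be made concrete with a minimal C sketch; the module records and error counts here are made up for illustration, not taken from any dataset:

    #include <stdio.h>

    /* Hypothetical per-module record: a name and its reported error count. */
    struct module { const char *name; int errors; };

    /* 1 = defective (error > 0), 0 = non-defective (error = 0). */
    int label(const struct module *m) { return m->errors > 0 ? 1 : 0; }

    int main(void)
    {
        struct module mods[] = { {"main()", 2}, {"sum()", 0} };
        for (int i = 0; i < 2; i++)
            printf("%s -> %s\n", mods[i].name,
                   label(&mods[i]) ? "defective" : "non-defective");
        return 0;
    }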


    Static Code Attributes

    #include <stdio.h>

    int sum(int a, int b);   /* prototype so main() can call sum() */

    int main()
    {
        // This is a sample code
        // Declare variables
        int a, b, c;
        // Initialize variables
        a = 2;
        b = 5;
        // Find the sum and display c if greater than zero
        c = sum(a, b);
        if (c > 0)
            printf("%d\n", c);
        return 0;
    }

    int sum(int a, int b)
    {
        // Returns the sum of two numbers
        return a + b;
    }


    Module   LOC   LOCC   V   CC   Error
    main()   16    4      5   2    2
    sum()    5     1      3   1    0

    LOC: lines of code
    LOCC: lines of commented code
    V: number of unique operands & operators
    CC: cyclomatic complexity
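    As a minimal sketch of how such attributes can be collected, the following counts LOC and LOCC for a C source file by treating every physical line as LOC and every line starting with "//" as LOCC; the datasets used later in these slides come from dedicated McCabe/Halstead metric-extraction tools, so this only illustrates the idea:

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file.c\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "r");
        if (!f) { perror("fopen"); return 1; }

        char line[1024];
        int loc = 0, locc = 0;
        while (fgets(line, sizeof line, f)) {
            loc++;                                /* every physical line counts as LOC */
            char *p = line + strspn(line, " \t"); /* skip leading whitespace           */
            if (strncmp(p, "//", 2) == 0)         /* "//" comment lines count as LOCC  */
                locc++;
        }
        fclose(f);
        printf("LOC=%d LOCC=%d\n", loc, locc);
        return 0;
    }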


    Defect Prediction

    Machine learning based models:
        Defect density estimation
        Regression models: error proneness
        First classification, then regression
        Defect prediction between versions
        Defect prediction for embedded systems


    Constructing Predictors

    Baseline: Naive Bayes.
    Why? Best reported results so far (Menzies et al., 2007).

    Remove assumptions and construct different models:
        Independent attributes -> multivariate distribution
        Attributes of equal importance -> attribute weighting


    Weighted Naive Bayes

    Naive Bayes:

    g_i(x^t) = -\frac{1}{2} \sum_{j=1}^{d} \left( \frac{x_j^t - m_{ij}}{s_j} \right)^2 + \log \hat{P}(C_i)

    Weighted Naive Bayes:

    g_i(x^t) = -\frac{1}{2} \sum_{j=1}^{d} w_j \left( \frac{x_j^t - m_{ij}}{s_j} \right)^2 + \log \hat{P}(C_i)
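    Here m_ij and s_j are the per-attribute means and standard deviations estimated from training data, \hat{P}(C_i) the class prior, and w_j the attribute weight (setting all w_j = 1 recovers plain Naive Bayes). A minimal sketch of scoring a module with this discriminant follows; the means, deviations, weights, and priors are toy values, not estimates from any real dataset:

    #include <math.h>
    #include <stdio.h>

    #define D 3   /* number of attributes (38 in the real datasets) */
    #define K 2   /* classes: 0 = non-defective, 1 = defective      */

    /* g_i(x) = -0.5 * sum_j w_j * ((x_j - m_ij) / s_j)^2 + log P(C_i) */
    double discriminant(const double x[D], const double m[D],
                        const double s[D], const double w[D], double prior)
    {
        double g = log(prior);
        for (int j = 0; j < D; j++) {
            double z = (x[j] - m[j]) / s[j];
            g -= 0.5 * w[j] * z * z;
        }
        return g;
    }

    int main(void)
    {
        double m[K][D]  = {{10, 2, 3}, {60, 5, 12}};  /* class means (toy)   */
        double s[D]     = {20, 2, 5};                 /* attribute std devs  */
        double w[D]     = {1.0, 0.6, 1.4};            /* attribute weights   */
        double prior[K] = {0.92, 0.08};               /* class priors        */
        double x[D]     = {55, 4, 11};                /* module to classify  */

        int best = 0;
        double gbest = discriminant(x, m[0], s, w, prior[0]);
        for (int i = 1; i < K; i++) {
            double g = discriminant(x, m[i], s, w, prior[i]);
            if (g > gbest) { gbest = g; best = i; }
        }
        printf("predicted class: %s\n", best ? "defective" : "non-defective");
        return 0;
    }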


    Datasets

    Name   # Features   # Modules   Defect Rate (%)
    CM1    38           505         9
    PC1    38           1107        6
    PC2    38           5589        0.6
    PC3    38           1563        10
    PC4    38           1458        12
    KC3    38           458         9
    KC4    38           125         40
    MW1    38           403         9


    Performance Measures

                        Actual defects
                        no      yes
    Predicted   no      A       B
                yes     C       D

    Accuracy: (A + D) / (A + B + C + D)
    pd (hit rate): D / (B + D)
    pf (false alarm rate): C / (A + C)
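    A small sketch computing these measures from the A/B/C/D cells; the counts are invented, and "bal" (used in the result tables that follow) is assumed to be the balance measure of Menzies et al. (2007), i.e. the normalized distance from the ideal point pd = 1, pf = 0:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* example counts: A = true negatives, B = false negatives,
           C = false positives, D = true positives                  */
        double A = 850, B = 20, C = 80, D = 50;

        double accuracy = (A + D) / (A + B + C + D);
        double pd  = D / (B + D);   /* hit rate (recall) */
        double pf  = C / (A + C);   /* false alarm rate  */
        double bal = 1.0 - sqrt((1.0 - pd) * (1.0 - pd) + pf * pf) / sqrt(2.0);

        printf("acc=%.2f pd=%.2f pf=%.2f bal=%.2f\n", accuracy, pd, pf, bal);
        return 0;
    }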


    Results: InfoGain & GainRatio

    Data    WNB+IG (%)        WNB+GR (%)        IG+NB (%)
            pd   pf   bal     pd   pf   bal     pd   pf   bal
    CM1     82   39   70      82   39   70      83   32   74
    PC1     69   35   67      69   35   67      40   12   57
    PC2     72   15   77      66   20   72      72   15   77
    PC3     80   35   71      81   35   72      60   15   70
    PC4     88   27   79      87   24   81      92   29   78
    KC3     80   27   76      83   30   76      48   15   62
    KC4     77   35   70      78   35   71      79   33   72
    MW1     70   38   66      68   34   67      44   07   60
    Avg     77   31   72      77   32   72      65   20   61


    Results: Weight Assignments

    [Chart: cumulative GainRatio feature weights plotted against the enumerated metrics, one curve per dataset: CM1, PC1, PC2, PC3, PC4, KC1, KC3, MW1]


    Benefiting from defect data in practice

    Within-company (WC) vs. cross-company (CC) data
        Investigated in the cost estimation literature
        No studies in defect prediction!
        No conclusions in cost estimation
        Straightforward interpretation of results in defect prediction
        Possible reason: well-defined features


    How much data do we need?

    Consider:
        Dataset size: 1000
        Defect rate: 8%
        Training instances: 90%

    1000 x 8% x 90% = 72 defective instances in the training set,
    versus 900 - 72 = 828 non-defective training instances.


    Intelligent data sampling

    With random sampling of 100 instances we can learn as well as with thousands.

    Can we increase performance with wiser sampling strategies?

    Which data?

    Practical aspects: industrial case study.


    ICSOFT07

    WC vs. CC data? When to use WC or CC?

    How much data do we need to construct a model?


    ICSOFT07


    Module Structure vs Defect Rate

    Fan-in, fan-out
    PageRank algorithm (a sketch follows below)
    Call-graph information on the code
    "Small is beautiful"
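    A minimal sketch of plain PageRank over a toy call graph, where an edge u -> v means "module u calls module v"; the module names and edges are made up, since the slides only state that fan-in/fan-out and PageRank over the call graph were related to defect rates:

    #include <stdio.h>

    #define N    4      /* modules               */
    #define ITER 50     /* power-iteration steps */
    #define DAMP 0.85   /* damping factor        */

    int main(void)
    {
        /* adj[u][v] = 1 if module u calls module v (toy graph) */
        int adj[N][N] = {
            {0, 1, 1, 0},   /* main   -> parse, sum  */
            {0, 0, 1, 1},   /* parse  -> sum, report */
            {0, 0, 0, 1},   /* sum    -> report      */
            {0, 0, 0, 0},   /* report -> (nothing)   */
        };
        const char *name[N] = {"main", "parse", "sum", "report"};

        double rank[N], next[N];
        for (int i = 0; i < N; i++) rank[i] = 1.0 / N;

        for (int it = 0; it < ITER; it++) {
            for (int v = 0; v < N; v++) next[v] = (1.0 - DAMP) / N;
            for (int u = 0; u < N; u++) {
                int out = 0;
                for (int v = 0; v < N; v++) out += adj[u][v];
                if (out == 0) {                 /* dangling node: spread evenly */
                    for (int v = 0; v < N; v++) next[v] += DAMP * rank[u] / N;
                } else {
                    for (int v = 0; v < N; v++)
                        if (adj[u][v]) next[v] += DAMP * rank[u] / out;
                }
            }
            for (int v = 0; v < N; v++) rank[v] = next[v];
        }

        /* high-rank modules are the ones many others depend on */
        for (int v = 0; v < N; v++)
            printf("%-7s rank = %.3f\n", name[v], rank[v]);
        return 0;
    }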


    Performance vs. Granularity

    [Chart: prediction performance (y-axis) vs. granularity (x-axis): Statement, Method, Class, File, Component, Project]