defect prediction



    Introduction to Defect Prediction

    Cmpe 589

    Spring 2008


    Problem 2

    How hard will it be for another organization to maintain this software?

    McCabe Complexity


    Problem Definition

    Software development lifecycle: Requirements, Design, Development, Test (takes ~50% of the overall time).

    Detect and correct defects before delivering software.

    Test strategies: expert judgment, manual code reviews, and oracles/predictors as secondary tools.


    Problem Definition


    Testing


    Defect Prediction

    A 2-class classification problem:
        Non-defective if error = 0
        Defective if error > 0

    Two things needed:
        Raw data: source code
        Software metrics -> static code attributes
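    The labeling rule above can be made concrete with a minimal C sketch; the module records and error counts here are made up for illustration, not taken from any dataset:

    #include <stdio.h>

    /* Hypothetical per-module record: a name and its reported error count. */
    struct module { const char *name; int errors; };

    /* 1 = defective (error > 0), 0 = non-defective (error = 0). */
    int label(const struct module *m) { return m->errors > 0 ? 1 : 0; }

    int main(void)
    {
        struct module mods[] = { {"main()", 2}, {"sum()", 0} };
        for (int i = 0; i < 2; i++)
            printf("%s -> %s\n", mods[i].name,
                   label(&mods[i]) ? "defective" : "non-defective");
        return 0;
    }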


    Static Code Attributes

    #include <stdio.h>

    int sum(int a, int b);   /* prototype so main() can call sum() */

    int main()
    {
        // This is a sample code
        // Declare variables
        int a, b, c;
        // Initialize variables
        a = 2;
        b = 5;
        // Find the sum and display c if greater than zero
        c = sum(a, b);
        if (c > 0)
            printf("%d\n", c);
        return 0;
    }

    int sum(int a, int b)
    {
        // Returns the sum of two numbers
        return a + b;
    }


    Module   LOC   LOCC   V   CC   Error
    main()   16    4      5   2    2
    sum()    5     1      3   1    0

    LOC: lines of code
    LOCC: lines of commented code
    V: number of unique operands & operators
    CC: cyclomatic complexity
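    As a minimal sketch of how such attributes can be collected, the following counts LOC and LOCC for a C source file by treating every physical line as LOC and every line starting with "//" as LOCC; the datasets used later in these slides come from dedicated McCabe/Halstead metric-extraction tools, so this only illustrates the idea:

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file.c\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "r");
        if (!f) { perror("fopen"); return 1; }

        char line[1024];
        int loc = 0, locc = 0;
        while (fgets(line, sizeof line, f)) {
            loc++;                                /* every physical line counts as LOC */
            char *p = line + strspn(line, " \t"); /* skip leading whitespace           */
            if (strncmp(p, "//", 2) == 0)         /* "//" comment lines count as LOCC  */
                locc++;
        }
        fclose(f);
        printf("LOC=%d LOCC=%d\n", loc, locc);
        return 0;
    }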


    Defect Prediction

    Machine learning based models:
        Defect density estimation
        Regression models: error proneness
        First classification, then regression
        Defect prediction between versions
        Defect prediction for embedded systems


    Constructing Predictors

    Baseline: Naive Bayes.
    Why? Best reported results so far (Menzies et al., 2007).

    Remove assumptions and construct different models:
        Independent attributes -> multivariate distribution
        Attributes of equal importance -> attribute weighting


    Weighted Naive Bayes

    Naive Bayes:

    g_i(x^t) = -\frac{1}{2} \sum_{j=1}^{d} \left( \frac{x_j^t - m_{ij}}{s_j} \right)^2 + \log \hat{P}(C_i)

    Weighted Naive Bayes:

    g_i(x^t) = -\frac{1}{2} \sum_{j=1}^{d} w_j \left( \frac{x_j^t - m_{ij}}{s_j} \right)^2 + \log \hat{P}(C_i)
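    Here m_ij and s_j are the per-attribute means and standard deviations estimated from training data, \hat{P}(C_i) the class prior, and w_j the attribute weight (setting all w_j = 1 recovers plain Naive Bayes). A minimal sketch of scoring a module with this discriminant follows; the means, deviations, weights, and priors are toy values, not estimates from any real dataset:

    #include <math.h>
    #include <stdio.h>

    #define D 3   /* number of attributes (38 in the real datasets) */
    #define K 2   /* classes: 0 = non-defective, 1 = defective      */

    /* g_i(x) = -0.5 * sum_j w_j * ((x_j - m_ij) / s_j)^2 + log P(C_i) */
    double discriminant(const double x[D], const double m[D],
                        const double s[D], const double w[D], double prior)
    {
        double g = log(prior);
        for (int j = 0; j < D; j++) {
            double z = (x[j] - m[j]) / s[j];
            g -= 0.5 * w[j] * z * z;
        }
        return g;
    }

    int main(void)
    {
        double m[K][D]  = {{10, 2, 3}, {60, 5, 12}};  /* class means (toy)   */
        double s[D]     = {20, 2, 5};                 /* attribute std devs  */
        double w[D]     = {1.0, 0.6, 1.4};            /* attribute weights   */
        double prior[K] = {0.92, 0.08};               /* class priors        */
        double x[D]     = {55, 4, 11};                /* module to classify  */

        int best = 0;
        double gbest = discriminant(x, m[0], s, w, prior[0]);
        for (int i = 1; i < K; i++) {
            double g = discriminant(x, m[i], s, w, prior[i]);
            if (g > gbest) { gbest = g; best = i; }
        }
        printf("predicted class: %s\n", best ? "defective" : "non-defective");
        return 0;
    }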


    Datasets

    Name   # Features   # Modules   Defect Rate (%)
    CM1    38           505         9
    PC1    38           1107        6
    PC2    38           5589        0.6
    PC3    38           1563        10
    PC4    38           1458        12
    KC3    38           458         9
    KC4    38           125         40
    MW1    38           403         9


    Performance Measures

                        Actual defects
                        no      yes
    Predicted   no      A       B
                yes     C       D

    Accuracy: (A + D) / (A + B + C + D)
    pd (hit rate): D / (B + D)
    pf (false alarm rate): C / (A + C)
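    A small sketch computing these measures from the A/B/C/D cells; the counts are invented, and "bal" (used in the result tables that follow) is assumed to be the balance measure of Menzies et al. (2007), i.e. the normalized distance from the ideal point pd = 1, pf = 0:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* example counts: A = true negatives, B = false negatives,
           C = false positives, D = true positives                  */
        double A = 850, B = 20, C = 80, D = 50;

        double accuracy = (A + D) / (A + B + C + D);
        double pd  = D / (B + D);   /* hit rate (recall) */
        double pf  = C / (A + C);   /* false alarm rate  */
        double bal = 1.0 - sqrt((1.0 - pd) * (1.0 - pd) + pf * pf) / sqrt(2.0);

        printf("acc=%.2f pd=%.2f pf=%.2f bal=%.2f\n", accuracy, pd, pf, bal);
        return 0;
    }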


    Results: InfoGain & GainRatio

    Data    WNB+IG (%)        WNB+GR (%)        IG+NB (%)
            pd   pf   bal     pd   pf   bal     pd   pf   bal
    CM1     82   39   70      82   39   70      83   32   74
    PC1     69   35   67      69   35   67      40   12   57
    PC2     72   15   77      66   20   72      72   15   77
    PC3     80   35   71      81   35   72      60   15   70
    PC4     88   27   79      87   24   81      92   29   78
    KC3     80   27   76      83   30   76      48   15   62
    KC4     77   35   70      78   35   71      79   33   72
    MW1     70   38   66      68   34   67      44   07   60
    Avg     77   31   72      77   32   72      65   20   61


    Results: Weight Assignments

    [Chart: cumulative GainRatio feature weights plotted against the enumerated metrics, one curve per dataset: CM1, PC1, PC2, PC3, PC4, KC1, KC3, MW1]


    Benefiting from defect data in practice

    Within-company (WC) vs. cross-company (CC) data
        Investigated in the cost estimation literature
        No studies in defect prediction!
        No conclusions in cost estimation
        Straightforward interpretation of results in defect prediction
        Possible reason: well-defined features


    How much data do we need?

    Consider:
        Dataset size: 1000
        Defect rate: 8%
        Training instances: 90%

    1000 x 8% x 90% = 72 defective instances in the training set,
    versus 900 - 72 = 828 non-defective training instances.


    Intelligent data sampling

    With random sampling of 100 instances we can learn as well as with thousands.

    Can we increase performance with wiser sampling strategies?

    Which data?

    Practical aspects: industrial case study.


    ICSOFT07

    WC vs. CC data? When to use WC or CC?

    How much data do we need to construct a model?


    ICSOFT07


    Module Structure vs Defect Rate

    Fan-in, fan-out
    PageRank algorithm (a sketch follows below)
    Call-graph information on the code
    "Small is beautiful"
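    A minimal sketch of plain PageRank over a toy call graph, where an edge u -> v means "module u calls module v"; the module names and edges are made up, since the slides only state that fan-in/fan-out and PageRank over the call graph were related to defect rates:

    #include <stdio.h>

    #define N    4      /* modules               */
    #define ITER 50     /* power-iteration steps */
    #define DAMP 0.85   /* damping factor        */

    int main(void)
    {
        /* adj[u][v] = 1 if module u calls module v (toy graph) */
        int adj[N][N] = {
            {0, 1, 1, 0},   /* main   -> parse, sum  */
            {0, 0, 1, 1},   /* parse  -> sum, report */
            {0, 0, 0, 1},   /* sum    -> report      */
            {0, 0, 0, 0},   /* report -> (nothing)   */
        };
        const char *name[N] = {"main", "parse", "sum", "report"};

        double rank[N], next[N];
        for (int i = 0; i < N; i++) rank[i] = 1.0 / N;

        for (int it = 0; it < ITER; it++) {
            for (int v = 0; v < N; v++) next[v] = (1.0 - DAMP) / N;
            for (int u = 0; u < N; u++) {
                int out = 0;
                for (int v = 0; v < N; v++) out += adj[u][v];
                if (out == 0) {                 /* dangling node: spread evenly */
                    for (int v = 0; v < N; v++) next[v] += DAMP * rank[u] / N;
                } else {
                    for (int v = 0; v < N; v++)
                        if (adj[u][v]) next[v] += DAMP * rank[u] / out;
                }
            }
            for (int v = 0; v < N; v++) rank[v] = next[v];
        }

        /* high-rank modules are the ones many others depend on */
        for (int v = 0; v < N; v++)
            printf("%-7s rank = %.3f\n", name[v], rank[v]);
        return 0;
    }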


    Performance vs. Granularity

    [Chart: prediction performance (y-axis) vs. granularity (x-axis): Statement, Method, Class, File, Component, Project]