An Automated Approach to Predict Effectiveness of Fault Localization Tools
Tien-Duy B. Le and David Lo
School of Information Systems
Singapore Management University
Will Fault Localization Work For These Failures?
29th IEEE International Conference on Software Maintenance
Fault Localization Tool: A Primer
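[Figure: a developer and a fault localization tool in dialogue]
Developer: My program failed.
Tool: Give me a failing program.
Tool (running): I have calculated the most suspicious locations of bugs: 1) … 2) … 3) … 4) …
Developer (debugging): OK! I will check your suggestion.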
Will Fault Localization Tools Really Work?
• In the ideal case:
– Faulty statements are within a few suspicious statements, e.g., the first 10, 20, or 30
[Figure: the developer debugs using the ranked list 1), 2), 3), 4), … and reports "I found the bug". Label: "Effective"]
Will Fault Localization Tools Really Work?
• In the worst case:
– Faulty statements cannot be found early in the ranked list of statements
– Time consuming
[Figure: the developer works through the ranked list 1), 2), 3), 4), … and is "Debugging Forever". Label: "Effective"]
Will Fault Localization Tools Really Work?
• We build an oracle to predict whether the output of a fault localization tool (i.e., an instance) can be trusted.
• If it cannot be trusted:
– Developers do not have to spend time using the output
– Developers can revert to manual debugging
[Figure: an "Oracle Ball" answering "Trusted or not?"]
Overall Framework
[Figure: Training Stage. Spectra are fed to Fault Localization, which outputs Suspiciousness Scores. (1) Feature Extraction and (2) Model Learning, together with Effectiveness Labels, produce the Model.]
Overall Framework
• Major Components:
– Feature Extraction
  • 50 features, 6 categories
– Model Learning
  • We extend Support Vector Machine (SVM) to handle imbalanced training data.
Feature Extraction
Traces (5 Features)
T1: # traces
T2: # failing traces
T3: # successful traces
…
Program Elements (10 Features)
PE1: # program elements in failing traces
PE2: # program elements in correct traces
PE3: PE2 – PE1
…
Feature Extraction
Raw Scores (10 Features)
R1: Highest suspiciousness score
R2: Second highest suspiciousness score
Ri: i-th highest suspiciousness score
…
Simple Statistics (6 Features)
SS1: Number of distinct scores in the top-10 scores
SS2: Mean of top-10 suspiciousness scores
SS3: Median of top-10 suspiciousness scores
…
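Feature Extraction
Gaps (11 Features)
G1: $R_1 - R_2$
G2: $R_2 - R_3$
Gi: $R_i - R_{i+1}$, where $3 \le i \le 9$
…
Relative Difference (8 Features)
RD1: $RD_1 = \frac{R_2 - R_{10}}{R_1 - R_{10}}$
RDi: $RD_i = \frac{R_{i+1} - R_{10}}{R_1 - R_{10}}$, where $2 \le i \le 8$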
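The score-based features above can be derived mechanically from a list of suspiciousness scores. The following is a minimal Python sketch, not the authors' implementation: the function name `score_features`, the dictionary layout, and the fallback value of 0.0 when R1 equals R10 are assumptions made for illustration. It reads the relative differences as RD_i = (R_{i+1} - R_10) / (R_1 - R_10) for i = 1..8 and covers only the features that the slides spell out (R1..R10, SS1..SS3, G1..G9, RD1..RD8).

```python
from statistics import mean, median


def score_features(scores):
    """Derive score-based features from a list of suspiciousness scores.

    Hypothetical helper: the name and the returned dict layout are
    illustrative and do not come from the slides.
    """
    if len(scores) < 10:
        raise ValueError("need at least 10 suspiciousness scores")

    top = sorted(scores, reverse=True)[:10]      # R1..R10, highest first
    r = {i + 1: top[i] for i in range(10)}       # 1-based access: r[1] is R1

    features = {f"R{i}": r[i] for i in range(1, 11)}   # raw scores

    # Simple statistics over the top-10 scores
    features["SS1"] = len(set(top))   # number of distinct scores
    features["SS2"] = mean(top)
    features["SS3"] = median(top)

    # Gaps between consecutive ranked scores: Gi = Ri - R(i+1)
    for i in range(1, 10):
        features[f"G{i}"] = r[i] - r[i + 1]

    # Relative differences: RDi = (R(i+1) - R10) / (R1 - R10)
    spread = r[1] - r[10]
    for i in range(1, 9):
        # Assumption: use 0.0 when all top-10 scores are equal (spread == 0)
        features[f"RD{i}"] = (r[i + 1] - r[10]) / spread if spread else 0.0

    return features


if __name__ == "__main__":
    example = score_features([10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
    print(example["G1"], example["RD1"], example["SS1"])
```

Keeping the features in a dictionary keyed by the slide names (R1, G1, RD1, and so on) makes it easy to check each value against its definition before flattening everything into a feature vector.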
Model Learning
• Extend off-the-shelf Support Vector Machine
• Imbalanced training data
– #ineffective instances > #effective instances
Extended Support Vector Machine (SVMEXT)
[Figure: Maximum Marginal Hyperplane separating effective instances from ineffective instances]
SVMEXT
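• For each effective instance:
– We calculate its similarities to ineffective instances
– Each instance is represented by a feature vector
– Using cosine similarity:
$p = \frac{\sum_{i=1}^{50} a_i b_i}{\sqrt{\sum_{i=1}^{50} a_i^2}\,\sqrt{\sum_{i=1}^{50} b_i^2}}$, where $-1 \le p \le 1$ and $a_i$, $b_i$ are the $i$-th feature values of the two instances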
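The cosine similarity between two feature vectors can be sketched in a few lines of Python. This is an illustrative helper, not code from the authors: the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # Assumption: treat an all-zero vector as having similarity 0.0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In the setting of the slides, `a` would be the 50-dimensional feature vector of an effective instance and `b` that of an ineffective instance.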
SVMEXT
• Sort effective instances based on their highest similarities with ineffective instances (descending)
• Duplicate effective instances at the top of the list until training data is balanced.
[Figure: effective and ineffective instances, with the selected effective instances highlighted]
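The two steps above (rank effective instances by their highest similarity to any ineffective instance, then duplicate from the top until the classes are balanced) might look as follows. This is a sketch under stated assumptions, not the authors' SVMEXT code: the function names are hypothetical, instances are plain lists of numbers, and wrapping around to the top of the ranking when the deficit exceeds the number of effective instances is a guess about an unspecified detail.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 for an all-zero vector (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def balance_by_duplication(effective, ineffective):
    """Return the effective instances plus duplicates of the ones that are
    most similar to ineffective instances, until both classes have the
    same size. Hypothetical helper name."""
    deficit = len(ineffective) - len(effective)
    if deficit <= 0 or not effective:
        return list(effective)   # already balanced, or nothing to duplicate

    # Rank effective instances by their highest similarity to any
    # ineffective instance, most similar first.
    ranked = sorted(
        effective,
        key=lambda e: max(cosine_similarity(e, x) for x in ineffective),
        reverse=True,
    )

    # Duplicate from the top of the ranking; wrap around if needed (assumption).
    duplicates = [ranked[k % len(ranked)] for k in range(deficit)]
    return list(effective) + duplicates


if __name__ == "__main__":
    eff = [[1.0, 0.0], [0.0, 1.0]]
    ineff = [[0.0, 1.0], [0.0, 2.0], [0.1, 1.0]]
    print(balance_by_duplication(eff, ineff))
```

The balanced training set would then be handed to an ordinary SVM learner. Duplicating the effective instances closest to the ineffective ones puts extra weight on the region where the two classes are hardest to separate.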
Overall Framework
[Figure: Deployment Stage. Spectra are fed to Fault Localization, which outputs Suspiciousness Scores. (1) Feature Extraction and (3) Effectiveness Prediction, using the Model, produce the Prediction.]
Experiments
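• We use 10-fold cross validation.
• We compute precision, recall and F-measure.
$\text{Precision} = \frac{TP}{TP + FP}$
$\text{Recall} = \frac{TP}{TP + FN}$
$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$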
Effectiveness Labeling
• A fault localization instance is deemed effective if:
– The root cause is among the top-10 most suspicious program elements
– If a root cause spans more than one program element, one of them is in the top-10
• Otherwise, it is ineffective
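The labeling rule can be expressed as a small predicate. The sketch below is illustrative only: the function name `is_effective`, the representation of program elements as strings, and the assumption that the ranked list is already sorted from most to least suspicious (with ties broken by list order) are not specified on the slides.

```python
def is_effective(ranked_elements, root_cause_elements, top_n=10):
    """Label a fault localization instance.

    ranked_elements: program elements sorted from most to least suspicious.
    root_cause_elements: the program elements that make up the root cause.
    Returns True (effective) if at least one root-cause element is among
    the top_n ranked elements, False (ineffective) otherwise.
    """
    top = set(ranked_elements[:top_n])
    return any(element in top for element in root_cause_elements)


if __name__ == "__main__":
    ranking = [f"s{i}" for i in range(1, 21)]   # s1 is the most suspicious
    print(is_effective(ranking, {"s3"}))
    print(is_effective(ranking, {"s11", "s15"}))
```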
Dataset
• 10 different programs:
– NanoXML, XML-Security, and Space
– 7 programs from the Siemens test suites
• In total, 200 faulty versions
• For Tarantula, among the 200 instances:
– 85 are effective
– 115 are ineffective
Research Question 1
• How effective is our approach in predicting the effectiveness of a state-of-the-art spectrum-based fault localization tool?
• Experimental setting:
– Tarantula
– Using Extended SVM (SVMEXT)
Research Question 1
• Precision of 54.36%
– Correctly identify 47 out of 115 ineffective fault localization instances
• Recall of 95.29%
– Correctly identify 81 out of 85 effective fault localization instances
Precision Recall F-Measure
54.36% 95.29% 69.23%
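The reported figures can be checked from the counts on this slide. Treating "effective" as the positive class, 81 of 85 effective instances are identified (TP = 81, FN = 4), and 47 of 115 ineffective instances are identified, which leaves 115 - 47 = 68 ineffective instances predicted as effective (FP = 68). The function name below is an illustrative choice; the formulas are the standard definitions of precision, recall, and F-measure.

```python
def precision_recall_f(tp, fp, fn):
    """Standard precision, recall, and F-measure from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts derived from the slide, with "effective" as the positive class
tp = 81            # effective instances correctly identified
fn = 85 - 81       # effective instances missed
fp = 115 - 47      # ineffective instances predicted as effective

precision, recall, f_measure = precision_recall_f(tp, fp, fn)
print(f"{precision:.2%} {recall:.2%} {f_measure:.2%}")
# prints: 54.36% 95.29% 69.23%
```

The high recall and moderate precision show that the model rarely misses an effective instance, but it also labels many ineffective instances as effective.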
Research Question 2
• How effective is our extended Support Vector Machine (SVMEXT) compared with off-the-shelf Support Vector Machine (SVM)?
• Experimental setting:
– Tarantula
– Using extended SVM (SVMEXT) and off-the-shelf SVM
Research Question 2
• Result
SVMEXT outperforms off-the-shelf SVM
Metric SVMEXT SVM Improvement
Precision 54.36% 51.04% 6.50%
Recall 95.29% 57.65% 65.29%
F-Measure 69.23% 54.14% 27.87%
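The Improvement column is consistent with a relative improvement, (SVMEXT - SVM) / SVM, rather than a difference in percentage points. The short check below reproduces the three values from the table; the function name is an illustrative choice.

```python
def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


# (SVMEXT, off-the-shelf SVM) pairs taken from the table, in percent
results = {
    "Precision": (54.36, 51.04),
    "Recall": (95.29, 57.65),
    "F-Measure": (69.23, 54.14),
}

for metric, (svm_ext, svm) in results.items():
    print(f"{metric}: {relative_improvement(svm_ext, svm):.2f}%")
# prints:
# Precision: 6.50%
# Recall: 65.29%
# F-Measure: 27.87%
```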
Research Question 3
• What are the most important features?
– Fisher score is used to measure how dominant and discriminative a feature is.
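The slides do not give the Fisher score formula. One common formulation, assumed here, divides the between-class scatter of a feature by its within-class scatter: the sum over classes of n_c (mean_c - mean)^2, divided by the sum over classes of n_c var_c. The sketch below applies it to a single feature; the function name and the handling of zero within-class variance are illustrative choices, and the authors' exact variant may differ.

```python
import math
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Fisher score of one feature (a common formulation, assumed here).

    values: feature value for each instance.
    labels: class label for each instance (e.g. effective / ineffective).
    """
    overall_mean = mean(values)

    # Group the feature values by class label
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    between = sum(len(g) * (mean(g) - overall_mean) ** 2 for g in groups.values())
    within = sum(len(g) * pvariance(g) for g in groups.values())

    if within == 0:
        # Assumption: a perfectly separating feature gets an infinite score
        return math.inf if between > 0 else 0.0
    return between / within


if __name__ == "__main__":
    print(fisher_score([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1]))
```

A feature whose values differ strongly between the two classes and vary little within each class gets a high score. Ranking all 50 features by this score would yield a top-10 list of the kind reported next.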
Top-10 Most Discriminative Features
Categories of the top-10 features (share of the top 10):
Relative Differences (5): 50%
Program Elements (3): 30%
Raw Scores (1): 10%
Simple Statistics (1): 10%
Gaps (0)
Traces (0)
Ranking:
1. RD7
2. RD8
3. RD6
4. PE1
5. PE2
6. SS1
7. RD5
8. RD1
9. PE4
10. R1
Most Important Features
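• Relative Differences features: RD7 (rank 1), RD8 (rank 2), RD6 (rank 3), RD5 (rank 7), and RD1 (rank 8)
$RD_1 = \frac{R_2 - R_{10}}{R_1 - R_{10}}$
$RD_i = \frac{R_{i+1} - R_{10}}{R_1 - R_{10}}$, where $2 \le i \le 8$
($R_i$: $i$-th highest suspiciousness score)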
Most Important Features
• Program Elements: PE1 (rank 4), PE2 (rank 5), and PE4 (rank 9)
Count Failing Traces Correct Traces
# Program Elements PE1 PE2
$PE_4 = \frac{PE_1}{PE_2}$
Most Important Features
• Simple Statistics
– SS1 (rank 6): Number of distinct suspiciousness scores in {R1,…,R10}
• Raw Scores
– R1 (rank 10): Highest suspiciousness score
Research Question 4
• Could our approach be used to predict the effectiveness of different types of spectrum-based fault localization tools?
• Experimental setting:
– Tarantula, Ochiai, and Information Gain
– Using Extended SVM (SVMEXT)
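For readers unfamiliar with the tools named in this setting, the sketch below shows how Tarantula and Ochiai are commonly defined in the fault localization literature. The formulas are not given on the slides, the function and parameter names are illustrative, and returning 0.0 when a denominator is zero is an assumption. Information Gain is not sketched here.

```python
import math


def tarantula(failed_s, passed_s, total_failed, total_passed):
    """Tarantula suspiciousness of a program element.

    failed_s / passed_s: number of failing / passing traces that execute
    the element. total_failed / total_passed: number of failing / passing
    traces overall.
    """
    fail_ratio = failed_s / total_failed if total_failed else 0.0
    pass_ratio = passed_s / total_passed if total_passed else 0.0
    denominator = fail_ratio + pass_ratio
    # Assumption: elements executed by no trace get a score of 0.0
    return fail_ratio / denominator if denominator else 0.0


def ochiai(failed_s, passed_s, total_failed):
    """Ochiai suspiciousness of a program element."""
    denominator = math.sqrt(total_failed * (failed_s + passed_s))
    return failed_s / denominator if denominator else 0.0


if __name__ == "__main__":
    # An element executed by both failing traces and by no passing trace
    print(tarantula(2, 0, 2, 8), ochiai(2, 0, 2))
```

Both formulas consume the same program spectra and produce one score per program element, which is why the same feature extraction can be reused across tools.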
Research Question 4
Tool Precision Recall F-Measure
Tarantula 54.36% 95.29% 69.23%
Ochiai 63.23% 97.03% 76.56%
Information Gain 64.47% 93.33% 76.26%
• F-Measure for Ochiai and Information Gain
– Greater than 75%
– Our approach can better predict the effectiveness of Ochiai and Information Gain
Research Question 5
• How sensitive is our approach to the amount of training data?
• Experimental setting:
– Vary amount of training data from 10% to 90%
– Random sampling
Research Question 5
[Figure: Precision, Recall, and F-Measure (y-axis, 0% to 120%) plotted against the Amount of Training Data (x-axis, 10% to 90%)]
Conclusion
• We build an oracle to predict the effectiveness of fault localization tools.
– Propose 50 features capturing interesting dimensions from traces and suspiciousness scores
– Propose Extended Support Vector Machine (SVMEXT)
• Experiments
– Achieve good F-Measure: 69.23% (Tarantula)
– SVMEXT outperforms off-the-shelf SVM
– Relative difference features are the best features
Future work
• Improve F-Measure further
• Extend approach to work for other fault localization techniques
• Extract more features from source code and textual descriptions, e.g., bug reports.
Thank you!
Questions? Comments? Advice?
{btdle.2012, davidlo}@smu.edu.sg