On Improving the Accuracy of Spectrum-based Fault Localization

Patrick Daniel

Master of Science

2014

Supervisor : Sim Kwan Yong

Co-Supervisor: Dr. Lau Bee Theng

Faculty of Engineering, Computing and Science

Swinburne University of Technology

Sarawak, Malaysia

“Live as if you were to die tomorrow.

Learn as if you were to live forever.”

-Mahatma Gandhi

Abstract

This thesis focuses on improving the accuracy of the Spectrum-based Fault Localization (SBFL) technique, which is used to locate faulty code during the software debugging process. SBFL works by analyzing the code execution information (spectra) of pass and fail test cases, which is gathered during the software testing phase. These spectra are used to score and rank each line of code according to its suspiciousness of being faulty. With this ranking information, software developers inspect the code from the highest-ranked line of code until the faulty code is located. A more accurate SBFL technique ranks the faulty line of code higher than a less accurate one does, thereby reducing the number of lines that need to be inspected before the faulty code is successfully located.

This thesis contributes towards improving the accuracy of the SBFL technique in several ways. First, we evaluate the accuracy of SBFL metrics under extreme compositions of limited pass and fail test cases and discover that certain SBFL metrics perform better under these extreme scenarios. Inspired by the observation that the absence of excessive pass test cases may reduce noisy spectra and improve the accuracy of SBFL metrics, we propose noise reduction schemes that filter test cases providing duplicated, contradicting, or ambiguous information, and we evaluate the resulting accuracy improvements in SBFL metrics. Based on the results of our empirical study, we provide a simple guide for SBFL practitioners to select the best-performing noise reduction scheme for the SBFL metrics that they use. We also propose and develop a novel SBFL tool with a test case pre-processor, which allows SBFL practitioners to apply the noise reduction schemes to pre-process and filter test cases with duplicated, contradicting, or ambiguous spectra prior to applying the SBFL technique.

Finally, we attempt to improve the accuracy of the SBFL technique by proposing a new SBFL metric based on a pair scoring approach. This technique compares the execution paths of every possible pair of pass and fail test cases and assigns a score to each line of code according to its likelihood of being faulty. We evaluate the accuracy of the proposed metric and compare it with existing SBFL metrics. Despite its simplicity, the proposed metric outperformed the majority of the existing SBFL metrics.

Overall, the noise reduction schemes and the new pair scoring SBFL metric proposed in this thesis have successfully achieved the objective of improving the accuracy of SBFL.

Acknowledgements

I would like to express my deepest gratitude to my supervisor, Mr. Sim Kwan Yong. I would never have been able to finish my thesis and research without his guidance, patience, and friendship. I am grateful to my co-supervisor, Dr. Lau Bee Theng, for her support during my master's course. Thanks go to Dr. Patrick Then Hang Hui and Mr. Valliappan Raman as well.

I would like to thank my parents, Muljadi Kusuma and Herni, for their love, support, and best wishes. I am also grateful to my aunt, Dr. Sari Kusumawaty, for her motivation and for funding my studies. Thank you to Sheilla for her care and support. Thanks go to Tang Siong Lik as well.

Patrick Daniel

Kuching, Sarawak, Malaysia

May 2014

Declaration

I declare that this thesis contains no material that has been accepted for the award of any other degree or diploma and, to the best of my knowledge, contains no material previously published or written by another person except where due reference is made in the text of this thesis.

Patrick Daniel

12th May 2014

List of Publications

This thesis is largely based on original work jointly conducted with my supervisor, Sim Kwan Yong. The results of the research have been presented in the following four peer-reviewed publications:

1. P. Daniel and K. Y. Sim, "Debugging in the Extreme: Spectrum-based Fault Localization with Limited Test Cases," International Journal of Software Engineering and Its Applications (IJSEIA), Vol. 7, No. 5, 2013, pp. 403-412. (Scopus indexed)

2. P. Daniel and K. Y. Sim, "Noise Reduction for Spectrum-based Fault Localization," International Journal of Control and Automation (IJCA), Vol. 6, No. 5, 2013, pp. 117-126. (Scopus and EI Compendex indexed)

3. P. Daniel and K. Y. Sim, "Spectrum-based Fault Localization Tools with Test Case Preprocessor," IEEE Conference on Open Systems (ICOS), 2013, pp. 162-167.

4. P. Daniel and K. Y. Sim, "Spectrum-based Fault Localization: A Pair Scoring Approach," Journal of Industrial and Intelligent Information, Vol. 1, No. 4, December 2013, pp. 185-190. doi:10.12720/jiii.1.4.185-190

Table of Contents

1 Introduction .......................................................................... 1
   1.1 Problem Statement ................................................................ 3
   1.2 Objectives ....................................................................... 3
   1.3 Contributions .................................................................... 4
   1.4 Structure of the Thesis .......................................................... 5
2 Background and Literature Review ...................................................... 6
   2.1 Introduction ..................................................................... 6
   2.2 Spectrum-based Fault Localization ................................................ 7
   2.3 SBFL Metrics ..................................................................... 8
   2.4 Literature Review on Spectrum-based Fault Localization .......................... 15
3 The Effects of Extreme Composition of Pass and Fail Test Cases on the Accuracy of Spectrum-based Fault Localization ... 22
   3.1 Introduction .................................................................... 22
   3.2 Experimental Setup .............................................................. 24
   3.3 Experiment Results and Discussion ............................................... 26
   3.4 Conclusion ...................................................................... 35
4 Noise Reduction Schemes for Spectrum-based Fault Localization ........................ 36
   4.1 Introduction .................................................................... 36
   4.2 Software Artifacts Used for the Empirical Study ................................. 38
   4.3 Problems ........................................................................ 40
   4.4 Proposed Noise Reduction Schemes ................................................ 41
   4.5 Experiment Results .............................................................. 42
   4.6 Discussion ...................................................................... 47
   4.7 Conclusion ...................................................................... 49
5 Spectrum-based Fault Localization Tool with Test Case Pre-processor .................. 51
   5.1 Introduction .................................................................... 51
   5.2 Pre-processing Scheme ........................................................... 53
   5.3 SBFL Tool with Test Case Pre-processor .......................................... 54
   5.4 Case Study on the Siemens Test Suite ............................................ 58
   5.5 Conclusion ...................................................................... 60
6 A New Pair Scoring Metric for Spectrum-based Fault Localization ...................... 61
   6.1 Introduction .................................................................... 61
   6.2 Methodology of the New SBFL Metric .............................................. 61
   6.3 Experiment ...................................................................... 63
   6.4 Result Analysis ................................................................. 64
   6.5 Conclusion ...................................................................... 68
7 Conclusion ........................................................................... 70
   7.1 Threats to Validity, Limitations and Future Work ................................ 74
8 References ........................................................................... 76
9 Appendix ............................................................................. 84

List of Tables

Table 2.1 SBFL metrics ................................................................. 9
Table 2.2 Survey on testing objects ................................................... 13
Table 2.3 Siemens Test Suite .......................................................... 15
Table 3.1 Programs in the Siemens Test Suite .......................................... 25
Table 3.2 The accuracy of SBFL metrics (in pci) under extreme debugging scenarios with limited test cases ... 28
Table 3.3 The accuracy of SBFL metrics under extreme debugging scenarios (sorted by pci) ... 29
Table 3.4 "One fail all pass" scenario results for the 106 faulty versions in the Siemens Test Suite ... 31
Table 3.5 "One pass all fail" scenario results for the 106 faulty versions in the Siemens Test Suite ... 32
Table 3.6 "No pass all fail" scenario results for the 106 faulty versions in the Siemens Test Suite ... 33
Table 4.1 Programs in the Siemens Test Suite .......................................... 39
Table 4.2 The number of test cases removed by the proposed noise reduction schemes .... 43
Table 4.3 The pci of SBFL metrics under the proposed noise reduction schemes .......... 44
Table 4.4 Analysis of pci for 62 faulty versions with the best-performing NRS ......... 46
Table 4.5 Guide to choosing a noise reduction scheme for SBFL practitioners ........... 48
Table 5.1 Siemens Test Suite case study ............................................... 59
Table 6.1 Siemens Test Suite specifications ........................................... 64
Table 6.2 Average SBFL metric accuracy ................................................ 65
Table 6.3 Average SBFL metric accuracy ................................................ 66
Table 6.4 Comparison of pci between existing SBFL metrics and the Pair Scoring metric . 67
Table 9.1 print_tokens pci accuracy under three extreme scenarios ..................... 84
Table 9.2 print_tokens2 pci accuracy under three extreme scenarios .................... 85
Table 9.3 replace pci accuracy under three extreme scenarios .......................... 86
Table 9.4 schedule pci accuracy under three extreme scenarios ......................... 87
Table 9.5 schedule2 pci accuracy under three extreme scenarios ........................ 88
Table 9.6 tcas pci accuracy under three extreme scenarios ............................. 89
Table 9.7 tot_info pci accuracy under three extreme scenarios ......................... 90
Table 9.8 print_tokens pci accuracy with and without NRS applied ...................... 91
Table 9.9 print_tokens2 pci accuracy with and without NRS applied ..................... 92
Table 9.10 replace pci accuracy with and without NRS applied .......................... 93
Table 9.11 schedule pci accuracy with and without NRS applied ......................... 94
Table 9.12 schedule2 pci accuracy with and without NRS applied ........................ 95
Table 9.13 tcas pci accuracy with and without NRS applied ............................. 96
Table 9.14 tot_info pci accuracy with and without NRS applied ......................... 97
Table 9.15 Analysis of pci for 62 faulty versions with NRS 1 .......................... 98
Table 9.16 Analysis of pci for 62 faulty versions with NRS 2 .......................... 99
Table 9.17 Analysis of pci for 62 faulty versions with NRS 3 ......................... 100
Table 9.18 Analysis of pci for 62 faulty versions with NRS 4 ......................... 101
Table 9.19 Analysis of pci for 62 faulty versions with NRS 5 ......................... 102
Table 9.20 Analysis of pci for 62 faulty versions with NRS 6 ......................... 103
Table 9.21 Analysis of pci for 62 faulty versions with NRS 7 ......................... 104

List of Figures

Figure 2.1 SBFL coefficient and Jaccard metric calculation. ....................................................10

Figure 2.2 PCI illustration. ......................................................................................................11

Figure 3.1 Trade-off between the time consumed in software testing & debugging processes ...22

Figure 4.1 Simple Median program for noisy test case example. ..............................................37

Figure 5.1 Block Diagram of the Proposed SBFL Tool ............................................................52

Figure 5.2 UML Interaction Diagram ......................................................................................55

Figure 5.3 File Browser Screenshot .........................................................................................55

Figure 5.4 Test Case Pool Screenshot ......................................................................................56

Figure 5.5 SBFL result screenshot ...........................................................................................56

Figure 5.6 Log record of the tool .............................................................................................57

Figure 6.1 Basics of the Pair Scoring metric. ..................................................................62

Figure 6.2 Pair Scoring Equation. ............................................................................................62

GLOSSARY OF TERMS

SBFL (Spectrum-based Fault Localization): a fault localization technique that utilizes spectra, i.e. test case execution profiles.

Spectrum / Spectra: a test case execution profile that records the execution or non-execution of each line of code or statement in a program.

SBFL Metric: a mathematical equation used to calculate a suspiciousness score for each line of code or statement in a program during spectrum-based fault localization.

PCI (Percentage of Code Inspected): the percentage of code inspected to locate the faulty line of code; a measure of accuracy for SBFL metrics.

Test Case: an input to a program, used to test the program's conformance to its requirements and to gather an execution profile for spectrum-based fault localization.

Line of Code / Statement: a single line of program instructions.
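To make the PCI measure concrete, the sketch below computes it from a suspiciousness ranking. This is only an illustration assumed from the glossary definition, not the thesis' exact implementation; tie-breaking among equally ranked lines is ignored.

```python
def pci(ranking, faulty_line):
    """Percentage of Code Inspected before reaching the faulty line.

    ranking: line indices ordered from most to least suspicious.
    A lower PCI means the SBFL metric ranked the fault higher,
    i.e. the metric is more accurate for this program version.
    """
    inspected = ranking.index(faulty_line) + 1  # lines read so far
    return 100.0 * inspected / len(ranking)

# If the faulty line (index 3) is ranked 2nd out of 10 lines,
# 20% of the code is inspected before the fault is located.
print(pci([7, 3, 5, 0, 1, 2, 4, 6, 8, 9], faulty_line=3))  # 20.0
```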

1 Introduction

The term 'software' was introduced to the world in the 1950s. Software is a set of digital instructions executed by a physical object, the hardware, known as a computer. The two are bound together and cannot perform their function separately: hardware provides the environment and the physical interaction with humans, while software is the mastermind that responds to them. Like many other human inventions, software is created to assist humans in performing tasks such as calculation, record keeping, communication, and simulation, among others.

As the processing power of hardware advances, larger and more complex software has been developed to support human life. In modern days, pervasive software provides support and solutions in many areas of human life such as education, banking, the military, space, medicine, navigation, and entertainment, just to name a few. In many areas, physical information has been migrated to digital information, where software plays its role. Despite all these advancements and achievements in its development and deployment, software remains a human creation. Therefore, human errors remain pervasive in software. These errors create faults in software code, which may translate into software failures when the code is executed. For mission-critical software systems, failures can be costly. In sectors such as banking and investment, software failures could potentially cause unimaginable financial losses [60][95]. In other areas, such as the military and aviation, software failures may cause loss of human life.

Recent high-profile lawsuits against car manufacturers over software failures have raised public concern over the potential damage that software failures could cause. In 2010, the car manufacturer Toyota was sued over a fatal crash allegedly caused by a software failure in the car's systems. The lawsuit focused on the failure of the 'drive by wire' system, which replaces traditional mechanical and hydraulic control systems with electronic control systems that allow for more refined, computer-controlled acceleration. The suit claimed that the crash was caused by the electronic control systems suddenly accelerating the car to 100 miles per hour. In the same month, the same manufacturer issued a recall in the U.S. of approximately 133,000 Prius (and 14,500 Lexus) vehicles to update the software in the vehicles' antilock brake system (ABS) to patch a fault that caused uneven braking [17][18].

Given the pervasiveness of software in our daily life and the severity of the damage that can result from software failure, the quality of the software developed is of utmost concern. To ensure this quality, software testing and debugging are crucial stages of the software development life cycle, and both are time consuming and costly. In general, the software testing process aims to detect failures in the software under test. Software failures are caused by a 'bug', the lay term for faulty software code that has been triggered during execution, causing the software to either terminate unexpectedly (crash) or produce unexpected results.

Once a software failure is detected, the software debugging stage follows. In this stage, the software developers attempt to locate and fix the code that causes the failure. Fault localization is the main activity in software debugging: developers search and examine the software code to identify the faulty code that causes the failure. Without the aid of any debugging technique or tool, developers need to inspect the code line by line in order to locate the fault. This process is not only very time consuming but also requires a good understanding of the software code.

In the attempt to minimize human effort and locate faulty code efficiently, Spectrum-based Fault Localization (SBFL) has emerged as a promising technique and a key area of research in software debugging [1][9][90]. SBFL works by analyzing the code execution information (spectra) of pass and fail test cases, which is gathered during the software testing phase. These spectra are used to calculate a suspiciousness score for each line of code and to rank the lines accordingly; in other words, a higher-ranked line of code is more likely to be faulty. With this ranking information, software developers inspect the code from the highest-ranked line until the faulty code is located. A more accurate SBFL technique ranks the faulty line higher than a less accurate one does, thereby reducing the number of lines that need to be inspected before the faulty code is successfully located. Therefore, research efforts in the area of SBFL have focused on designing more accurate SBFL techniques to locate faulty code efficiently.
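As an illustration of this ranking step, the sketch below scores each line with the Jaccard metric, one of the SBFL metrics surveyed later in this thesis. The spectra representation and function name here are my own, chosen for the example, not the thesis' implementation.

```python
def jaccard_rank(spectra):
    """Rank lines by Jaccard suspiciousness: ef / (ef + nf + ep),
    where ef/ep count fail/pass tests that executed the line and
    nf counts fail tests that did not execute it.

    spectra: list of (coverage, verdict) pairs; coverage[i] is 1 if
    the test executed line i, and verdict is "pass" or "fail".
    """
    total_fail = sum(1 for _, v in spectra if v == "fail")
    num_lines = len(spectra[0][0])
    scores = []
    for line in range(num_lines):
        ef = sum(1 for cov, v in spectra if v == "fail" and cov[line])
        ep = sum(1 for cov, v in spectra if v == "pass" and cov[line])
        nf = total_fail - ef
        denom = ef + nf + ep
        scores.append((ef / denom if denom else 0.0, line))
    # Most suspicious first; developers inspect lines in this order.
    return [line for _, line in sorted(scores, reverse=True)]

# Line 2 is executed only by the failing test, so it ranks first.
spectra = [
    ([1, 1, 0], "pass"),
    ([1, 0, 1], "fail"),
]
print(jaccard_rank(spectra))  # [2, 0, 1]
```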

1.1 Problem Statement

As the processing power of hardware advances, larger and more complex software has been developed to support human life. Despite the advances made in software development technologies and tools, bugs still exist in program code because of human error. Locating the faulty line of code during the debugging process is a time-consuming task, especially for complex software that contains hundreds or thousands of lines of code. Therefore, the aim of this thesis is to improve the accuracy of spectrum-based fault localization (SBFL) in order to minimize the time and effort required to locate faulty code in a program during the debugging process.

1.2 Objectives

The overall aim of this thesis is to improve the accuracy of spectrum-based fault localization (SBFL) in order to minimize the time and effort required to locate faulty code in a program. Based on the literature, the accuracy of SBFL is highly dependent on the test case profiles and on the ranking metric used in the SBFL process. Therefore, this thesis attempts to improve the accuracy of SBFL from these two directions. The objectives of this thesis are:

1. To evaluate the effects of extreme compositions of pass and fail test cases on the accuracy of SBFL metrics.

2. To design and develop test case filtering techniques and tools to remove test cases with noisy spectra that may deteriorate the accuracy of SBFL metrics.

3. To develop a new SBFL metric that more accurately ranks the lines of code in a program according to their likelihood of being faulty.

1.3 Contributions

This thesis answers the following research questions and makes the following contributions towards improving the accuracy of Spectrum-based Fault Localization (SBFL) techniques:

1. Does an extreme composition of pass and fail test cases affect the accuracy of SBFL? In the most extreme scenarios, the debugging process may have to be conducted with only one fail test case, one pass test case, or no pass test case. These scenarios might occur due to extremely high or extremely low failure rates, or when software testers decide to stop running more test cases because of time and resource constraints. In view of this, we evaluated the accuracy of SBFL metrics in these extreme scenarios. From the experiment results, we discovered a convergence in the accuracy of SBFL metrics under these scenarios. We further discovered that some SBFL metrics perform better under these extreme scenarios.

2. Does the removal of noisy test cases improve the accuracy of SBFL metrics? Some SBFL metrics deliver better accuracy in extreme scenarios with limited pass test cases. This suggests that excessive pass test cases can contribute noisy spectra to the SBFL metrics. Spectra that contain duplicated, contradicting, or ambiguous information (noise) may deteriorate the accuracy of SBFL metrics. We proposed noise reduction schemes to filter test cases which provide duplicated, contradicting, or ambiguous information, and evaluated the resulting accuracy improvements in SBFL metrics.

3. Based on our findings in Contribution 2, we provide a simple guide for SBFL practitioners to select the best-performing noise reduction scheme to improve the accuracy of the SBFL metrics that they use.

4. We proposed and developed a novel SBFL tool with a test case preprocessor, which allows SBFL practitioners to apply the noise reduction schemes to pre-process and filter test cases with contradicting, duplicated, or ambiguous information prior to applying SBFL.

5. We proposed a new SBFL technique based on a pair scoring approach. This technique compares the execution paths of every possible pair of pass and fail test cases and assigns a score to each line of code according to its likelihood of being the faulty line. We formulated a new SBFL metric for this technique, evaluated its accuracy, and compared it with existing SBFL metrics. Despite its simplicity, we found that the proposed metric outperformed the majority of the existing SBFL metrics.
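As a hypothetical illustration of the simplest kind of noise reduction described above, the sketch below drops test cases whose spectrum exactly duplicates one already kept, so that redundant pass tests do not dominate the ranking. The actual schemes proposed in Chapter 4 may differ in detail; the function name and data format are chosen for this example only.

```python
def drop_duplicate_spectra(spectra):
    """Keep only the first test case for each distinct (coverage, verdict)
    spectrum; later exact duplicates add no new localization information."""
    seen = set()
    kept = []
    for coverage, verdict in spectra:
        key = (tuple(coverage), verdict)  # hashable identity of a spectrum
        if key not in seen:
            seen.add(key)
            kept.append((coverage, verdict))
    return kept

tests = [
    ([1, 0, 1], "pass"),
    ([1, 0, 1], "pass"),   # duplicate spectrum: filtered out
    ([1, 1, 0], "fail"),
]
print(len(drop_duplicate_spectra(tests)))  # 2
```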

1.4 Structure of the Thesis

The remainder of this thesis is organized as follows. Chapter 2 introduces the SBFL technique and the software artifacts used as the objects of the experiments in this thesis, and reviews the related literature. Chapter 3 presents a study of the accuracy of SBFL metrics in extreme scenarios with one fail test case, one pass test case, or no pass test case, and identifies the convergence and improvement of some SBFL metrics under these scenarios. In Chapter 4, we propose noise reduction schemes to filter test cases with contradicting, duplicated, or ambiguous information and evaluate the resulting accuracy improvements in SBFL metrics. Chapter 5 presents the SBFL tool with test case pre-processor that has been developed to deploy the noise reduction schemes of Chapter 4. A new SBFL metric based on "pair scoring" is presented in Chapter 6. Chapter 7 discusses the limitations of the studies in this thesis and future work, and concludes the thesis.

2 Background and Literature Review

2.1 Introduction

Software testing and debugging are key activities in the software development life cycle for improving the quality of the software developed. They are not only the most expensive but also the most time-consuming activities in the life cycle [34][15][30][69][88]. Software researchers have proposed various testing strategies and fault localization techniques in the effort to reduce the time and cost of testing and debugging.

Even though they are often viewed as a single quality assurance process, testing and debugging have distinct aims. Software testing aims to detect or reveal failures in the software developed through the execution of test cases. Software debugging, on the other hand, is meant to locate and correct the faulty code in the software which causes the failure to happen.

Mistakes which are inadvertently committed by software developers in program code may lead to software failures such as crashes, unexpected software behaviours or incorrect outputs [66]. The software testing process is conducted by executing a range of test cases (test inputs) on the software developed with the aim of detecting software failures. For a given test case, the software tester is able to classify the output of the software as pass or fail by comparing it with the software specification. Therefore, in software testing, time and manpower are required to design, generate and execute test cases, as well as to verify test outputs, in order to detect failures in the software under test.

However, the software testing process can only show the presence of failures; it cannot locate the faulty line of code that causes the failure. Once failures are detected, the next follow-up stage is software debugging, where the software developer attempts to locate and correct the faulty software code that is the cause of the failure [7][78]. The longer the software code, the more time it may take to locate the code that causes the failure. Good knowledge of the software code is essential to locate the faulty code. Therefore, the developers of the software themselves are normally the ones responsible for locating and fixing the faulty code [72]. Locating the faulty line of


code in the software is a time-consuming task, especially for software that contains thousands of lines of code. Hence, various fault localization techniques have been proposed to reduce the time required to locate the faulty line of code. One of the promising and widely studied fault localization techniques is called Spectrum-based Fault Localization (SBFL).

2.2 Spectrum-based Fault Localization

Spectrum-based Fault Localization (SBFL) works by utilizing the code execution records of test cases, commonly known as "spectra". A program spectrum is a record of the lines of code in a program which were executed (or not executed) by a test case. For each test case executed, a spectrum is recorded. In software testing, a collection of test cases is normally executed with the aim of revealing as many software failures as possible. From these test cases, a collection of spectra (the plural form of spectrum) can be gathered.

In software testing, a test case is executed on the software under test and its output is compared with the expected output. The test case is categorized as a pass test case if the output is the same as the expected output. On the other hand, if the output differs from the expected output or causes the program to fail or crash, the test case is categorized as a fail test case.

If a test case executes the faulty line of code, it may cause the software to crash or produce incorrect output(s). Intuitively, the faulty line of code will be executed by more fail test cases than pass test cases, and vice versa. Therefore, by analysing the spectra of pass and fail test cases, software developers can obtain useful information on how likely a line of code is to be faulty. As the spectra are used to locate the faulty line of code, this debugging technique is commonly known as spectrum-based fault localization (SBFL) or spectra debugging.

In SBFL, four coefficients are commonly computed as the spectrum for each line of code [1]. These coefficients are aef, anf, aep, and anp. The first coefficient, aef, represents the number of fail test cases that have executed that line of code, whereas the second coefficient, anf, represents the number of fail test cases that have not executed that line of code. Similarly, the third coefficient, aep, represents the number of pass test cases that have executed that line of code, whereas the last coefficient, anp, represents the number of pass test cases that have not executed that line of code. Intuitively, a faulty line of code will have high values for aef and anp and low values for anf and aep.
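The four coefficients can be computed mechanically from the spectra. The sketch below is a minimal illustration (the spectra matrix and test verdicts are hypothetical, not taken from Figure 2.1):

```python
# Minimal sketch: computing aef, anf, aep, anp for a statement from a spectra
# matrix. The data below is hypothetical, for illustration only.

# spectra[t][s] = 1 if test case t executed statement s, else 0
spectra = [
    [1, 1, 0],  # test 0 (pass)
    [1, 0, 1],  # test 1 (fail)
    [1, 1, 1],  # test 2 (fail)
]
verdicts = ["pass", "fail", "fail"]

def coefficients(stmt):
    aef = anf = aep = anp = 0
    for row, verdict in zip(spectra, verdicts):
        executed = row[stmt] == 1
        if verdict == "fail":
            aef, anf = aef + executed, anf + (not executed)
        else:
            aep, anp = aep + executed, anp + (not executed)
    return aef, anf, aep, anp

print(coefficients(2))  # statement 2: (aef, anf, aep, anp) = (2, 0, 0, 1)
```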

Based on the coefficients aef, anf, aep, and anp, an SBFL metric is used to compute a score for every line of code to rate its likelihood of being faulty. The lines of code are then ranked from the highest score to the lowest score. To locate the faulty line of code, a software developer will inspect the highest ranked line of code first, followed by the lower ranked lines of code, until the faulty line of code is located.

2.3 SBFL Metrics

An SBFL metric is a mathematical formula which makes use of one or more of the four SBFL coefficients to calculate a score for every line of code to indicate its likelihood of being faulty. Software debugging researchers have proposed various SBFL metrics [2][35][55][87] with the aim of producing the most accurate ranking for the faulty line of code. By ranking the faulty line of code as high as possible, an SBFL metric saves software developers' time in real-life situations, as fewer lines of code need to be inspected to locate the fault. Table 2.1 shows the list of SBFL metrics proposed in previous work which will be evaluated in the experiments conducted in this thesis.

Each SBFL metric makes use of at least one of the four SBFL coefficients, aef, anf, aep and anp, to calculate the score for each line of code, as illustrated in the sample program in Figure 2.1. This sample program is tested with six inputs (test cases), namely, a, b, c, d, e and f, where test cases a, d and e are pass test cases, while test cases b, c and f are fail test cases. For each test case, the value '1' indicates that the particular line of code is executed by the test case. On the other hand, '#' indicates that the particular line of code is not executed by the test case. A '-' is used to indicate a line of code which is either a blank line or non-executable. Based on the spectra gathered from the execution of these test cases, the SBFL coefficients aef, anf, aep and anp are then calculated. An SBFL metric, such as Jaccard, can be used to calculate a score indicating the likelihood for each line of code to be faulty. In the sample program in Figure 2.1, statement8 is the faulty line of code. It has been executed by two fail test cases (aef = 2) but


none of the pass test cases (aep = 0). It has not been executed by one of the fail test cases (anf = 1) nor by the three pass test cases (anp = 3). Hence, this line of code scores 0.67 on the Jaccard SBFL metric, making it the line of code with the highest score.
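The Jaccard computation just described can be sketched directly; plugging in the coefficients of statement8 (aef = 2, anf = 1, aep = 0) reproduces the 0.67 score:

```python
# Jaccard SBFL metric: aef / (aef + anf + aep), guarding against a zero
# denominator for lines never executed by any test case.

def jaccard(aef, anf, aep, anp):
    denom = aef + anf + aep
    return aef / denom if denom else 0.0

score = jaccard(2, 1, 0, 3)  # coefficients of statement8 in Figure 2.1
print(round(score, 2))       # 0.67
```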

Table 2.1 SBFL Metrics

Naish1 [70]: -1 if anf > 0; anp if anf = 0
Zoltar [26]: aef / (aef + anf + aep + 10000·anf·aep/aef)
Naish2 [70]: aef - aep / (aep + anp + 1)
Simple Matching [79]: (aef + anp) / (aef + anf + aep + anp)
Jaccard [42]: aef / (aef + anf + aep)
Sokal [62]: 2(aef + anp) / (2(aef + anp) + anf + aep)
Anderberg [6]: aef / (aef + 2(anf + aep))
Rogers & Tanimoto [74]: (aef + anp) / (aef + anp + 2(anf + aep))
Sorensen-Dice [80]: 2aef / (2aef + anf + aep)
Russel & Rao [76]: aef / (aef + anf + aep + anp)
Dice [80]: 2aef / (aef + anf + aep)
AMPLE [19]: |aef / (aef + anf) - aep / (aep + anp)|
QE [53]: aef / (aef + aep)
Tarantula [49]: (aef / (aef + anf)) / (aef / (aef + anf) + aep / (aep + anp))
Wong1 [83]: aef
CBI Inc. [58]: aef / (aef + aep) - (aef + anf) / (aef + anf + aep + anp)
Hamming etc. [35]: aef + anp
Ochiai [71]: aef / sqrt((aef + anf)(aef + aep))
Binary [70]: 0 if anf > 0; 1 if anf = 0
Euclid [51]: sqrt(aef + anp)
Kulczynski1 [62]: aef / (anf + aep)
AMPLE2 [35]: aef / (aef + anf) - aep / (aep + anp)
M1 [22]: (aef + anp) / (anf + aep)
M2 [22]: aef / (aef + anp + 2(anf + aep))
Wong3 [83]: aef - h, where h = aep if aep <= 2; h = 2 + 0.1(aep - 2) if 2 < aep <= 10; h = 2.8 + 0.001(aep - 10) if aep > 10
Ochiai2 [71]: aef·anp / sqrt((aef + aep)(anf + anp)(aef + anf)(aep + anp))
Arithmetic Mean [75]: (2·aef·anp - 2·anf·aep) / ((aef + aep)(anp + anf) + (aef + anf)(aep + anp))
Geometric Mean [65]: (aef·anp - anf·aep) / sqrt((aef + aep)(anp + anf)(aef + anf)(aep + anp))
Harmonic Mean [75]: ((aef·anp - anf·aep)·((aef + aep)(anp + anf) + (aef + anf)(aep + anp))) / ((aef + aep)(anp + anf)(aef + anf)(aep + anp))
Rogot2 [75]: (1/4)·(aef / (aef + aep) + aef / (aef + anf) + anp / (anp + aep) + anp / (anp + anf))
Cohen [16]: (2·aef·anp - 2·anf·aep) / ((aef + aep)(aep + anp) + (aef + anf)(anf + anp))


Figure 2.1 SBFL coefficient and Jaccard metric calculation.

A higher score indicates that the line of code is more likely to be faulty. On the other hand, a lower score indicates that the line of code is less likely to be faulty. During the debugging process, the SBFL metric scores are sorted from the highest to the lowest, as illustrated in Figure 2.2. Software developers will start inspecting the code from the line of code with the highest score. Therefore, an accurate SBFL metric will rank the faulty line of code as high as possible so that only a few lines of code need to be inspected before the faulty line is successfully located. Hence, the accuracy of an SBFL metric is commonly measured with the percentage of code inspected (pci) before the faulty line is successfully located. The calculation of pci is based on two elements: the fault ranking and the total lines of code (LoC). The fault ranking refers to the rank of the faulty line of code when the lines of code are sorted in descending order of SBFL metric score. The total LoC refers to the number of lines of code in the program.
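Written out, pci is the rank of the faulty line divided by the total LoC, expressed as a percentage. A minimal sketch consistent with this definition:

```python
# pci: percentage of code inspected before the faulty line is located,
# computed from the fault's rank and the program's total lines of code.

def pci(fault_rank, total_loc):
    return 100.0 * fault_rank / total_loc

# a fault ranked 5th in a 100-line program requires inspecting 5% of the code
print(pci(5, 100))  # 5.0
```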


Figure 2.2 PCI illustration.

In this thesis, the accuracy of SBFL metrics is measured with pci. The smaller the pci value, the more accurate the SBFL metric is. In general English, the term "accuracy" is commonly used to represent conformity to fact; precision; exactness; or the ability of a measurement to match the actual value of the quantity being measured. In the context of SBFL metrics, however, the most accurate SBFL metric will rank the faulty line of code as the first line of code to be inspected, while a less accurate SBFL metric will rank the faulty line of code lower, so that more lines of code must be inspected before it is located. Therefore, based on these descriptions, we use the term "accuracy" to represent how many lines of code need to be inspected by the software developer to locate the faulty line of code. The fewer lines of code that need to be inspected to locate the faulty line of code, the more accurate an SBFL metric is. In other related studies, the accuracy of SBFL metrics is also termed the "performance of SBFL metrics", which is represented by the EXAM score [56][86][87] and Expense [73], both of which are equivalent to the term pci used in this

thesis.

Testing Objects

As the program structure and the fault in every program are different, the accuracy of each SBFL metric may differ from one faulty program to another. This brings up a long-standing problem for empirical research in software engineering: there is no formal way to choose a representative sample of programs [64]. Therefore, the accuracy of SBFL metrics is normally evaluated on a suite of different programs, with each program having a number of faulty versions.

In order to survey the programs that are normally used to evaluate the accuracy of SBFL metrics, we have reviewed 40 papers related to fault localization based on program spectra. A summary of the programs that have been used to evaluate the accuracy of SBFL metrics is presented in Table 2.2. From this survey, we found that twelve programs, namely, print_tokens, print_tokens2, replace, schedule, schedule2, tcas, tot_info, space, gzip, sed, grep, and flex, are widely used (in at least seven out of the 40 papers surveyed) to evaluate the accuracy of SBFL metrics. Other programs which are only used once in isolated studies are categorized as Other. Out of the twelve widely used programs, the seven programs which belong to the "Siemens Test Suite" [41] (print_tokens, print_tokens2, replace, schedule, schedule2, tcas, and tot_info) have been used in 32 out of the 40 papers surveyed, making it the suite of programs most commonly used to evaluate the accuracy of SBFL metrics. The Siemens Test Suite has been used in the majority of the studies on SBFL in order to ease benchmarking and the comparison of accuracy.


Table 2.2 Survey on testing objects.

Testing objects (columns): print_tokens, print_tokens2, replace, schedule, schedule2, tcas, tot_info, space, gzip, sed, grep, flex, other*

Hutchins et al [41] | 1994 | ● ● ● ● ● ● ●
Harrold et al [38] | 1998 | ● ● ● ● ● ● ●
Harrold et al [39] | 2000 | ● ● ● ● ● ● ●
Jones, & Harrold [50] | 2005 | ● ● ● ● ● ● ●
Gupta, Zhang, & Gupta [33] | 2005 | ● ● ● ● ●
Do, Elbaum, & Rothermel [21] | 2005 | ● ● ● ● ● ● ● ● ● ● ● ● ●
Zhang, Gupta, & Gupta [92] | 2006 | ● ● ●
Hao et al [37] | 2006 | ● ● ● ● ● ● ●
Liu et al [59] | 2006 | ● ● ● ● ● ● ●
Abreu, Zoeteweij, & Van Gemund [1] | 2007 | ● ● ● ● ● ● ●
Yu, Jones, & Harrold [91] | 2008 | ● ● ● ● ● ● ● ●
Jeffrey, Gupta, & Gupta [45] | 2008 | ● ● ● ● ● ● ●
Naish, Lee, & Ramamohanarao [70] | 2009 | ● ● ● ● ● ● ● ●
Abreu et al [3] | 2009 |
Abreu, Zoeteweij, & Van Gemund [2] | 2009 | ● ● ● ● ● ● ● ● ● ●
Lee, Naish, & Ramamohanarao [54] | 2009 | ● ● ● ● ● ● ●
Dean et al [20] | 2009 | ● ● ● ● ● ● ● ●
Jiang et al [46] | 2009 | ● ● ● ● ● ● ● ● ● ● ●
Santelices et al [77] | 2009 | ● ● ● ● ● ●
Xie, Chen, & Xu [85] | 2010 | ● ● ● ● ● ● ●
Abreu, Gonzalez-Sanchez, & van Gemund [4] | 2010 | ● ● ● ● ● ● ● ● ● ●
Baah, Podgurski, & Harrold [8] | 2010 | ● ● ● ● ● ● ● ●
Jiang, & Chan [47] | 2010 | ● ● ● ● ● ● ●
Gonzalez-Sanchez et al [27] | 2010 | ● ● ● ● ● ● ●
Bandyopadhyay [10] | 2011 | ● ● ● ● ● ● ●
Bandyopadhyay & Ghosh [9] | 2011 | ● ● ● ● ● ● ●
Xie et al [86] | 2011 | ● ● ● ● ● ● ●
Alves et al [5] | 2011 | ● ● ● ● ● ●
Jiang, Chan, & Tse [48] | 2011 | ● ● ● ● ● ● ● ● ● ● ●
Gonzalez-Sanchez et al [28] | 2011 | ● ● ● ● ● ● ● ● ● ● ● ●
Gonzalez-Sanchez et al [29] | 2011 | ● ● ● ● ● ● ● ● ● ● ●
Wang et al [81] | 2011 | ● ● ● ● ● ● ●
Gong et al [24] | 2012 | ● ● ● ● ● ● ● ●
Bandyopadhyay [11] | 2012 | ● ● ● ● ● ● ●
Gong et al [25] | 2012 | ● ● ● ● ● ● ● ● ● ● ● ●
Wong, Debroy, and Xu [84] | 2012 | ● ● ● ● ● ● ● ● ●
Lei, Mao, & Chen [56] | 2013 | ● ● ● ● ● ● ●
Yoo, Harman, & Clark [89] | 2013 | ● ● ● ●
Ma et al [63] | 2013 | ● ● ● ● ●
You et al [90] | 2013 | ● ● ● ● ●


Table 2.3 Siemens Test Suite

Program Faulty Versions LOC Number of Test Cases Description

print_tokens 7 563 4130 Lexical analyser

print_tokens2 10 508 4115 Lexical analyser

replace 32 563 5542 Pattern recognition

schedule 9 410 2650 Priority scheduler

schedule2 10 307 2710 Priority scheduler

tcas 41 173 1608 Altitude separation

tot_info 23 406 1052 Information measure

In view of this, we use the seven programs in the Siemens Test Suite as our testing objects to evaluate the accuracy of SBFL metrics. The programs in the Siemens Test Suite can be downloaded from the Software-artifact Infrastructure Repository [21], which is maintained by the University of Nebraska-Lincoln at http://sir.unl.edu. For each of the seven programs in the Siemens Test Suite, there is one correct version and multiple faulty versions of the program. Table 2.3 lists the programs in the Siemens Test Suite, the total number of faulty versions of each program, the number of lines of code, the total number of test cases in the test suite and a brief description of each program. We have used GCC version 4.6.1 and Gcov (GNU-GCC), running on an Ubuntu 11.10 workstation, to collect the spectra information from these versions of the programs in the Siemens Test Suite.

2.4 Literature Review on Spectrum-based Fault Localization

Despite the advancements made in software development technologies and tools, bugs still exist in program code because of human errors. Over the past few years, research in fault localization has intensified in response to the increasing complexity of software systems. Spectrum-based Fault Localization (SBFL) has emerged as one of the promising techniques for fault localization [87]. New SBFL metrics, models, empirical studies, tools and theoretical analyses have been studied and proposed by researchers in the area of software engineering. These studies have been conducted with one common goal: to improve the efficiency of fault localization in the debugging process. In this section, we review recent literature on the state-of-the-art and the advancements made in the field of SBFL.

Based on our literature review, recent work related to SBFL can be broadly categorized into five categories:

1. Design and adoption of SBFL metrics,
2. Optimization of existing SBFL metrics,
3. Spectra / test suite manipulation,
4. Optimal SBFL metrics based on models and theoretical analysis, and
5. SBFL in the absence of test oracles.

The first category of research work in SBFL is the design and adoption of SBFL metrics. As presented in Table 2.1 in Section 2.3, a large number of SBFL metrics have been designed and adopted in the attempt to improve the accuracy of SBFL. Of these metrics, some, such as Naish1 [70], Naish2 [70], Wong1 [83], and Wong3 [83], were specifically designed for SBFL. Others, however, were initially designed for other application domains. For example, Jaccard [42] was initially designed for classification in the botany domain in 1901. Ochiai [71] was originally designed in 1957 to analyse the ecological relationships of different species of fishes in Japan. Hamming [35] was originally designed for error detection in codes in 1950. These metrics have later been adopted as SBFL metrics in the fault localization domain.

In addition to designing new SBFL metrics and adopting existing metrics, researchers in this area have also attempted to optimize the accuracy of existing SBFL metrics. Recently, You et al [90] proposed modified similarity coefficients for SBFL metrics. This approach is based on the concept that fail test cases may contribute more testing information than pass test cases. They implemented their approach on three of the widely used SBFL metrics, namely, Jaccard, Tarantula, and Ochiai, by modifying the weightings of the SBFL coefficients in these metrics. The results of their empirical study showed that the accuracy of the three modified SBFL metrics improved significantly compared to the corresponding original unmodified SBFL metrics. Therefore, they concluded that the accuracy of SBFL metrics can be improved by assigning higher weights to the SBFL coefficients related to fail test cases than to those related to pass test cases.
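The underlying idea can be sketched as follows; note that the weight w and this particular weighted form are illustrative assumptions, not the exact modified coefficients of You et al [90]:

```python
# Hypothetical sketch of a fail-weighted Jaccard-style metric: the fail-test
# coefficients are scaled by an assumed weight w > 1, so fail-test evidence
# counts more than pass-test evidence.

def weighted_jaccard(aef, anf, aep, anp, w=2.0):
    denom = w * (aef + anf) + aep
    return (w * aef) / denom if denom else 0.0

# With aef=2, anf=1, aep=0, anp=3 (statement8 from Figure 2.1):
print(weighted_jaccard(2, 1, 0, 3))
```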

Most of the efforts to improve the accuracy of SBFL have focused on the manipulation of test cases or the spectra gathered from test cases. An early study in 2007 by Abreu, Zoeteweij, and van Gemund [1] related the accuracy of SBFL metrics to test design. In [1], they proposed SBFL as a lightweight automated diagnosis technique that can be integrated with existing testing schemes. Using the Siemens Test Suite, they investigated fault localization accuracy as a function of several parameters, some of which are directly related to test design. Their results showed that the superior accuracy of certain SBFL metrics used to analyse the program spectra is largely independent of test design.

In 2011, a study by Bandyopadhyay and Ghosh [9] challenged the approach used in SBFL metrics such as Tarantula and Ochiai, where the suspiciousness score for each line of code in a program is calculated using the number of pass and fail test cases that execute that line of code. These approaches consider all test cases to be equally important. In contrast, their study showed that by using proximity-based weighting to calculate the relative importance of test cases, the accuracy of SBFL metrics can be improved.

In another paper, Bandyopadhyay [11] proposed approaches to predict and mitigate the effect of coincidental correctness in SBFL. A coincidentally correct test case is a test case that executes the faulty code but does not result in a failure. The presence of coincidentally correct test cases might negatively impact the accuracy of SBFL metrics. Two approaches were proposed to predict coincidentally correct test cases. In the first approach, weights were assigned to passing test cases such that test cases that are likely to be coincidentally correct obtain lower weights. These weights were then used to calculate the score for each line of code using an SBFL metric. In the second approach, coincidentally correct test cases were iteratively predicted and removed, and the suspiciousness score for each line of code was then calculated based on the reduced test suite. As a result, the accuracy of the SBFL metrics studied improved significantly when these two approaches were implemented.

In 2010, Xie, Chen, and Xu [85] proposed a refinement method to improve the accuracy of SBFL by separating the lines of code executed by fail test cases and pass test cases into two groups. As the fail test cases must have executed the faulty code, lines of code that were executed by fail test cases were categorized into a suspicious group, while the others were categorized into an unsuspicious group. SBFL metrics were only applied to calculate the scores for lines of code in the suspicious group, while those in the unsuspicious group were assigned the lowest score. Their empirical studies showed that the accuracy of SBFL metrics can be enhanced with this approach.
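A sketch of this partitioning step (the per-statement observations and the statement names s1-s3 are hypothetical, for illustration only):

```python
# Sketch of the refinement in [85]: any statement executed by at least one
# fail test case goes into the suspicious group; the rest are unsuspicious
# and would simply be assigned the lowest score.

# statement -> list of (verdict, executed) observations, hypothetical data
runs = {
    "s1": [("fail", 1), ("pass", 1)],
    "s2": [("fail", 0), ("pass", 1)],
    "s3": [("fail", 1), ("pass", 0)],
}

def partition(runs):
    suspicious = {s for s, obs in runs.items()
                  if any(v == "fail" and e for v, e in obs)}
    unsuspicious = set(runs) - suspicious
    return suspicious, unsuspicious

print(partition(runs))  # s1 and s3 are suspicious; s2 is not
```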

The effectiveness of using non-redundant test cases in SBFL has been studied by Lee, Naish, and Ramamohanarao. They presented an approach of using non-redundant test cases (test cases with unique coverage) to locate software bugs in a program. They showed that adding duplicates of non-redundant test cases negatively affects the stability and accuracy of spectra metrics, and concluded that the accuracy of SBFL metrics is better when using non-redundant test cases than when using redundant test cases [54]. The use of test case filtering was first studied by Leon, Masri, and Podgurski [57] in 2005 in an empirical evaluation of test case filtering. They evaluated several test case filtering techniques based on exercising complex information flows, including coverage-based and profile-distribution-based filtering techniques. They found that the effectiveness of distribution-based techniques did not depend strongly on the type of profiling used.

The emergence of adaptive random testing as an enhanced alternative to random testing has indirectly impacted testing and fault-localization strategies, as reported by Lou, Kuo and Chen [61] in 2012. Adaptive random testing has been shown to require fewer test cases than random testing to detect the first failure of the program. Once a failure is detected, testing may be terminated and fault localization may start immediately. This strategy implies that less information is available to provide hints on the location of the faulty code, because fewer fail test cases will be available for analysis. In many practical situations, testing will continue to be conducted after the detection of the first failure because multiple failures may exist in the program.

More recently, researchers in the area of SBFL have attempted to use model-based approaches and theoretical analysis to design and prove optimal SBFL metrics. Earlier work on model-based debugging was done in 2009 by Abreu et al [2], who proposed the combination of SBFL with a model-based debugging approach based on abstract interpretation within a framework named DEPUTO. Their results showed that the combined approach outperforms the individual approaches and other state-of-the-art automated debugging techniques. In some cases a single fault might not cause the failure to appear, but multiple faulty lines of code might. Therefore, they presented a framework to combine the best of both worlds, coined BARINEL [2]. Under this framework, a program is modelled using the abstraction of program traces, while Bayesian reasoning is used to deduce multiple-fault candidates and their probabilities. The experimental results showed that, on both synthetic and real software programs, BARINEL typically outperforms other SBFL approaches at a cost complexity that is only marginally higher.

In 2011, Naish, Lee, and Ramamohanarao [70] proposed a model of SBFL using a simple if-then-else (ITE) program with a single bug. The accuracy of different SBFL metrics can be evaluated under idealised conditions by examining the different possible execution paths through the ITE model program over a number of test cases. Based on their analysis of this simple model, groups of metrics that are equivalent in accuracy were identified. More importantly, two SBFL metrics (Naish1 and Naish2 in Table 2.1) with optimal accuracy were proposed based on this model.
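The two optimal metrics can be written compactly; the sketch below follows the formulas as commonly stated in the literature for [70], as an illustration rather than the authors' reference implementation:

```python
# Naish1 and Naish2 [70], as commonly stated in the SBFL literature.

def naish1(aef, anf, aep, anp):
    # Under a single-bug assumption, a statement missed by any fail test
    # cannot be the fault, so it is pushed to the bottom of the ranking.
    return -1 if anf > 0 else anp

def naish2(aef, anf, aep, anp):
    return aef - aep / (aep + anp + 1)

print(naish1(2, 0, 0, 3))  # 3
print(naish2(2, 1, 0, 3))  # 2 - 0/4 = 2.0
```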

Subsequently, Xie et al [87] presented a theoretical framework in 2013 to investigate the effectiveness of risk evaluation formulas (SBFL metrics). This theoretical analysis is based on the concept that the determinant of the accuracy of a metric is the number of lines of code with SBFL scores higher than the score of the faulty line of code. From their analysis, they grouped the SBFL metrics with equivalent accuracy into five groups. They further provided analytical proofs that, under a set of assumptions and generalizations, two groups of SBFL metrics will always deliver optimal accuracy compared to the other groups. However, they affirmed that theoretical analysis and empirical analysis are both essential and complementary to each other in software analysis and testing. Empirical study is useful for exposing interesting phenomena which may suggest generalizations that need to be verified by theoretical analysis. Conversely, not every problem can be completely solved by a theoretical approach: unresolved problems may be encountered during the theoretical analysis which are worthwhile and critical to study with an empirical approach.

In response to the theoretical proofs of optimal groups of SBFL metrics by Xie et al [87], Le, Thung, and Lo [52] conducted an empirical study to show that the theoretically optimal SBFL metrics may be less accurate in practical situations where the assumptions of the theoretical analysis do not hold. For example, they showed that Ochiai and Tarantula, which are theoretically non-optimal, were more accurate than the theoretically optimal SBFL metrics when the code coverage of the test cases is less than 100%. Therefore, despite the theoretical proofs on the optimality of certain SBFL metrics, the theoretically non-optimal SBFL metrics remain relevant and viable options for adoption in practice to locate the faulty line of code during the debugging process.

To enable practical application of SBFL by software developers in real-life projects, researchers in the area of SBFL have also developed numerous supporting tools. In 2009, Janssen, Abreu, and van Gemund [44] developed an automatic fault localization toolset, named Zoltar [43], to support the application of SBFL in practice. The toolset provides the infrastructure to automatically instrument the source code of software programs to produce runtime data. The primary advantage of the Zoltar toolset is the variety of types of bugs that can be located using the underlying SBFL technique. To enhance the applicability of the tool, support for automatic error detection is also provided through the ability to test programs at runtime without the need for manual invariant programming. In addition, the Zoltar toolset can also be used to trigger automatic recovery mechanisms, paving the way for fully automatic runtime system fault diagnosis and isolation. In 2012, Campos et al further developed Zoltar into GZoltar, an Eclipse IDE plug-in for testing and debugging [12].

The last category of research work in the area of SBFL is related to test oracles. Conventionally, all SBFL metrics assume the existence of a test oracle (the expected correct output for a given test input / test case) to help the tester classify every test case as either pass or fail. However, for many programs, a test oracle does not exist (that is, the correct outputs of these programs are unknown). Such programs were developed to find solutions to certain problems for which the solutions were unknown; if the solutions were known, there would be no need to develop the programs. Therefore, the assumption that test oracles exist for such programs is invalid. Recently, metamorphic testing [13][14][31][67][68] has been proposed to detect faults in programs in the absence of test oracles. Based on metamorphic testing, Xie et al [86] presented a novel concept called the metamorphic slice (mice) that allows SBFL techniques to be used even in the absence of a test oracle. A mice is a group of program slices which are bound together by a certain program property known as a metamorphic relation. This method can be applied to programs with the oracle problem, where conventional SBFL cannot be adopted. Their empirical study showed that the accuracy of their approach is similar to that of the conventional SBFL technique. In view of this, they concluded that test oracles are no longer mandatory for SBFL.


3 The Effects of Extreme Composition of Pass and Fail Test Cases on the Accuracy of Spectrum-based Fault Localization

3.1 Introduction

In the software testing activity, software testers are required to execute the software under test with test inputs (test cases) in order to detect and reveal failures in the software. The number of test cases executed during the testing process may vary. It depends on the complexity of the software input domain, the test case selection strategy, and the time and human resources available to run the tests. However, the number of test cases executed during software testing affects the accuracy of SBFL metrics [23][24][32][61][82][93].

Generally, more test case execution profiles give more information about the location of the faulty line of code. This in turn reduces the time consumed in the fault localization process to locate the faulty line of code. On the contrary, fewer test case execution profiles provide less information about the likely location of the faulty line of code, which in turn may increase the time consumed in the fault localization process to locate the faulty line of code.

Figure 3.1 Trade-off between the time consumed in software testing & debugging processes


This trade-off is illustrated in Figure 3.1. Assume that initially the time consumed by software testing to execute test cases is equal to the time spent on software debugging. Through various test set minimization strategies that reduce the number of test cases to be executed, software testers could reduce the time consumed during the testing process. However, fewer test cases lead to fewer execution profiles (spectra) and less information about the likely location of the faulty line of code. This, in turn, increases the time consumed in the fault localization process to locate the faulty line of code.

SBFL relies on line-of-code execution information (spectra) obtained from the executed test cases to rank lines of code according to their likeliness to be faulty. In the most extreme scenarios, fault localization may be started with only one fail test case, one pass test case or no pass test case, which may have adverse effects on the accuracy of SBFL metrics. In this chapter, the accuracy of SBFL metrics is evaluated under these extreme test case compositions in an attempt to answer the following research questions:

Research Question 1:

What is the accuracy of each SBFL metric under scenarios with extreme compositions of pass and fail test cases?

Research Question 2:

Which SBFL metrics have the highest accuracy under extreme compositions of pass and fail test cases?

A few reasons motivate us to evaluate and perform experiments on these extreme scenarios where only limited pass or fail test cases are available. Firstly, when gathering the spectra from the execution of test cases in the Siemens Test Suite, we observed that pass test cases dominated the collection of test cases in the test suite, while fail test cases were rare. In a related study on Adaptive Random Testing (ART) and debugging, Lou, Kuo and Chen


[61] reported that, as a test case minimization technique, Adaptive Random Testing can detect the first failure in the software under test with up to 50% fewer test cases than Pure Random Testing. Therefore, as soon as the first failure (first fail test case) is detected, it is possible for the testing process to be terminated and the fault localization process to be started. Even if the testing process is continued, there is no guarantee that more failures (fail test cases) will be detected, because of the low failure rate of the software, where a failure may only be triggered by a very specific combination of test inputs (test cases). Therefore, in practice, it is possible that fault localization has to be conducted with the extreme composition of only one fail test case and many pass test cases.

At the other extreme, it is also possible that the failure rate of the software under test is very high, where failures can easily be triggered by many test inputs (test cases). In such a scenario, fault localization may be started with one or no pass test cases and many fail test cases.

The accuracy of SBFL metrics under these extreme scenarios will be evaluated empirically. Based on the results of our empirical study, we will recommend SBFL metrics to be used for fault localization under each of these extreme scenarios.

3.2 Experimental Setup

In our empirical study, we use the seven programs from the Siemens Test Suite to evaluate the accuracy of SBFL metrics under extreme compositions of pass and fail test cases. All faulty versions of the seven programs in the Siemens Test Suite are used as the targets of testing and debugging. However, identical versions, which are faulty versions that produce identical outputs to the correct versions, are excluded because SBFL cannot be conducted without a fail test case. Furthermore, we focus our study on faulty versions with a single faulty line of code. As a result, we have excluded print_tokens {v4, v6} because these versions are identical to the correct version of the program. In addition, print_tokens {v1}, replace {v21}, schedule {v2, v7}, and tcas {v10, v11, v15, v31, v32, v33, v40} have also been excluded because multiple faulty lines of code exist in the program code. We have also excluded print_tokens {v2}, replace {v12}, tcas {v13, v14, v36, v38}, and tot_info {v6, v10, v19, v21} because the faulty line of code


is a non-executable line of code. Lastly, print_tokens2 {v10}, replace {v32}, and schedule2 {v9} have been excluded because these versions do not have any fail test case even though a faulty line of code exists in the program. In total, we use 106 faulty versions in our experiment to evaluate the accuracy of SBFL metrics under extreme compositions of pass and fail test cases. In our experiment, we first execute all test cases in the Siemens Test Suite to simulate the “normal scenario”, where the spectra of all pass and fail test cases are included to calculate the SBFL metric scores and obtain the percentage of code inspected (pci) to locate the faulty line of code. The lower the pci, the more accurate, and therefore the better performing, an SBFL metric is. Subsequently, we repeat the experiment under three scenarios with extreme compositions of pass and fail test cases.
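To make the measurement concrete, the sketch below computes suspiciousness scores from the four spectra coefficients (aef, aep, anf, anp) and derives the pci for a known faulty line. It is a minimal illustration only: the Ochiai formula is the standard one, but the toy spectra, the pessimistic tie handling and all function names are assumptions made for this example, not part of the experimental harness.

```python
import math

def ochiai(aef, anf, aep, anp):
    # Ochiai suspiciousness: aef / sqrt((aef + anf) * (aef + aep))
    denom = math.sqrt((aef + anf) * (aef + aep))
    return aef / denom if denom else 0.0

def pci(spectra, outcomes, faulty_line):
    """spectra[t][s] = 1 if test t executed statement s; outcomes[t] is 'pass'/'fail'."""
    n_stmts = len(spectra[0])
    scores = []
    for s in range(n_stmts):
        aef = sum(1 for t, row in enumerate(spectra) if row[s] and outcomes[t] == 'fail')
        aep = sum(1 for t, row in enumerate(spectra) if row[s] and outcomes[t] == 'pass')
        anf = sum(1 for t, row in enumerate(spectra) if not row[s] and outcomes[t] == 'fail')
        anp = sum(1 for t, row in enumerate(spectra) if not row[s] and outcomes[t] == 'pass')
        scores.append(ochiai(aef, anf, aep, anp))
    # pessimistic assumption: every line scoring >= the faulty line is inspected first
    inspected = sum(1 for sc in scores if sc >= scores[faulty_line])
    return 100.0 * inspected / n_stmts

spectra = [[1, 1, 0, 1],   # test 0
           [1, 0, 1, 1],   # test 1
           [1, 1, 1, 0]]   # test 2
outcomes = ['pass', 'fail', 'pass']
print(round(pci(spectra, outcomes, faulty_line=2), 2))  # → 50.0
```

In this toy run, two of the four lines score at least as high as the faulty line, so half the code must be inspected before the fault is reached.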

The first scenario with an extreme composition of pass and fail test cases is the “one fail all pass” scenario. To evaluate the accuracy of SBFL metrics in this scenario, we randomly select and use the spectra of one fail test case and all pass test cases provided in the Siemens Test Suite to calculate the SBFL metric scores and obtain the percentage of code inspected (pci) to locate the faulty line of code. This experiment is repeated until every fail test case has been selected, to obtain the average pci for every SBFL metric under this extreme scenario.
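The repetition just described can be sketched as a simple loop. This is a sketch only: pci_fn stands in for whatever scoring harness is used (a hypothetical parameter, not a function defined in this thesis).

```python
def avg_pci_one_fail_all_pass(spectra, outcomes, faulty_line, pci_fn):
    """Average pci over every choice of a single fail test case,
    keeping all pass test cases (the 'one fail all pass' scenario)."""
    pass_idx = [t for t, o in enumerate(outcomes) if o == 'pass']
    fail_idx = [t for t, o in enumerate(outcomes) if o == 'fail']
    results = []
    for f in fail_idx:                      # repeat until every fail test case is selected
        chosen = pass_idx + [f]
        sub_spectra = [spectra[t] for t in chosen]
        sub_outcomes = [outcomes[t] for t in chosen]
        results.append(pci_fn(sub_spectra, sub_outcomes, faulty_line))
    return sum(results) / len(results)
```

The “one pass all fail” scenario described next follows the same shape with the roles of pass and fail test cases swapped.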

Table 3.1 Programs in Siemens Test Suite

Program        Faulty Versions  LOC  Number of Test Cases  Description          Versions excluded in experiments
print_tokens   7                563  4130                  Lexical analyser     1, 2, 4, 6
print_tokens2  10               508  4115                  Lexical analyser     10
replace        32               563  5542                  Pattern recognition  12, 21, 32
schedule       9                410  2650                  Priority scheduler   2, 7
schedule2      10               307  2710                  Priority scheduler   9
tcas           41               173  1608                  Altitude separation  10, 11, 13, 14, 15, 31, 32, 33, 36, 38, 40
tot_info       23               406  1052                  Information measure  6, 10, 19, 21


The second scenario with an extreme composition of pass and fail test cases is the “one pass all fail” scenario. To evaluate the accuracy of SBFL metrics in this scenario, we randomly select and use the spectra of one pass test case and all fail test cases provided in the Siemens Test Suite to calculate the SBFL metric scores and obtain the percentage of code inspected (pci) to locate the faulty line of code. This experiment is repeated until every pass test case has been selected, to obtain the average pci for an SBFL metric under this extreme scenario.

The last scenario with an extreme composition of pass and fail test cases is the “no pass all fail” scenario. To evaluate the accuracy of SBFL metrics in this scenario, we remove all pass test cases and retain only the fail test cases for the calculation of the SBFL metric scores and the percentage of code inspected (pci) to locate the faulty line of code.

These three extreme scenarios emulate real-life situations where SBFL has to be conducted with only one fail test case, one pass test case or no pass test case due to test suite minimization, extremely high or extremely low failure rates, or when software testers decide to stop running more test cases due to time and resource constraints. The experiment results (pci) for each SBFL metric under these three extreme scenarios are then compared to the pci under the “normal scenario” to evaluate the effects of extreme compositions of pass and fail test cases on the accuracy of each SBFL metric.

The accuracy of each SBFL metric is evaluated based on its average pci for all 106 faulty

versions of the seven programs in Siemens Test Suite.

3.3 Experiment Results and Discussion

The experiment results are analysed in two different ways. Firstly, the analysis of the percentage of code inspected (pci) to locate the faulty line of code is done based on the average pci for the faulty versions of each program in the Siemens Test Suite. Based on the average pci for the faulty versions of each program, the overall average pci for all seven programs is then calculated. Secondly, the analysis is done based on the overall average pci for all 106 faulty versions in the Siemens Test Suite, without first calculating the average pci for the faulty versions of each program.


Table 3.2 presents the first analysis of our experiment results, where the average pci for each metric is calculated based on the average pci for the faulty versions of each program in the Siemens Test Suite. The first column of the table lists the SBFL metrics under study. The second column is the SBFL metric accuracy (in pci) for the “normal scenario”, where the spectra of all pass and fail test cases are included to calculate the SBFL metric scores. The third, fourth and fifth columns present the accuracy of the SBFL metrics under the three extreme scenarios, namely “one fail all pass”, “one pass all fail” and “no pass all fail”. The details of the average pci for the faulty versions of each program are available in the appendices, Table 9.1 to Table 9.7.

Based on the experiment results in Table 3.2, it can be observed that most of the SBFL metrics, except for Kulczynski1 and Ochiai2, have lower accuracies under the extreme “one fail all pass” scenario compared to the “normal scenario”. However, some SBFL metrics with moderate accuracy in the “normal scenario”, such as {Anderberg, Dice, Jaccard, Sorensen-Dice, QE, Tarantula, M2, Ochiai}, converged to the highest-accuracy group {Naish1, Naish2, Zoltar}, with a pci value of 8.16, under the extreme “one fail all pass” scenario.

Under the extreme “one pass all fail” scenario, it can be observed that the SBFL metrics {Jaccard, Anderberg, Sorensen-Dice, Dice, Simple_Matching, Sokal, Rogers&Tanimoto, Hamming_etc., Euclid, Wong3} have higher accuracies than in the “normal scenario”, whereas the SBFL metrics {Wong1, Russel&Rao, Binary} maintained the same accuracy and the remaining SBFL metrics have lower accuracies compared to the “normal scenario”.

Lastly, under the extreme “no pass all fail” scenario, most of the SBFL metrics converge into the same accuracy groups and mostly perform worse (have higher pci and therefore lower accuracy) compared to the “normal scenario”.

The best performing SBFL metrics (those with the lowest pci) under each of the three extreme debugging scenarios are also identified in Table 3.3. Under the “one fail all pass” scenario, the best performing SBFL metrics are {Naish1, Anderberg, Dice, Jaccard, Sorensen-Dice, QE, Tarantula, M2, Ochiai and Zoltar}.


Table 3.2 The accuracy of SBFL metrics (in pci) under extreme debugging scenarios with limited test cases.

SBFL Metrics NORMAL SCENARIO 1 Fail All Pass 1 Pass All Fail No Pass All Fail

Naish1 4.89 8.16 6.70 9.44

Naish2 4.80 8.17 6.62 9.36

Jaccard 7.73 8.16 6.62 9.36

Anderberg 7.73 8.16 6.62 9.36

Sorensen-Dice 7.73 8.16 6.62 9.36

Dice 7.73 8.16 6.62 9.36

Tarantula 8.05 8.16 10.62 9.36

QE 8.05 8.16 10.62 13.77

CBI_Inc. 9.03 9.14 29.68 13.77

Simple_Matching 14.94 15.40 6.60 9.36

Sokal 14.94 15.40 6.60 9.36

Rogers&Tanimoto 14.94 15.40 6.60 9.36

Hamming_etc. 14.94 15.40 6.60 9.36

Euclid 14.94 15.40 6.60 9.36

Wong1 9.36 11.78 9.36 9.36

Russel & Rao 9.36 11.78 9.36 9.36

Binary 9.44 11.78 9.44 9.44

Ochiai 6.23 8.16 6.62 9.36

M2 5.62 8.16 6.62 9.36

AMPLE2 9.36 11.78 88.69 9.36

Wong3 10.90 67.51 6.72 9.36

Arithmetic_Mean 7.73 9.15 28.19 9.36

Cohen 8.96 9.15 28.36 9.36

Kulczynski1 13.99 13.72 33.19 55.13

M1 21.30 21.76 34.38 55.13

Ochiai2 9.97 9.69 27.87 55.13

Zoltar 4.81 8.16 6.63 9.36

Ample 11.42 13.38 32.99 9.36

Geometric_Mean 7.96 9.15 28.38 9.36

Harmonic_Mean 8.28 10.57 30.29 9.36

Rogot2 8.28 10.57 30.29 9.36


Table 3.3 The accuracy of SBFL metrics under extreme debugging scenarios (sorted by pci).

“1 Fail All Pass” Scenario     “1 Pass All Fail” Scenario     “No Pass All Fail” Scenario
SBFL Metrics          pci      SBFL Metrics          pci      SBFL Metrics          pci
Naish1                8.16     Euclid                6.60     Ample                 9.36
Anderberg             8.16     Hamming_etc.          6.60     AMPLE2                9.36
Dice                  8.16     Rogers&Tanimoto       6.60     Arithmetic_Mean       9.36
Jaccard               8.16     Simple_Matching       6.60     Cohen                 9.36
Sorensen-Dice         8.16     Sokal                 6.60     Naish2                9.36
QE                    8.16     Naish2                6.62     Anderberg             9.36
Tarantula             8.16     Anderberg             6.62     Dice                  9.36
M2                    8.16     Dice                  6.62     Jaccard               9.36
Ochiai                8.16     Jaccard               6.62     Sorensen-Dice         9.36
Zoltar                8.16     Sorensen-Dice         6.62     Tarantula             9.36
Naish2                8.17     M2                    6.62     Euclid                9.36
CBI_Inc.              9.14     Ochiai                6.62     Hamming_etc.          9.36
Arithmetic_Mean       9.15     Zoltar                6.63     Rogers&Tanimoto       9.36
Cohen                 9.15     Naish1                6.70     Simple_Matching       9.36
Geometric_Mean        9.15     Wong3                 6.72     Sokal                 9.36
Ochiai2               9.69     Russel & Rao          9.36     Russel & Rao          9.36
Harmonic_Mean         10.57    Wong1                 9.36     Wong1                 9.36
Rogot2                10.57    Binary                9.44     Geometric_Mean        9.36
AMPLE2                11.78    QE                    10.62    Harmonic_Mean         9.36
Binary                11.78    Tarantula             10.62    M2                    9.36
Russel & Rao          11.78    Ochiai2               27.87    Ochiai                9.36
Wong1                 11.78    Arithmetic_Mean       28.19    Rogot2                9.36
Ample                 13.38    Cohen                 28.36    Wong3                 9.36
Kulczynski1           13.72    Geometric_Mean        28.38    Zoltar                9.36
Euclid                15.40    CBI_Inc.              29.68    Naish1                9.44
Hamming_etc.          15.40    Harmonic_Mean         30.29    Binary                9.44
Rogers&Tanimoto       15.40    Rogot2                30.29    CBI_Inc.              13.77
Simple_Matching       15.40    Ample                 32.99    QE                    13.77
Sokal                 15.40    Kulczynski1           33.19    Kulczynski1           55.13
M1                    21.76    M1                    34.38    M1                    55.13
Wong3                 67.51    AMPLE2                88.69    Ochiai2               55.13


On the other hand, under the “one pass all fail” scenario, the best performing SBFL metrics (those with the lowest pci) are {Euclid, Hamming_etc., Rogers&Tanimoto, Simple_Matching and Sokal}. Finally, under the “no pass all fail” scenario, the best performing SBFL metrics are {Ample, AMPLE2, Arithmetic_Mean, Cohen, Naish2, Anderberg, Dice, Jaccard, Sorensen-Dice, Tarantula, Euclid, Hamming_etc., Rogers&Tanimoto, Simple_Matching, Sokal, Russel & Rao, Wong1, Geometric_Mean, Harmonic_Mean, M2, Ochiai, Rogot2, Wong3 and Zoltar}. These findings can serve as a useful guide for SBFL practitioners in choosing the best performing SBFL metrics under these extreme scenarios of limited test cases.

Based on the results in Table 3.3, it can also be observed that the accuracies of the SBFL metrics converge into groups of identical pci values. This is not surprising, given that under an extreme scenario some spectra coefficients are forced to a narrow range of values. For example, under the “one pass all fail” scenario, the values of aep and anp can only be either 1 or 0. Similarly, under the “one fail all pass” scenario, the values of aef and anf can only be either 1 or 0. Lastly, under the “no pass all fail” scenario, aep and anp always have a value of 0. In these scenarios, groups of SBFL metrics will evaluate to the same value, or to values that rank all lines identically, hence resulting in groups of identical pci.
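For instance, with aep = anp = 0 the Ochiai formula reduces to a monotone transform of Jaccard’s aef / (aef + anf), so both metrics rank all lines identically and yield the same pci. The check below is a small sketch of that equivalence; the toy coefficient values are assumptions chosen for illustration.

```python
import math

# Under "no pass all fail", aep = anp = 0 for every line, so each metric
# collapses to a function of aef and anf alone.
def jaccard(aef, anf, aep):
    return aef / (aef + anf + aep)

def ochiai(aef, anf, aep):
    return aef / math.sqrt((aef + anf) * (aef + aep))

lines = [(3, 1), (2, 2), (4, 0)]          # toy (aef, anf) pairs, aep fixed at 0

def ranking(metric):
    # indices of lines sorted from most to least suspicious
    return sorted(range(len(lines)), key=lambda i: -metric(*lines[i], 0))

print(ranking(jaccard) == ranking(ochiai))  # → True: identical rankings, identical pci
```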

The second analysis of the experiment results is done based on the overall average pci for all 106 faulty versions in the Siemens Test Suite, without first calculating the average pci for the faulty versions of each program. Consistent with the first analysis, the second analysis shows that most SBFL metrics performed worse in the three extreme scenarios. To gain better insight into the accuracy of the SBFL metrics, we perform a detailed analysis of these results in Table 3.4, Table 3.5, and Table 3.6. In each of these tables, we report the percentage of faulty versions that recorded an improvement in pci as the “frequency of improvement”. The percentage of faulty versions that recorded no change in pci is shown under the “frequency of equivalent” column. Finally, the percentage of faulty versions that recorded a decline in pci is shown under the “frequency of decline” column. The average improvement and average decline in pci are shown in the last two columns of Table 3.4, Table 3.5, and Table 3.6.


Table 3.4 “One fail all pass” scenario results for the 106 faulty versions in the Siemens Test Suite

SBFL Metrics  Normal Scenario (pci)  One Fail All Pass Scenario (pci)  Frequency of Improvement (%)  Frequency of Equivalent (%)  Frequency of Decline (%)  Average Improvement  Average Decline

Naish1 5.32 8.23 2.83% 17.92% 79.25% 9.19 4.01

Naish2 5.15 8.25 2.83% 17.92% 79.25% 2.94 4.01

Jaccard 7.70 8.23 26.42% 17.92% 55.66% 1.39 1.61

Anderberg 7.70 8.23 26.42% 17.92% 55.66% 1.39 1.61

Sorensen-Dice 7.70 8.23 26.42% 17.92% 55.66% 1.39 1.61

Dice 7.70 8.23 26.42% 17.92% 55.66% 1.39 1.61

Tarantula 7.96 8.23 29.25% 17.92% 52.83% 1.49 1.34

QE 7.96 8.23 29.25% 17.92% 52.83% 1.49 1.34

CBI_Inc. 8.41 8.69 29.25% 17.92% 52.83% 1.49 1.37

Simple_Matching 16.10 16.55 7.55% 55.66% 36.79% 0.00 1.24

Sokal 16.10 16.55 7.55% 55.66% 36.79% 0.00 1.24

Rogers&Tanimoto 16.10 16.55 7.55% 55.66% 36.79% 0.00 1.24

Hamming_etc. 16.10 16.55 7.55% 55.66% 36.79% 0.00 1.24

Euclid 16.10 16.55 7.55% 55.66% 36.79% 0.00 1.24

Wong1 8.18 10.40 3.77% 17.92% 78.30% 2.55 2.96

Russel & Rao 8.18 10.40 3.77% 17.92% 78.30% 2.55 2.96

Binary 8.34 10.40 3.77% 17.92% 78.30% 6.78 2.96

Ochiai 6.43 8.23 5.66% 17.92% 76.42% 0.33 2.38

M2 5.85 8.23 1.89% 17.92% 80.19% 0.00 2.97

AMPLE2 8.18 10.40 3.77% 17.92% 78.30% 2.55 2.96

Wong3 11.16 70.76 1.89% 6.60% 91.51% 0.00 65.12

Arithmetic_Mean 7.41 8.70 14.15% 17.92% 67.92% 1.24 2.16

Cohen 8.36 8.70 29.25% 17.92% 52.83% 1.43 1.42

Kulczynski1 10.89 11.17 29.25% 16.98% 53.77% 2.08 1.65

M1 19.33 19.78 8.49% 54.72% 36.79% 0.06 1.24

Ochiai2 9.22 8.93 32.08% 17.92% 50.00% 3.04 1.38

Zoltar 5.17 8.23 2.83% 17.92% 79.25% 3.50 3.99

Ample 10.54 12.63 5.66% 28.30% 66.04% 0.66 3.22

Geometric_Mean 7.54 8.70 16.04% 17.92% 66.04% 1.47 2.11

Harmonic_Mean 7.72 9.84 18.87% 16.98% 64.15% 1.68 3.79

Rogot2 7.72 9.84 18.87% 16.98% 64.15% 1.68 3.79


Table 3.5 “One pass all fail” scenario results for the 106 faulty versions in the Siemens Test Suite

SBFL Metrics  Normal Scenario (pci)  One Pass All Fail Scenario (pci)  Frequency of Improvement (%)  Frequency of Equivalent (%)  Frequency of Decline (%)  Average Improvement  Average Decline

Naish1 5.32 6.59 30.19% 0.00% 69.81% 2.98 3.12

Naish2 5.15 6.43 30.19% 0.00% 69.81% 2.98 3.12

Jaccard 7.70 6.43 53.77% 0.00% 46.23% 5.03 3.11

Anderberg 7.70 6.43 53.77% 0.00% 46.23% 5.03 3.11

Sorensen-Dice 7.70 6.43 53.77% 0.00% 46.23% 5.03 3.11

Dice 7.70 6.43 53.77% 0.00% 46.23% 5.03 3.11

Tarantula 7.96 10.55 29.25% 0.00% 70.75% 4.00 5.32

QE 7.96 10.55 29.25% 0.00% 70.75% 4.00 5.32

CBI_Inc. 8.41 30.39 1.89% 0.00% 98.11% 5.93 22.52

Simple_Matching 16.10 6.39 84.91% 0.00% 15.09% 11.84 2.31

Sokal 16.10 6.39 84.91% 0.00% 15.09% 11.84 2.31

Rogers&Tanimoto 16.10 6.39 84.91% 0.00% 15.09% 11.84 2.31

Hamming_etc. 16.10 6.39 84.91% 0.00% 15.09% 11.84 2.31

Euclid 16.10 6.39 84.91% 0.00% 15.09% 11.84 2.31

Wong1 8.18 8.18 58.49% 2.83% 38.68% 0.00 0.00

Russel & Rao 8.18 8.18 58.49% 2.83% 38.68% 0.00 0.00

Binary 8.34 8.34 58.49% 2.83% 38.68% 0.00 0.00

Ochiai 6.43 6.43 44.34% 0.00% 55.66% 3.88 3.09

M2 5.85 6.43 36.79% 0.00% 63.21% 3.85 3.17

AMPLE2 8.18 86.63 0.00% 0.00% 100.00% 0.00 78.45

Wong3 11.16 6.62 35.85% 0.00% 64.15% 18.30 3.14

Arithmetic_Mean 7.41 28.65 0.94% 0.00% 99.06% 5.34 21.50

Cohen 8.36 28.98 1.89% 0.00% 98.11% 5.93 21.13

Kulczynski1 10.89 30.78 29.25% 0.00% 70.75% 7.85 31.35

M1 19.33 32.48 40.57% 0.00% 59.43% 13.75 31.53

Ochiai2 9.22 29.23 2.83% 0.00% 97.17% 31.14 21.50

Zoltar 5.17 6.45 30.19% 0.00% 69.81% 3.00 3.13

Ample 10.54 32.91 1.89% 0.00% 98.11% 4.33 22.89

Geometric_Mean 7.54 29.02 0.94% 0.00% 99.06% 9.25 21.77

Harmonic_Mean 7.72 30.61 2.83% 0.00% 97.17% 4.71 23.69

Rogot2 7.72 30.61 2.83% 0.00% 97.17% 4.71 23.69


Table 3.6 “No pass all fail” scenario results for the 106 faulty versions in the Siemens Test Suite

SBFL Metrics  Normal Scenario (pci)  No Pass All Fail Scenario (pci)  Frequency of Improvement (%)  Frequency of Equivalent (%)  Frequency of Decline (%)  Average Improvement  Average Decline

Naish1 5.32 8.34 25.47% 1.89% 72.64% 4.93 5.89

Naish2 5.15 8.18 25.47% 0.94% 73.58% 4.93 5.82

Jaccard 7.70 8.18 45.28% 0.94% 53.77% 6.52 6.38

Anderberg 7.70 8.18 45.28% 0.94% 53.77% 6.52 6.38

Sorensen-Dice 7.70 8.18 45.28% 0.94% 53.77% 6.52 6.38

Dice 7.70 8.18 45.28% 0.94% 53.77% 6.52 6.38

Tarantula 7.96 8.18 46.23% 0.94% 52.83% 6.73 6.31

QE 7.96 12.65 33.02% 0.94% 66.04% 6.17 10.19

CBI_Inc. 8.41 12.65 33.02% 0.94% 66.04% 7.53 10.19

Simple_Matching 16.10 8.18 73.58% 0.00% 26.42% 13.12 6.59

Sokal 16.10 8.18 73.58% 0.00% 26.42% 13.12 6.59

Rogers&Tanimoto 16.10 8.18 73.58% 0.00% 26.42% 13.12 6.59

Hamming_etc. 16.10 8.18 73.58% 0.00% 26.42% 13.12 6.59

Euclid 16.10 8.18 73.58% 0.00% 26.42% 13.12 6.59

Wong1 8.18 8.18 0.00% 100.00% 0.00% 0.00 0.00

Russel & Rao 8.18 8.18 0.00% 100.00% 0.00% 0.00 0.00

Binary 8.34 8.34 0.00% 100.00% 0.00% 0.00 0.00

Ochiai 6.43 8.18 36.79% 0.94% 62.26% 5.48 6.04

M2 5.85 8.18 32.08% 0.00% 67.92% 5.49 6.03

AMPLE2 8.18 8.18 0.00% 100.00% 0.00% 0.00 0.00

Wong3 11.16 8.18 32.08% 0.94% 66.98% 21.66 5.92

Arithmetic_Mean 7.41 8.18 41.51% 0.94% 57.55% 6.86 6.29

Cohen 8.36 8.18 46.23% 0.94% 52.83% 7.66 6.35

Kulczynski1 10.89 52.86 6.60% 0.00% 93.40% 6.63 45.41

M1 19.33 52.86 9.43% 0.00% 90.57% 10.04 38.08

Ochiai2 9.22 52.86 2.83% 0.00% 97.17% 11.63 45.25

Zoltar 5.17 8.18 26.42% 0.00% 73.58% 4.77 5.80

Ample 10.54 8.18 50.00% 0.00% 50.00% 11.65 6.92

Geometric_Mean 7.54 8.18 40.57% 0.94% 58.49% 7.45 6.25

Harmonic_Mean 7.72 8.18 40.57% 0.00% 59.43% 7.89 6.15

Rogot2 7.72 8.18 40.57% 0.00% 59.43% 7.89 6.15


Based on the information in these three tables, it can be observed that the average pci values differ from those in the first analysis. This is because, in the first analysis, we average the pci values of all faulty versions of each program and, based on these averages, calculate the overall average result for the seven programs. However, in the second analysis presented in Table 3.4, Table 3.5, and Table 3.6, we treat each faulty version as an individual program and calculate the average pci over all 106 faulty versions.

The second analysis in Table 3.4, Table 3.5, and Table 3.6 is consistent with, and confirms, the first analysis in Table 3.3. In Table 3.4, it can be observed that the best performing SBFL metrics (those with the lowest pci) under the “one fail all pass” scenario are {Naish1, Anderberg, Dice, Jaccard, Sorensen-Dice, QE, Tarantula, M2, Ochiai and Zoltar}, which is consistent with the findings from the first analysis in Table 3.3. In Table 3.5, the best performing SBFL metrics in the “one pass all fail” scenario are {Euclid, Hamming_etc., Rogers&Tanimoto, Simple_Matching and Sokal}, which is also consistent with the findings in Table 3.3. Finally, from Table 3.6, it can be observed that the best performing SBFL metrics under the “no pass all fail” scenario are {Ample, AMPLE2, Arithmetic_Mean, Cohen, Naish2, Anderberg, Dice, Jaccard, Sorensen-Dice, Tarantula, Euclid, Hamming_etc., Rogers&Tanimoto, Simple_Matching, Sokal, Russel & Rao, Wong1, Geometric_Mean, Harmonic_Mean, M2, Ochiai, Rogot2, Wong3 and Zoltar}. This is also consistent with the findings from the first analysis in Table 3.3. This is mainly because there is no pass test case in this scenario, which causes aep and anp to be zero. Therefore, all SBFL metrics simplify to expressions involving only aef and anf.

Interestingly, from Table 3.5 and Table 3.6, it can be observed that there are groups (shaded grey in these tables) of SBFL metrics that performed better (have lower pci) under the “one pass all fail” scenario and “no pass all fail” scenario respectively. From Table 3.5, it can be observed that {Jaccard, Anderberg, Sorensen-Dice, Dice, Simple_Matching, Sokal, Rogers&Tanimoto, Hamming_etc. and Euclid} actually performed better (have lower pci) in the “one pass all fail” scenario than in the “normal scenario”,


where all pass and fail test cases are used to calculate the SBFL metric scores. From Table 3.6, it can be observed that {Simple_Matching, Sokal, Rogers&Tanimoto, Hamming_etc., Euclid, Wong3 and Ample} actually performed better (have lower pci) in the “no pass all fail” scenario than in the “normal scenario”. These interesting observations suggest that the absence of excessive pass test cases may reduce noisy spectra and improve the accuracy of these groups of SBFL metrics. This finding is significant because noise reduction schemes can easily be designed to filter or remove such pass test cases in order to improve the accuracy of these SBFL metrics.

3.4 Conclusion

In order to save time and cost in the software testing phase, software testers may adopt test minimization strategies to reduce the number of test cases. These savings at the testing phase may come at the expense of the accuracy of SBFL metrics in the debugging process. In the extreme scenarios, the fault localization process may be started with only one fail test case, one pass test case or no pass test case. In addition to constraints on the time and cost of testing, these scenarios also occur due to extremely high or extremely low failure rates.

However, limited test case execution profiles may reduce the accuracy of SBFL metrics. In view of this, we evaluated the accuracy of SBFL metrics in these extreme scenarios (to answer Research Question 1) and identified the best performing SBFL metrics for each of these scenarios (to answer Research Question 2). From the experiment results, we further discovered the convergence in accuracy of SBFL metrics under these extreme scenarios.

An interesting observation from these experiments is that certain groups of SBFL metrics performed better under the “one pass all fail” and “no pass all fail” scenarios than under the “normal scenario”. This suggests that the absence of excessive pass test cases may reduce noisy spectra and improve the accuracy of these groups of SBFL metrics. This motivates us to further study, in the next chapter, potential noise reduction schemes that can filter or remove test cases with noisy spectra, with the objective of improving the accuracy of SBFL metrics.


4 Noise Reduction Schemes for Spectrum-based Fault Localization

4.1 Introduction

The observations in Chapter 3 suggested that the absence of excessive pass test cases may reduce noisy spectra and improve the accuracy of Spectrum-based Fault Localization (SBFL) metrics. This motivates us to further explore ways to reduce noisy spectra by removing or filtering test cases that may provide contradicting, duplicated or ambiguous information. We further investigated the pattern of the spectra and found that, similar to the findings made in [24][54], it is common to have a large number of test cases with identical spectra. Furthermore, it is also common to have test case imbalance, where most test cases in a test suite are pass test cases. In other words, the spectra of pass test cases dominate the spectra collection.

In addition to the above, another interesting pattern we found is that many pass and fail test cases share identical spectra (execute the same lines of code). The spectra of these test cases therefore present ambiguous and contradicting information for SBFL. We hypothesize that test cases with duplicated, ambiguous and contradicting information in their spectra could deteriorate the accuracy of SBFL metrics, because these spectra are noise to the SBFL metrics. Therefore, we design noise reduction schemes that focus on eliminating these noisy test cases in order to improve the accuracy of SBFL metrics.
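As a first approximation of such a scheme, the sketch below drops pass test cases whose spectrum coincides with a fail test case’s spectrum (contradicting information) and keeps a single copy of duplicated pass spectra. This is only an illustrative sketch of the idea motivated above, not one of the schemes defined in this chapter, and all names in it are assumptions.

```python
def filter_noisy_pass_tests(spectra, outcomes):
    """Sketch: remove pass test cases whose spectrum matches a fail test case's
    spectrum (contradicting information), and deduplicate identical pass spectra."""
    fail_spectra = {tuple(s) for s, o in zip(spectra, outcomes) if o == 'fail'}
    seen_pass = set()
    kept = []
    for s, o in zip(spectra, outcomes):
        key = tuple(s)
        if o == 'pass':
            if key in fail_spectra or key in seen_pass:
                continue            # contradicting or duplicated pass spectrum: drop
            seen_pass.add(key)
        kept.append((s, o))
    return kept
```

Applied to the example of Figure 4.1 below, such a filter would discard the pass members of the ambiguous pairs {a, b} and {h, i} before the SBFL metric scores are calculated.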

Figure 4.1 presents a motivating example of the presence of noisy spectra. Nine test cases {a, b, …, i} have been executed on the buggy version of a program. From the spectra of the test cases executed on the buggy version, there are two pairs of test cases that produce identical spectra. The first pair is test cases {a, b} and the second pair is test cases {h, i}. Test case a is a pass test case, as it has the same output as the correct version. However, test case b is a fail test case, because its output from the faulty version differs from that of the correct version. Since test cases a (a pass test case) and b (a fail test case) share the same spectrum, they provide

Page 50: On improving the accuracy of spectrum-based fault localization€¦ · On Improving the Accuracy of Spectrum-based Fault Localization . Patrick Daniel . Master of Science . 2014

37

ambiguous and contradicting information to the SBFL metric. The same observation can be

drawn from the second pair of test cases {h, i}. In other words, the spectra from these two pairs

of test cases are essentially noise for SBFL metrics which in turn may deteriorate the accuracy

of SBFL.
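To make the ambiguity concrete, the sketch below computes a standard SBFL suspiciousness score (the Tarantula formula, which this thesis studies) for a single line of code. The spectra counts used in the example are illustrative, not taken from Figure 4.1; they simply show that a pass/fail pair with identical coverage pushes every covered line toward the uninformative middle score.

```python
# Sketch: Tarantula suspiciousness for one line of code, given how many
# fail/pass test cases execute it. Illustrative inputs only.

def tarantula(n_ef, n_ep, total_fail, total_pass):
    """n_ef / n_ep: number of fail / pass test cases executing the line."""
    if n_ef == 0:
        return 0.0
    fail_ratio = n_ef / total_fail
    pass_ratio = n_ep / total_pass if total_pass else 0.0
    return fail_ratio / (fail_ratio + pass_ratio)

# A pass test case and a fail test case with identical spectra add +1 to
# both n_ep and n_ef for every line they cover, so the pair drags all of
# those lines toward the uninformative score 0.5.
print(tarantula(n_ef=1, n_ep=1, total_fail=1, total_pass=1))  # 0.5
print(tarantula(n_ef=2, n_ep=0, total_fail=2, total_pass=5))  # 1.0
```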

Figure 4.1 Simple Median program for noisy test case example.

[Figure 4.1 shows the original and buggy versions of a simple median program side by side, together with the inputs, outputs and line-level spectra of nine test cases {a, ..., i}. The buggy version, reproduced below, differs from the original only in the first condition, where < has been replaced by > (marked /*!!!*/).]

    #include <stdio.h>

    int main (int argc, char *argv[])
    {
        int med;
        med = atoi(argv[3]);
        if (atoi(argv[2]) > atoi(argv[3])) {   /*!!!*/  /* original version: < */
            if (atoi(argv[1]) < atoi(argv[2])) {
                med = atoi(argv[2]);
            } else {
                if (atoi(argv[1]) < atoi(argv[3])) {
                    med = atoi(argv[1]);
                }
            }
        } else {
            if (atoi(argv[1]) > atoi(argv[2])) {
                med = atoi(argv[2]);
            } else {
                if (atoi(argv[1]) > atoi(argv[3])) {
                    med = atoi(argv[1]);
                }
            }
        }
        printf("med = %d", med);
    }

    Test case         a  b  c  d  e  f  g  h  i
    argv[1]           2  2  2  2  2  2  2  3  3
    argv[2]           3  3  4  4  4  4  4  0  0
    argv[3]           3  4  0  1  2  3  4  0  1
    Original output   3  3  2  2  2  3  4  0  1
    Buggy output      3  4  4  4  4  4  4  0  0
    Failure           0  1  1  1  1  1  0  0  1

[The figure also records, for each line of code, which test cases executed it; in the buggy version, the test case pairs {a, b} and {h, i} each share an identical spectrum.]


From this example, it is evident that the accuracy of SBFL is affected not only by the effectiveness of the SBFL metric, but also by the noisiness of the spectra collected from the executed test cases. In this chapter, we propose a suite of noise reduction schemes as a pre-processor to filter out the noisy test cases before their spectra are used in the calculations of the SBFL metrics. With the removal of noisy test cases through these noise reduction schemes, it is envisaged that the accuracy of SBFL metrics will improve further, reducing the cost and time spent to locate the faulty line of code.

Previous studies found that different test cases may have similar or identical execution paths (spectra) [36][37][73]. Similar or identical spectra might have a negative effect on the accuracy of SBFL metrics, especially in situations where coincidental correctness occurs [40]. Therefore, Zhao et al. [94] proposed the PAFL technique to alleviate the impact of coincidental correctness. They used the concept of a coverage vector to count distinct spectra, then calculated the failing rate for each coverage vector, and finally refined the execution spectra to reduce noise.

This chapter makes the following contributions:

1. We propose six noise reduction schemes aimed at improving the accuracy of SBFL metrics.

2. We empirically evaluate the changes in accuracy for over 30 SBFL metrics as a result of applying the proposed noise reduction schemes.

3. We propose a guide for practitioners to select the best performing noise reduction scheme for the SBFL metrics that they are using.

4.2 Software Artifacts Used for Empirical Study

For the experiments conducted in this chapter, we have executed all test cases for each of the seven programs in the Siemens Test Suite. For the faulty programs of print_tokens, {v4, v6} have been excluded because no faulty code is found in these versions. We focus our study on single-fault versions. Therefore, print_tokens {v1}, replace {v21}, schedule {v2, v7}, and tcas {v10, v11, v15, v31, v32, v33, v40} have been excluded because multiple faulty lines of code exist in these versions. We have also excluded print_tokens {v2}, replace {v12}, tcas {v13, v14, v36, v38}, and tot_info {v6, v10, v19, v21} because the faulty line of code is a non-executable line of code. In addition, print_tokens2 {v10}, replace {v32}, and schedule2 {v9} have been excluded because there is no fail test case (that is, no failure is detected) even though a faulty line of code exists in the program. We have also excluded the versions which do not contain any duplicate or contradicting test cases, namely print_tokens {v3, v7}, print_tokens2 {v2, v5}, replace {v15, v17, v19, v20}, schedule {v1, v6, v9}, schedule2 {v4}, tcas {v1, v4, v20, v21, v24, v25, v39, v41}, and tot_info {v1, v11, v15}. Lastly, schedule {v8}, tcas {v2, v3, v6, v7, v8, v9, v16, v17, v18, v19, v22, v23, v26, v28, v29, v30, v35, v37} and tot_info {v3, v14} have been excluded because all fail test cases are eliminated by one of the proposed noise reduction schemes. In summary, from a total of 132 versions of faulty programs, 70 versions have been excluded, leaving 62 versions to be used in the experiment.

Table 4.1 Programs in Siemens Test Suite.

    Program        Faulty    LOC   No. of       Description          Versions excluded
                   Versions        Test Cases                        in experiments
    print_tokens   7         563   4130         Lexical analyser     1, 2, 3, 4, 6, 7
    print_tokens2  10        508   4115         Lexical analyser     2, 5, 10
    replace        32        563   5542         Pattern recognition  12, 15, 17, 19, 20, 21, 32
    schedule       9         410   2650         Priority scheduler   1, 2, 6, 7, 8, 9
    schedule2      10        307   2710         Priority scheduler   4, 9
    tcas           41        173   1608         Altitude separation  1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14,
                                                                     15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
                                                                     25, 26, 28, 29, 30, 31, 32, 33, 35, 36,
                                                                     37, 38, 39, 40, 41
    tot_info       23        406   1052         Information measure  1, 3, 6, 10, 11, 15, 16, 19, 21


4.3 Problems

From the analysis conducted on the spectra of the faulty programs in the Siemens Test Suite, we have observed that, in many versions of the faulty programs, there exist test cases that have identical spectra (code execution coverage records) even though the test inputs are different. These observations on test cases with identical spectra can be divided into three categories.

1. A fail test case and a pass test case share the same spectrum for a faulty program. In this case, they provide ambiguous and contradicting information to the SBFL metric on whether or not the lines of code executed by this pair of test cases are likely to be faulty. Hence, the spectra from this pair of test cases are essentially noise for SBFL metrics, which in turn may deteriorate the accuracy of SBFL.

2. More than one fail test case shares the same spectrum for a faulty program. In this case, they provide duplicated information to the SBFL metrics on all the lines of code executed by this set of fail test cases. This duplicated information is essentially noise against the correct lines of code executed by this set of fail test cases, which in turn may deteriorate the accuracy of SBFL.

3. More than one pass test case shares the same spectrum for a faulty program. In this case, they provide duplicated information to the SBFL metrics on all the lines of code executed by this set of pass test cases. This duplicated information is essentially noise against the faulty lines of code executed by this set of pass test cases, which in turn may deteriorate the accuracy of SBFL.

Based on these observations, it is evident that the duplicated, ambiguous and contradicting information present in test cases with identical spectra may cause inadvertent deterioration in the accuracy of SBFL metrics.


4.4 Proposed Noise Reduction Schemes

To tackle the problems described in Section 4.3, we propose six new noise reduction schemes that eliminate test cases with identical spectra by comparing the execution coverage information between test cases. The noise reduction schemes are:

1. Noise Reduction Scheme 1 (NRS1) – for each fail test case, all pass test cases with a spectrum identical to that of the fail test case are removed and eliminated from the calculations of SBFL metrics.

2. Noise Reduction Scheme 2 (NRS2) – for each pass test case, all fail test cases with a spectrum identical to that of the pass test case are removed and eliminated from the calculations of SBFL metrics. Note that this noise reduction scheme may not be used if it results in all fail test cases being removed from the calculation of SBFL metrics, because SBFL requires at least one fail test case to be present.

3. Noise Reduction Scheme 3 (NRS3) – for each set of pass and fail test cases with identical spectra, all test cases (pass and fail) in the set are removed and eliminated from the calculations of SBFL metrics. Note that this noise reduction scheme may not be used if it results in all fail test cases being removed from the calculation of SBFL metrics, because SBFL requires at least one fail test case to be present.

4. Noise Reduction Scheme 4 (NRS4) – this technique was proposed by Lee, Naish and Ramamohanarao in [54] in a related study on the effect of using non-redundant test cases for SBFL. For each set of pass test cases with identical spectra, all except one test case are removed and eliminated from the calculations of SBFL metrics. Similarly, for each set of fail test cases with identical spectra, all except one test case are removed and eliminated from the calculations of SBFL metrics.

5. Noise Reduction Scheme 5 (NRS5) – this noise reduction scheme is a cascading of NRS4 and NRS1. In other words, NRS4 is applied to the test cases before NRS1.

6. Noise Reduction Scheme 6 (NRS6) – this noise reduction scheme is a cascading of NRS4 and NRS2. In other words, NRS4 is applied to the test cases before NRS2.

7. Noise Reduction Scheme 7 (NRS7) – this noise reduction scheme is a cascading of NRS4 and NRS3. In other words, NRS4 is applied to the test cases before NRS3.

Each of these noise reduction schemes is applied to the executed test suite (the set of test cases executed) of the faulty program to remove and eliminate test cases with identical spectra, with the aim of improving the accuracy of SBFL metrics.
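The filtering rules above can be sketched compactly. The following is a minimal illustration, assuming a test case is represented as a pair (spectrum, passed), where the spectrum is a tuple of 0/1 coverage flags per line; the function names are ours and not from the thesis tool.

```python
# Hedged sketches of NRS1, NRS3, NRS4 and the cascaded NRS7.
# A test case is (spectrum, passed); spectrum is a tuple of 0/1 flags.

def nrs1(tests):
    """Remove pass test cases whose spectrum matches some fail test case."""
    fail_specs = {s for s, passed in tests if not passed}
    return [(s, p) for s, p in tests if not (p and s in fail_specs)]

def nrs3(tests):
    """Remove every pass/fail pair sharing an identical spectrum."""
    shared = ({s for s, p in tests if not p} & {s for s, p in tests if p})
    return [(s, p) for s, p in tests if s not in shared]

def nrs4(tests):
    """Keep one representative per (spectrum, outcome) combination [54]."""
    seen, kept = set(), []
    for s, p in tests:
        if (s, p) not in seen:
            seen.add((s, p))
            kept.append((s, p))
    return kept

def nrs7(tests):
    """Cascade: NRS4 first, then NRS3, as described in the text."""
    return nrs3(nrs4(tests))

tests = [((1, 1, 0), True), ((1, 1, 0), False),  # contradicting pair
         ((1, 0, 1), True), ((1, 0, 1), True)]   # duplicated pass cases
print(nrs1(tests))  # drops the pass half of the contradicting pair
print(nrs7(tests))  # only one pass case survives: ((1, 0, 1), True)
```

In practice a guard would be needed before NRS2, NRS3, NRS6 and NRS7 to abort the filtering if it would leave no fail test case, as the schemes above require.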

4.5 Experiment Results

In order to validate our observation that there exist test cases with identical spectra in faulty programs, experiments have been conducted on the faulty versions of programs in the Siemens Test Suite. Table 4.2 shows the number of test cases with identical spectra that have been detected and successfully removed by the proposed noise reduction schemes. Due to space limitations, only the numbers of test cases removed by NRS1 to NRS3 are shown as an indication. It can be observed that the percentage of test cases detected to have identical spectra and removed by the noise reduction schemes ranges from 0.45% to 27.48% for the programs in the Siemens Test Suite. This implies that, without a noise reduction scheme, the level of noise present in the Siemens Test Suite can be significantly high, which in turn may deteriorate the accuracy of SBFL.

Experiments have been conducted by applying the proposed noise reduction schemes on programs in the Siemens Test Suite to evaluate the effects of these schemes on the accuracy of SBFL metrics. The accuracy of an SBFL metric can be evaluated with the percentage of code inspected (pci) to locate the faulty line of code.
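The pci measure can be sketched as follows. This is an illustrative computation under the assumption of worst-case tie handling (every line scored at least as suspicious as the faulty line is counted as inspected); conventions for ties vary across SBFL studies, and the thesis does not spell out its tie rule here.

```python
# Sketch: percentage of code inspected (pci) to reach the faulty line,
# assuming worst-case tie handling. Scores and line numbers are made up.

def pci(scores, faulty_line):
    """scores: {line_number: suspiciousness}; returns % of lines inspected."""
    faulty_score = scores[faulty_line]
    inspected = sum(1 for s in scores.values() if s >= faulty_score)
    return 100.0 * inspected / len(scores)

scores = {1: 0.2, 2: 0.9, 3: 0.9, 4: 0.5, 5: 0.1}
print(pci(scores, faulty_line=2))  # 40.0: lines 2 and 3 are inspected
```

A lower pci means the developer inspects less code before reaching the fault, i.e. higher SBFL accuracy.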


Table 4.2 The number of test cases removed by the proposed noise reduction schemes.

    Program        No. of     No. of Test     Total No. of    Removed         Removed        Removed
                   Faulty     Cases in each   Test Cases for  by NRS1         by NRS2        by NRS3
                   Versions   Faulty          all Faulty
                   Tested     Version         Versions
    print_tokens   1          4130            4130            63 (1.53%)      17 (0.41%)     80 (1.94%)
    print_tokens2  7          4115            28805           877 (3.04%)     200 (0.69%)    1077 (3.74%)
    replace        25         5542            138550          1212 (0.87%)    624 (0.45%)    1836 (1.33%)
    schedule       3          2650            7950            1743 (21.92%)   442 (5.56%)    2185 (27.48%)
    schedule2      8          2710            21680           1423 (6.56%)    195 (0.90%)    1618 (7.46%)
    tcas           4          1608            6432            1076 (16.73%)   149 (2.32%)    1225 (19.05%)
    tot_info       14         1052            14728           2100 (14.26%)   324 (2.20%)    2424 (16.46%)

The experiment results will be analysed in two different ways (similar to the previous chapter). In the first analysis, the percentage of code inspected (pci) to locate the faulty line of code is computed as the average pci over the faulty versions of each of the seven programs in the Siemens Test Suite. Based on the average pci for the faulty versions of each program, the overall average pci for all seven programs is then calculated. In the second analysis, the pci to locate the faulty line of code is computed as the overall average pci for all 62 faulty versions in the Siemens Test Suite, without first calculating the average pci for the faulty versions of each program.

The first analysis of the pci of the SBFL metrics under each noise reduction scheme is presented in Table 4.3. The detailed accuracies for each of the seven programs are provided in Table 9.8 to Table 9.14 in the appendix. As the benchmark for comparison, the pci of the SBFL metrics when no noise reduction scheme (NRS) is used is presented in the leftmost column of Table 4.3. A ↑ symbol indicates that the accuracy of the SBFL metric has improved (that is, the pci becomes lower) under an NRS in comparison to the accuracy when no NRS is used. Conversely, a ↓ symbol indicates that the accuracy of the SBFL metric has worsened (that is, the pci becomes higher) under an NRS in comparison to the accuracy when no NRS is used. Lastly, a – indicates that the accuracy of the SBFL metric is unchanged (that is, the pci remains unchanged) under an NRS in comparison to the accuracy when no NRS is used.

Table 4.3 The pci of SBFL metrics under the proposed noise reduction schemes.

    SBFL Metrics      No NRS  NRS1      NRS2      NRS3      NRS4      NRS5      NRS6      NRS7
    Naish1            5.99    5.99 -    6.70 ↓    6.70 ↓    5.98 ↑    5.98 ↑    6.72 ↓    6.70 ↓
    Naish2            5.89    5.89 -    6.60 ↓    6.60 ↓    5.89 ↑    5.89 ↑    6.62 ↓    6.60 ↓
    Jaccard           10.16   9.90 ↑    10.10 ↑   9.84 ↑    8.70 ↑    8.86 ↑    8.59 ↑    8.57 ↑
    Anderberg         10.16   9.90 ↑    10.10 ↑   9.84 ↑    8.70 ↑    8.86 ↑    8.59 ↑    8.57 ↑
    Sorensen-Dice     10.16   9.90 ↑    10.10 ↑   9.84 ↑    8.70 ↑    8.86 ↑    8.59 ↑    8.57 ↑
    Dice              10.16   9.90 ↑    10.10 ↑   9.84 ↑    8.70 ↑    8.86 ↑    8.59 ↑    8.57 ↑
    Tarantula         10.80   10.66 ↑   10.50 ↑   10.18 ↑   9.58 ↑    9.64 ↑    9.18 ↑    9.14 ↑
    QE                10.80   10.66 ↑   10.50 ↑   10.18 ↑   9.58 ↑    9.64 ↑    9.18 ↑    9.14 ↑
    CBI_Inc.          10.80   10.66 ↑   10.50 ↑   10.18 ↑   9.58 ↑    9.64 ↑    9.18 ↑    9.14 ↑
    Simple_Matching   17.17   16.73 ↑   17.27 ↓   16.87 ↑   16.57 ↑   16.25 ↑   16.79 ↑   16.57 ↑
    Sokal             17.17   16.73 ↑   17.27 ↓   16.87 ↑   16.57 ↑   16.25 ↑   16.79 ↑   16.57 ↑
    Rogers&Tanimoto   17.17   16.73 ↑   17.27 ↓   16.87 ↑   16.57 ↑   16.25 ↑   16.79 ↑   16.57 ↑
    Hamming_etc.      17.17   16.73 ↑   17.27 ↓   16.87 ↑   16.57 ↑   16.25 ↑   16.79 ↑   16.57 ↑
    Euclid            17.17   16.73 ↑   17.27 ↓   16.87 ↑   16.57 ↑   16.25 ↑   16.79 ↑   16.57 ↑
    Wong1             10.45   10.45 -   10.99 ↓   10.99 ↓   10.45 -   10.45 -   10.99 ↓   10.99 ↓
    Russel & Rao      10.45   10.45 -   10.99 ↓   10.99 ↓   10.45 -   10.45 -   10.99 ↓   10.99 ↓
    Binary            10.55   10.55 -   11.09 ↓   11.09 ↓   10.55 -   10.55 -   11.09 ↓   11.09 ↓
    Ochiai            7.78    7.88 ↓    7.96 ↓    7.91 ↓    6.98 ↑    7.22 ↑    7.45 ↑    7.42 ↑
    M2                6.92    6.97 ↓    7.24 ↓    7.22 ↓    6.65 ↑    6.68 ↑    7.22 ↓    7.20 ↓
    AMPLE2            10.45   10.45 -   10.99 ↓   10.99 ↓   10.45 -   10.45 -   10.99 ↓   10.99 ↓
    Wong3             6.17    6.17 -    18.78 ↓   18.84 ↓   6.31 ↓    6.39 ↓    20.61 ↓   17.56 ↓
    Arithmetic_Mean   8.64    8.62 ↑    8.66 ↓    8.49 ↑    8.43 ↑    8.61 ↑    8.21 ↑    8.17 ↑
    Cohen             10.65   10.27 ↑   10.31 ↑   10.09 ↑   9.32 ↑    9.35 ↑    8.88 ↑    8.85 ↑
    Kulczynski1       10.16   9.90 ↑    10.10 ↑   9.76 ↑    8.70 ↑    8.86 ↑    8.59 ↑    8.49 ↑
    M1                17.17   16.73 ↑   17.27 ↓   16.79 ↑   16.57 ↑   16.25 ↑   16.79 ↑   16.48 ↑
    Ochiai2           10.88   10.67 ↑   10.19 ↑   9.99 ↑    10.74 ↑   10.63 ↑   10.36 ↑   10.32 ↑
    Zoltar            5.90    6.06 ↓    6.62 ↓    6.62 ↓    5.89 ↑    5.97 ↓    6.64 ↓    6.61 ↓
    Ample             10.74   10.86 ↓   10.87 ↓   10.75 ↓   10.45 ↑   10.69 ↑   10.81 ↓   10.77 ↓
    Geometric_Mean    8.97    8.92 ↑    8.55 ↑    8.46 ↑    8.39 ↑    8.52 ↑    8.29 ↑    8.27 ↑
    Harmonic_Mean     9.19    9.08 ↑    8.70 ↑    8.57 ↑    8.65 ↑    8.88 ↑    8.47 ↑    8.45 ↑
    Rogot2            9.19    9.08 ↑    8.70 ↑    8.57 ↑    8.65 ↑    8.88 ↑    8.47 ↑    8.45 ↑


Based on the experiment results in Table 4.3, it can be observed that, under NRS1, NRS3, NRS4, NRS5, NRS6 and NRS7, the pci for most SBFL metrics has become lower, which means that the accuracy of these SBFL metrics has improved. The only exception is NRS2, which brings improvement to only a small number of SBFL metrics. These observations are consistent with the findings from Chapter 3, where the removal of excessive pass test cases improves the accuracy of SBFL.

It can also be noted that all SBFL metrics have achieved an improvement in SBFL accuracy (lower pci) through one or more of the proposed NRSs, except {Wong1, Russel & Rao, Binary, AMPLE2, Wong3}.

The second analysis of the experiment results is based on the overall average pci for all 62 faulty versions in the Siemens Test Suite, without first calculating the average pci for the faulty versions of each program. This second analysis of the effect of the proposed NRSs on the accuracy of SBFL metrics is presented in Table 4.4. In this table, we compare the accuracy of each SBFL metric when no NRS is applied with the accuracy of the same SBFL metric when the best performing NRS is applied. In addition to the average pci for each SBFL metric, we also count the frequencies of faulty versions that have recorded an improvement, no change, and a decline in accuracy, expressed as percentages. Lastly, the average improvement and average decline have also been calculated, as shown in Table 4.4.

In Table 4.4, the SBFL metrics shaded in grey are those that have recorded an improvement in accuracy when the best performing NRS is applied. These SBFL metrics are {Jaccard, Anderberg, Sorensen-Dice, Dice, Tarantula, QE, CBI_Inc., Simple_Matching, Sokal, Rogers&Tanimoto, Hamming_etc., Euclid, Ochiai, Arithmetic_Mean, Cohen, Kulczynski1, M1, Ochiai2, Geometric_Mean, Harmonic_Mean, Rogot2}. Consistent with the findings in Table 4.3, {Wong1, Russel & Rao, Binary, AMPLE2, Wong3} recorded a decline in SBFL accuracy (higher pci). However, {Naish1, Naish2, M2, Zoltar and Ample}, which showed a slight improvement in the first analysis in Table 4.3, have recorded a decline in SBFL accuracy in the second analysis in Table 4.4. Therefore, these improvements have to be further analysed and interpreted with care by taking into consideration the frequency of improvement and the average improvement. A higher frequency of improvement compared to the frequency of decline means that more faulty versions recorded an improvement in SBFL accuracy than a decline, and vice versa. Therefore, in addition to a lower average pci (higher SBFL accuracy), a higher frequency of improvement compared to the frequency of decline is desirable. Likewise, a higher average improvement in pci is more desirable.

Table 4.4 Analysis of pci for 62 faulty versions with the best performing NRS.

    SBFL Metrics      pci With  Best        pci With   Freq. of      Freq. of     Freq. of  Average      Average
                      No NRS    Performing  Best NRS   Improvement   Equivalent   Decline   Improvement  Decline
                                NRS                    (%)           (%)          (%)
    Naish1            5.44      NRS5        5.67       11.29%        77.42%       11.29%    2.61         4.64
    Naish2            5.17      NRS5        5.40       11.29%        77.42%       11.29%    2.61         4.64
    Jaccard           8.87      NRS7        8.25       46.77%        32.26%       20.97%    3.56         5.01
    Anderberg         8.87      NRS7        8.25       46.77%        32.26%       20.97%    3.56         5.01
    Sorensen-Dice     8.87      NRS7        8.25       46.77%        32.26%       20.97%    3.56         5.01
    Dice              8.87      NRS7        8.25       46.77%        32.26%       20.97%    3.56         5.01
    Tarantula         9.22      NRS7        8.69       40.32%        27.42%       32.26%    4.04         3.41
    QE                9.22      NRS7        8.69       40.32%        27.42%       32.26%    4.04         3.41
    CBI_Inc.          9.22      NRS7        8.69       40.32%        27.42%       32.26%    4.04         3.41
    Simple_Matching   17.39     NRS5        17.32      41.94%        19.35%       38.71%    3.73         3.86
    Sokal             17.39     NRS5        17.32      41.94%        19.35%       38.71%    3.73         3.86
    Rogers&Tanimoto   17.39     NRS5        17.32      41.94%        19.35%       38.71%    3.73         3.86
    Hamming_etc.      17.39     NRS5        17.32      41.94%        19.35%       38.71%    3.73         3.86
    Euclid            17.39     NRS5        17.32      41.94%        19.35%       38.71%    3.73         3.86
    Wong1             9.24      NRS5        9.24       0.00%         100.00%      0.00%     0.00         0.00
    Russel & Rao      9.24      NRS5        9.24       0.00%         100.00%      0.00%     0.00         0.00
    Binary            9.52      NRS5        9.52       0.00%         100.00%      0.00%     0.00         0.00
    Ochiai            6.91      NRS4        6.51       43.55%        40.32%       16.13%    2.63         4.63
    M2                6.07      NRS4        6.18       30.65%        50.00%       19.35%    1.83         3.47
    AMPLE2            9.24      NRS5        9.24       0.00%         100.00%      0.00%     0.00         0.00
    Wong3             5.92      NRS1        5.92       0.00%         100.00%      0.00%     0.00         0.00
    Arithmetic_Mean   7.70      NRS7        7.56       41.94%        35.48%       22.58%    2.32         3.66
    Cohen             9.17      NRS7        8.56       43.55%        32.26%       24.19%    3.80         4.33
    Kulczynski1       8.87      NRS7        8.22       50.00%        29.03%       20.97%    3.41         5.01
    M1                17.39     NRS5        17.32      41.94%        19.35%       38.71%    3.73         3.86
    Ochiai2           10.01     NRS3        9.45       41.94%        45.16%       12.90%    1.65         1.07
    Zoltar            5.19      NRS4        5.41       12.90%        75.81%       11.29%    2.35         4.64
    Ample             9.83      NRS4        10.74      35.48%        27.42%       37.10%    3.07         5.39
    Geometric_Mean    7.89      NRS7        7.59       40.32%        35.48%       24.19%    2.83         3.48
    Harmonic_Mean     8.11      NRS7        7.83       43.55%        27.42%       29.03%    2.82         3.25
    Rogot2            8.11      NRS7        7.83       43.55%        27.42%       29.03%    2.82         3.25


From Table 4.4, it can be observed that all of the SBFL metrics that recorded an improvement in SBFL accuracy (lower average pci) have a higher frequency of improvement compared to the frequency of decline, which is desirable. However, out of these SBFL metrics with improved SBFL accuracy, only {Tarantula, QE, CBI_Inc. and Ochiai2} have the desired higher average improvement compared to average decline. Therefore, we rate these as the four best performing SBFL metrics when used in conjunction with an NRS. The corresponding NRS for each of these SBFL metrics is: Ochiai2 (NRS3), Tarantula (NRS7), QE (NRS7), and CBI_Inc. (NRS7).

4.6 Discussion

From the experiment results in Section 4.5, the accuracy of most SBFL metrics has improved when noise reduction scheme NRS1 is applied to remove pass test cases with spectra identical to those of the fail test cases. Conversely, only a few SBFL metrics have shown improvements in their accuracy when NRS2 is applied to remove fail test cases with spectra identical to those of the pass test cases. These observations suggest that fail test cases carry more information on the location of the faulty lines of code than pass test cases, which justifies why NRS1 performs better than NRS2. This is also consistent with the finding from Chapter 3, where the removal of excessive pass test cases improves the accuracy of SBFL.

On the other hand, SBFL demonstrates mixed accuracy performance under NRS3, where both pass and fail test cases with identical spectra are removed. Although there is less noise in the resulting spectra compared to NRS1 and NRS2, removing the fail test cases together with the pass test cases may also cause essential information about the location of the faulty line of code to be lost.

In NRS5, NRS6 and NRS7, where NRS4 is applied to the test suite before NRS1, NRS2 and NRS3 respectively, we can observe that the accuracy of SBFL metrics is better than when NRS1, NRS2 and NRS3 are applied alone. In particular, NRS7, which cascades NRS4 and NRS3, delivers the best accuracy for many SBFL metrics. This is a significant observation because NRS7 is the noise reduction scheme that removes the largest number of test cases with noisy spectra of all the noise reduction schemes studied. Having said that, even though NRS7 delivers the best accuracy for the largest number of SBFL metrics studied, it deteriorates the accuracy of a few SBFL metrics. Therefore, selecting the best performing noise reduction scheme for the SBFL metric in use is essential in practice.

Table 4.5 Guide to choose noise reduction scheme for SBFL practitioners.

    SBFL Metric         NRS     Recommended / Highly Recommended
    Ample               -       -
    Ample2              -       -
    Anderberg           NRS7    Recommended
    Arithmetic Mean     NRS7    Recommended
    Binary              -       -
    CBI Inc             NRS7    Highly Recommended
    Cohen               NRS7    Recommended
    Euclid              NRS5    Recommended
    Geometric Mean      NRS7    Recommended
    Hamming etc         NRS5    Recommended
    Harmonic Mean       NRS7    Recommended
    Jaccard             NRS7    Recommended
    Kulczynski1         NRS7    Recommended
    M1                  NRS5    Recommended
    M2                  NRS4    Recommended
    Naish1              -       -
    Naish2              -       -
    Ochiai              NRS4    Recommended
    Ochiai2             NRS3    Highly Recommended
    QE                  NRS7    Highly Recommended
    Rogers & Tanimoto   NRS5    Recommended
    Rogot2              NRS7    Recommended
    Russel & Rao        -       -
    Simple Matching     NRS5    Recommended
    Sokal               NRS5    Recommended
    Sorensen-Dice       NRS7    Recommended
    Tarantula           NRS7    Highly Recommended
    Wong1               -       -
    Wong3               -       -
    Zoltar              -       -


Based on the improvements in the accuracy of SBFL metrics in Table 4.3 and Table 4.4, we have compiled a guide in Table 4.5 for SBFL practitioners to select the best performing noise reduction scheme for the SBFL metrics that they use. The "Recommended" rows are those SBFL metrics that have a higher frequency of improvement than frequency of decline in SBFL accuracy when the best performing NRS is applied, but a lower average improvement than average decline. On the other hand, the "Highly Recommended" rows are those SBFL metrics that not only have a higher frequency of improvement than frequency of decline when the best performing NRS is applied, but also a higher average improvement than average decline.

4.7 Conclusion

In this chapter, we propose six new noise reduction schemes to remove and eliminate test cases which provide duplicated, ambiguous and contradicting information, and evaluate the resulting improvements in accuracy for over 30 SBFL metrics under study. From the experiments conducted on 62 faulty versions of programs in the Siemens Test Suite, we found that as many as 27% of the test cases have identical spectra. This significantly high percentage of test cases with identical spectra is essentially noise to the SBFL metrics, which may in turn deteriorate their accuracy.

Experiments were conducted by applying the six proposed noise reduction schemes on programs in the Siemens Test Suite to evaluate the effects of these schemes on the accuracy of SBFL metrics. The experiment results showed that the proposed noise reduction schemes successfully improved the accuracy of the SBFL metrics under study. In addition, we identified the best performing noise reduction scheme for each SBFL metric. Based on our findings, we further provide a guide for SBFL practitioners to select the best performing noise reduction scheme for the SBFL metrics that they use.


In order to enable and support the application of these noise reduction schemes in practice, the design and development of a practical SBFL tool with a test case pre-processor will be presented and discussed in the next chapter.


5 Spectrum-based Fault Localization Tool with Test Case Pre-processor

5.1 Introduction

A recent study on Spectrum-based Fault Localization (SBFL) using non-redundant test cases [54], as well as our studies in Chapter 3 and Chapter 4, found that duplicated, contradicting and ambiguous information in test case execution profiles (spectra) may deteriorate the accuracy of SBFL metrics. For example, if the same set of lines of code is executed by a pass test case and a fail test case, the spectra gathered from this pair of test cases are essentially noise to the SBFL metrics, because they do not provide useful information to distinguish between correct and faulty lines of code. On the other hand, the presence of an excessive number of test cases with duplicate or identical spectra may also result in biased ranking by SBFL metrics and inadvertently deteriorate their accuracy. Therefore, by eliminating the contradicting and duplicated spectra, the accuracy of many existing SBFL metrics could be improved, as found in the studies conducted in Chapter 3 and Chapter 4.

In this chapter, we propose and develop an SBFL tool with a novel built-in test case pre-processor. Recently, attempts have been made to develop tools that enable the practical application of SBFL by software developers in real-life projects. In 2009, Janssen, Abreu, and van Gemund [44] developed an automatic fault localization toolset, named Zoltar [43], to support the application of SBFL in practice. In 2012, Campos et al. further developed Zoltar into GZoltar, an Eclipse IDE plug-in for testing and debugging [12].


Figure 5.1 Block Diagram of the Proposed SBFL Tool

Different from the existing SBFL tools in [12][44][43], which focus on SBFL metrics, we have developed an SBFL tool with a novel test case pre-processor that allows users to filter out test cases that provide contradicting, duplicated or otherwise noisy spectra. As shown in Figure 5.1, an additional pre-processing (filtering) stage is carried out before the SBFL metrics are used to analyse the spectra and rank the lines of code according to their suspiciousness of being faulty. This test case pre-processor can be used with all existing SBFL metrics proposed by other researchers. We have embedded nine Noise Reduction Schemes (NRS) into the proposed tool as test case pre-processors in order to eliminate noise in the spectra collection. The tool is also able to automatically select the best performing Noise Reduction Scheme to improve the accuracy of the SBFL metric selected by the user.

This chapter makes the following contributions:

1. We added a “test case preprocessing” stage to SBFL, which, to the best of our

knowledge, is novel to the Spectrum-based Fault Localization (SBFL) technique.

2. The first SBFL tool that includes Noise Reduction Schemes (NRSs) as a test case preprocessor was developed to filter out test cases that provide contradicting, duplicated, or other noisy spectra.

3. We conducted case studies on Siemens programs to evaluate and demonstrate the

effectiveness of the proposed SBFL tool with test case preprocessing in improving the

accuracy of existing SBFL metrics.


5.2 Pre-processing Scheme

The empirical studies in Chapter 3 and Chapter 4 found that the accuracy of SBFL metrics can be improved by removing certain categories of test cases that provide contradicting, duplicated, or other noisy spectra. Based on these findings, we implemented nine NRSs (two NRSs from Chapter 3 and seven NRSs from Chapter 4) in the SBFL tool to eliminate such test cases. The nine NRSs are outlined below:

1. Noise Reduction Scheme 1 (NRS1) – Remove all pass test cases with a spectrum identical to that of any fail test case.

2. Noise Reduction Scheme 2 (NRS2) – Remove all fail test cases with a spectrum identical to that of any pass test case. Note that this Noise Reduction Scheme may cause all fail test cases to be removed, and without any fail test case SBFL cannot be carried out. Therefore, if all fail test cases are identical to pass test cases, this Noise Reduction Scheme will not remove any test case.

3. Noise Reduction Scheme 3 (NRS3) – Remove, from both categories, all pass and fail test cases that share an identical spectrum with one another. As with NRS2, if this filter would remove all fail test cases, it is not applied.

4. Noise Reduction Scheme 4 (NRS4) – For each set of pass test cases with identical

spectrum, remove all except one test case. Similarly, for each set of fail test cases with

identical spectrum, remove all except one test case.

5. Noise Reduction Scheme 5 (NRS5) – This filter is the combination of NRS4 and

NRS1, where NRS4 is applied before NRS1.

6. Noise Reduction Scheme 6 (NRS6) – This filter is the combination of NRS4 and

NRS2, where NRS4 is applied before NRS2.

7. Noise Reduction Scheme 7 (NRS7) – This filter is the combination of NRS4 and

NRS3, where NRS4 is applied before NRS3.

Page 67: On improving the accuracy of spectrum-based fault localization€¦ · On Improving the Accuracy of Spectrum-based Fault Localization . Patrick Daniel . Master of Science . 2014

54

8. Noise Reduction Scheme 8 (NRS8) – Remove all existing pass test cases from the test

suite.

9. Noise Reduction Scheme 9 (NRS9) – Randomly select one pass test case and remove

all other existing pass test cases from the test suite.

Noise Reduction Schemes NRS1, NRS2, NRS3, NRS5, NRS6, and NRS7 are developed based on the findings in Chapter 4, while NRS8 and NRS9 are developed based on the findings in Chapter 3. NRS4, on the other hand, is adopted from the study on using non-redundant test cases for SBFL in [54]. The impact of each pre-processing scheme differs from one SBFL metric to another, as each SBFL metric is designed differently. The sets of SBFL metrics that benefit most from the test case removal in each of these pre-processing schemes have also been identified in Chapter 3 and Chapter 4.

5.3 SBFL Tool with Test Case Pre-processor

In this chapter, we propose and develop a novel Spectrum-based Fault Localization (SBFL) tool. Different from the existing SBFL tools in [12][44][43], this SBFL tool has an integrated test case pre-processor (as shown in Figure 5.1) that can remove noisy test cases to improve the accuracy of SBFL metrics. The proposed tool also supports all existing SBFL metrics (listed in Chapter 2, Section 2.3).

The UML interaction diagram of this tool is shown in Figure 5.2, and screenshots of its GUI are shown in Figure 5.3 to Figure 5.6. The first tab of the GUI is a file browser which allows the user to browse to the location of the input files. The SBFL tool requires the source code of the faulty program and its program spectra as inputs, as shown in Figure 5.3. The Test Case Pool tab then allows the user to choose the SBFL metric used to rank the lines of code in the program according to their suspiciousness to be faulty. In addition, the user can choose the Noise Reduction Scheme (NRS) to be applied.


Figure 5.2 UML Interaction Diagram

Figure 5.3 File Browser Screenshot


Figure 5.4 Test Case Pool Screenshot

Figure 5.5 SBFL result screenshot


Figure 5.6 Log record of the tool

The main purpose of these NRSs is to improve the accuracy of the chosen SBFL metric. If the user is unsure which NRS to choose, a recommend-filter button assigns the best performing NRS for the chosen SBFL metric according to the findings of the empirical studies conducted in Chapter 3 and Chapter 4 of this thesis. For example, in Figure 5.4, when Ochiai2 is chosen as the SBFL metric, NRS3 is recommended as the Noise Reduction Scheme because the empirical studies in Chapter 4 found that the accuracy of Ochiai2 improves most when NRS3 is used.
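This recommendation amounts to a simple lookup. The sketch below uses the metric-to-NRS pairings reported in the "NRS Selected" entries of Table 5.1; the dictionary and function names are our own, not the tool's actual identifiers.

```python
# Sketch of the recommend-filter lookup (metric-to-NRS pairings taken from
# Table 5.1; identifiers below are illustrative, not the tool's own).
RECOMMENDED_NRS = {
    "Naish1": "NRS5", "Jaccard": "NRS7", "Dice": "NRS7", "Tarantula": "NRS7",
    "Ochiai": "NRS4", "M2": "NRS4", "Arithmetic_Mean": "NRS7",
    "Ochiai2": "NRS3", "Zoltar": "NRS4",
}

def recommend_filter(metric_name):
    # Fall back to "none" (no filtering) for metrics without a recommendation.
    return RECOMMENDED_NRS.get(metric_name, "none")

print(recommend_filter("Ochiai2"))  # → NRS3
```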

In the Result tab in Figure 5.5, the tool presents detailed information on the faulty program, the SBFL metric used, and the test case pre-processing scheme applied. The result can be shown in the original sequence of the source code or by the ranking calculated with the chosen SBFL metric, where a line of code ranked near the top is more likely to be faulty than lower ranked lines. All execution history of the tool is logged in the last tab for history tracking, as shown in Figure 5.6.


5.4 Case Study on Siemens Test Suite

In this section, we conduct a case study using the proposed SBFL tool with test case pre-processing to locate faults in one faulty version selected from each program in the Siemens Test Suite. The following seven faulty programs are selected: print_tokens v5, print_tokens2 v3, replace v2, schedule v4, schedule2 v8, tcas v5, and tot_info v12. We use nine SBFL metrics, namely Naish1, Jaccard, Dice, Tarantula, Ochiai, M2, Arithmetic_Mean, Ochiai2, and Zoltar, to rank the code in the selected faulty versions according to its likeliness to be faulty. With seven faulty programs and nine SBFL metrics, a total of 63 cases have been trialled on the proposed tool.

The results of this case study are presented in Table 5.1. The accuracy of each SBFL metric without NRS and with NRS is compared in terms of the percentage of code inspected (pci) before the faulty line is located. In other words, we compare the accuracy of the SBFL metrics before and after the NRS is applied. Out of the 63 cases studied, there are 40 cases (or 63% of the cases) where the accuracy of the SBFL metric improves after the NRS is applied. These cases are highlighted in green cells in Table 5.1. The largest improvement in pci is observed in the case where NRS7 is applied to the test cases for print_tokens v5 before Jaccard is used as the SBFL metric to rank the code according to its suspiciousness to be faulty. The pci improves by more than 90% (from 3.73% to 0.36%). In other words, a software developer only needs to inspect 0.36% of the code (instead of 3.73%) before successfully locating the faulty line of code.
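The improvement figure quoted above can be checked with a one-line calculation; the helper function below is our own illustrative sketch, not part of the tool.

```python
# Relative pci improvement for the Jaccard / print_tokens v5 case in Table 5.1
# (the helper function is illustrative, not part of the proposed tool).

def relative_improvement(pci_before, pci_after):
    """Percentage reduction in the percentage of code inspected (pci)."""
    return (pci_before - pci_after) / pci_before * 100

print(round(relative_improvement(3.73, 0.36), 1))  # → 90.3
```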

On the other hand, there are only four cases (or 6% of the cases) where the accuracy of the SBFL metric deteriorates after the NRS is applied. These worse accuracy results are observed in four SBFL metrics for the faulty program replace v2, and are highlighted in red cells in Table 5.1. A possible reason is that the duplicated and contradicting test cases removed by the NRS also contain useful information that contributes positively to the ranking accuracy of the SBFL metrics.


Finally, out of the 63 cases studied, there are 19 cases (or 30% of the cases) where the accuracy of the SBFL metric remains unchanged after the NRS is applied. There are two possible reasons for this observation. Firstly, the test suite used may not contain test cases with duplicated, contradicting, or other noisy spectra, in which case applying an NRS has no effect on the accuracy of the SBFL metric. Secondly, the differences in SBFL metric values between the faulty line of code and the other lines can be so large that applying an NRS to remove the noisy spectra does not alter the relative ranking of the faulty line of code and the other lines.

Table 5.1 Siemens Test Suite case study: Percentage of Code Inspected (pci) to locate the faulty line of code

SBFL Metric (NRS Selected)   Mode         print_tokens v5  print_tokens2 v3  replace v2  schedule v4  schedule2 v8  tcas v5  tot_info v12
Naish1 (NRS5)                Without NRS  0.36             5.89              0.36        0.97         20.20         13.87    4.19
                             With NRS     0.36             5.89              0.36        0.97         14.66         13.87    3.69
Jaccard (NRS7)               Without NRS  3.73             17.88             6.93        4.62         29.97         19.65    7.14
                             With NRS     0.36             8.84              9.95        1.46         18.89         17.34    4.19
Dice (NRS7)                  Without NRS  3.73             17.88             6.93        4.62         29.97         19.65    7.14
                             With NRS     0.36             8.84              9.95        1.46         18.89         17.34    4.19
Tarantula (NRS7)             Without NRS  6.93             17.88             7.46        4.62         29.97         19.65    7.64
                             With NRS     1.95             9.43              9.95        1.46         18.89         17.34    4.68
Ochiai (NRS4)                Without NRS  0.36             7.27              0.71        2.68         23.13         15.03    6.16
                             With NRS     0.36             7.07              0.71        0.97         17.59         14.45    3.69
M2 (NRS4)                    Without NRS  0.36             7.27              0.71        0.97         22.48         13.87    4.43
                             With NRS     0.36             7.07              0.71        0.97         16.94         13.87    3.69
Arithmetic_Mean (NRS7)       Without NRS  0.53             10.81             1.78        1.70         24.43         18.50    6.65
                             With NRS     0.36             7.07              1.78        1.46         18.57         17.34    4.19
Ochiai2 (NRS3)               Without NRS  0.53             19.25             3.20        4.62         33.22         20.23    7.14
                             With NRS     0.53             19.25             3.37        1.46         31.92         17.34    6.65
Zoltar (NRS4)                Without NRS  0.36             5.89              0.36        0.97         20.20         13.87    4.19
                             With NRS     0.36             5.89              0.36        0.97         14.66         13.87    3.69


5.5 Conclusion

Spectrum-based Fault Localization (SBFL) has proven to be an effective approach to locating the faulty line of code during the debugging process. Research efforts in the area of SBFL have focused on designing more accurate SBFL metrics to efficiently locate faulty code in software. In addition, software tools have been developed to help software developers adopt SBFL in practice. However, the studies conducted in Chapter 3 and Chapter 4 of this thesis have found that the presence of contradicting and duplicated test case execution profiles (spectra) may reduce the accuracy of most SBFL metrics.

In this chapter, we proposed and developed an SBFL tool with built-in Noise Reduction Schemes (NRSs) as a test case pre-processor. Different from existing SBFL tools, which focus on SBFL metrics, the proposed tool has a novel test case pre-processor that allows users to filter out test cases that provide contradicting, duplicated, or other noisy spectra. An additional pre-processing (filtering) stage is carried out before the SBFL metrics are used to analyse the spectra and rank the lines of code according to their suspiciousness to be faulty. This test case pre-processor can be used with all existing SBFL metrics proposed by other researchers. We embedded nine Noise Reduction Schemes into the proposed tool. In addition, the tool is able to automatically select the best performing NRS to improve the accuracy of the SBFL metric selected by the user.

A case study has been conducted to evaluate the effectiveness of the proposed SBFL tool with test case pre-processor in locating the faulty line of code in seven faulty programs sourced from the Siemens Test Suite. The results show that the proposed SBFL tool successfully improved the accuracy of the SBFL metrics in the vast majority of the cases studied.


6 A New Pair Scoring Metric for Spectrum-based Fault Localization

6.1 Introduction

In the previous chapters, attempts were made to improve the accuracy of Spectrum-based Fault Localization (SBFL) by using Noise Reduction Schemes (NRSs) to remove test cases whose spectra provide duplicated and contradicting information. The empirical evaluations on faulty programs in the Siemens Test Suite have shown positive results in improving the accuracy of SBFL metrics.

Motivated by these findings, in this chapter we design and formulate a new SBFL metric that assigns a higher score to a line of code when the spectra from a pair of pass and fail test cases provide non-contradicting information that the line is likely to be faulty, and vice versa. This new SBFL metric is named the "Pair Scoring" metric.

6.2 Methodology of the New SBFL Metric

The new Pair Scoring metric works by assigning a score to each line of code based on the execution coverage of a pair consisting of a pass test case and a fail test case. An example is shown in Figure 6.1. Consider fail test case f and pass test case g. Scores are assigned to each line of code based on whether or not the line is executed by f and g. The highest score (2) is given to a line executed by fail test case f but not by pass test case g, because this coverage combination suggests that the line of code is high risk and likely to be faulty. A lower, uncertain score (1) is given to a line executed by both f and g, because such a line is equally likely to be faulty or not faulty. Conversely, the lowest score (0) is given to a line executed only by pass test case g and not by fail test case f, because this coverage combination suggests that the line is low risk and unlikely to be faulty. The lowest score (0) is also given to a line executed by neither f nor g, because such coverage provides no information.


Figure 6.1 Basics of the Pair Scoring metric.

Based on these scoring rules, the pair scoring process is repeated for all possible pairing combinations of pass test cases and fail test cases in the test suite. The sum of the scores for each line of code is then used to rank the line by its likeliness to be faulty. The example in Figure 6.1 shows the faulty line ranked fifth out of 25 lines based on pairing fail test case f with pass test case g. Based on this illustration of how Pair Scoring works for a single pair of test cases, we design a new SBFL metric which can be applied to a whole test suite by counting all possible pairings of a pass and a fail test case for each combination of execution or non-execution of a line of code, and multiplying each count by the corresponding score in the rules presented in Figure 6.1.

The scoring rules of Figure 6.1 can be summarised as follows: a line executed by the fail test case but not by the pass test case scores 2; a line executed by both scores 1; a line executed only by the pass test case, or by neither, scores 0.

Figure 6.2 Pair Scoring Equation.

Original equation:

Pair Scoring = (anp × anf) × 0 + (anp × aef) × 2 + (aep × anf) × 0 + (aep × aef) × 1   (1)

Simplified equation:

Pair Scoring = aef × (2 × anp + aep)   (2)


For each line of code, the number of combinations of a pass test case that has not executed the line with a fail test case that has not executed the line is anp multiplied by anf (anp × anf); this is multiplied by zero, as each such pair is assigned a score of zero. This forms the first term of Equation (1) in Figure 6.2. Similarly, for the second term of Equation (1), there are (anp × aef) combinations pairing a pass test case that has not executed the line with a fail test case that has executed it; this is multiplied by two, as each such pair is assigned the high-risk score of two shown in the rules in Figure 6.1. Next, for the third term of Equation (1), there are (aep × anf) combinations pairing a pass test case that has executed the line with a fail test case that has not executed it; this is multiplied by zero, as each such pair is assigned the low-risk score of zero. Finally, for the fourth term of Equation (1), there are (aep × aef) combinations pairing a pass test case that has executed the line with a fail test case that has also executed it; this is multiplied by one, as each such pair is assigned the uncertain score of one. The complete formula for the new Pair Scoring SBFL metric is shown in Equation (1), which can be further simplified to Equation (2).

6.3 Experiment

To evaluate the accuracy of the proposed Pair Scoring SBFL metric, experiments are conducted on the faulty versions of the seven programs in the Siemens Test Suite. All test cases provided in the Siemens Test Suite are executed on each faulty version of the programs to collect the program spectra. We exclude print_tokens {v4, v6} because these versions are identical to the original correct version of the program, where no faulty code exists. As this study focuses on single-fault programs, print_tokens {v1}, replace {v21}, schedule {v2, v7}, and tcas {v10, v11, v15, v31, v32, v33, v40} are excluded because multiple faulty lines of code exist in these versions. We also exclude print_tokens {v2}, replace {v12}, tcas {v13, v14, v36, v38}, and tot_info {v6, v10, v19, v21}, where the faulty line of code is non-executable. In addition, some faulty versions do not produce any failure output even though a faulty line of code exists in the program; as a result, we exclude print_tokens2 {v10}, replace


{v32}, and schedule2 {v9}. In total, 106 faulty programs are used in the experiment to evaluate the accuracy of the proposed Pair Scoring metric. We use GCC version 4.6.1 with Gcov (GNU GCC), running on Ubuntu 11.10, to gather the spectra from these faulty programs in the Siemens Test Suite.

Table 6.1 Siemens Test Suite specifications.

Program        Faulty Versions  LOC  Number of Test Cases  Description          Versions excluded in experiments
print_tokens   7                563  4130                  Lexical analyser     1, 2, 4, 6
print_tokens2  10               508  4115                  Lexical analyser     10
replace        32               563  5542                  Pattern recognition  12, 21, 32
schedule       9                410  2650                  Priority scheduler   2, 7
schedule2      10               307  2710                  Priority scheduler   9
tcas           41               173  1608                  Altitude separation  10, 11, 13, 14, 15, 31, 32, 33, 36, 38, 40
tot_info       23               406  1052                  Information measure  6, 10, 19, 21

6.4 Result Analysis

As discussed in Chapter 2, the percentage of code inspected (pci) before the faulty line of code is successfully located is used to measure the accuracy of an SBFL metric. For each program in the Siemens Test Suite, we average the pci results over all faulty versions of that program. The results of the experiments are shown in Table 6.2 for print_tokens, print_tokens2, replace, and schedule, and in Table 6.3 for schedule2, tcas, and tot_info. The average pci over all seven programs in the Siemens Test Suite is also presented in Table 6.3. The same experiment is repeated on 31 other existing SBFL metrics for accuracy comparison.


Table 6.2 Average SBFL Metric Accuracy (pci, %). For each program, the metrics are listed from most accurate (lowest pci) to least accurate.

print_tokens: Ample 0.36; M2 0.36; Naish1 0.36; Naish2 0.36; Ochiai 0.36; Pair Scoring 0.36; Wong3 0.36; Zoltar 0.36; Arithmetic_Mean 0.41; Geometric_Mean 0.41; Ochiai2 0.41; Harmonic_Mean 1.01; Rogot2 1.01; Anderberg 1.48; Dice 1.48; Jaccard 1.48; Sorensen-Dice 1.48; Cohen 2.25; CBI_Inc. 2.55; QE 2.55; Tarantula 2.55; Euclid 5.68; Hamming_etc. 5.68; Rogers&Tanimoto 5.68; Simple_Matching 5.68; Sokal 5.68; AMPLE2 8.70; Binary 8.70; Russel & Rao 8.70; Wong1 8.70; Kulczynski1 22.20; M1 26.94

print_tokens2: Naish1 1.29; Naish2 1.29; Zoltar 1.29; Wong3 1.37; M2 2.80; Pair Scoring 3.01; Ochiai 4.68; Arithmetic_Mean 6.62; Geometric_Mean 6.69; Harmonic_Mean 7.12; Rogot2 7.12; Ample 7.87; Anderberg 7.91; Dice 7.91; Jaccard 7.91; Sorensen-Dice 7.91; Ochiai2 7.97; CBI_Inc. 8.08; Cohen 8.08; QE 8.08; Tarantula 8.08; AMPLE2 10.90; Binary 10.90; Russel & Rao 10.90; Wong1 10.90; Euclid 11.70; Hamming_etc. 11.70; Rogers&Tanimoto 11.70; Simple_Matching 11.70; Sokal 11.70; Kulczynski1 24.30; M1 28.10

replace: Naish2 2.40; Pair Scoring 2.40; Zoltar 2.43; M2 2.43; Ochiai 2.66; Geometric_Mean 2.92; Naish1 2.98; Arithmetic_Mean 2.99; Harmonic_Mean 3.10; Rogot2 3.10; Ochiai2 3.62; Anderberg 3.89; Dice 3.89; Jaccard 3.89; Kulczynski1 3.89; Sorensen-Dice 3.89; Cohen 4.08; CBI_Inc. 4.10; QE 4.10; Tarantula 4.10; Ample 4.89; Wong3 6.26; AMPLE2 6.44; Russel & Rao 6.44; Wong1 6.44; Binary 7.02; Euclid 15.04; Hamming_etc. 15.04; M1 15.04; Rogers&Tanimoto 15.04; Simple_Matching 15.04; Sokal 15.04

schedule: M2 1.11; Naish1 1.11; Naish2 1.11; Zoltar 1.11; Pair Scoring 1.21; Ochiai 1.35; Anderberg 1.74; Dice 1.74; Jaccard 1.74; Kulczynski1 1.74; QE 1.74; Sorensen-Dice 1.74; Tarantula 1.74; Arithmetic_Mean 8.04; Geometric_Mean 8.18; Harmonic_Mean 8.18; Rogot2 8.18; CBI_Inc. 8.57; Cohen 8.57; Euclid 9.54; Hamming_etc. 9.54; M1 9.54; Rogers&Tanimoto 9.54; Simple_Matching 9.54; Sokal 9.54; AMPLE2 9.89; Binary 9.89; Russel & Rao 9.89; Wong1 9.89; Ochiai2 12.41; Ample 17.68; Wong3 19.28


Table 6.3 Average SBFL Metric Accuracy (pci, %). For each program, the metrics are listed from most accurate (lowest pci) to least accurate; the Average column is ranked over all seven programs.

schedule2: AMPLE2 15.73; Binary 15.73; Russel & Rao 15.73; Wong1 15.73; Naish1 17.35; Naish2 17.35; Zoltar 17.35; M2 20.03; Ochiai 20.46; Arithmetic_Mean 20.83; Pair Scoring 21.48; Geometric_Mean 22.20; Harmonic_Mean 23.25; Rogot2 23.25; Anderberg 23.28; CBI_Inc. 23.28; Cohen 23.28; Dice 23.28; Jaccard 23.28; Kulczynski1 23.28; QE 23.28; Sorensen-Dice 23.28; Tarantula 23.28; Wong3 23.87; Ochiai2 26.11; Ample 27.84; Euclid 28.96; Hamming_etc. 28.96; M1 28.96; Rogers&Tanimoto 28.96; Simple_Matching 28.96; Sokal 28.96

tcas: AMPLE2 7.50; Binary 7.50; Russel & Rao 7.50; Wong1 7.50; Naish1 8.11; Naish2 8.11; Zoltar 8.11; M2 8.63; Pair Scoring 8.71; Ochiai 9.06; Harmonic_Mean 9.44; Rogot2 9.44; Arithmetic_Mean 9.46; Geometric_Mean 9.60; Anderberg 9.66; Dice 9.66; Jaccard 9.66; Kulczynski1 9.66; Sorensen-Dice 9.66; CBI_Inc. 9.69; Cohen 9.69; QE 9.69; Tarantula 9.69; Ochiai2 10.06; Ample 11.37; Wong3 14.65; Euclid 16.48; Hamming_etc. 16.48; M1 16.48; Rogers&Tanimoto 16.48; Simple_Matching 16.48; Sokal 16.48

tot_info: Naish1 2.99; Naish2 2.99; Zoltar 3.02; M2 3.99; Pair Scoring 4.07; Ochiai 5.07; Geometric_Mean 5.70; Arithmetic_Mean 5.78; Harmonic_Mean 5.87; Rogot2 5.87; Anderberg 6.13; Dice 6.13; Jaccard 6.13; Sorensen-Dice 6.13; AMPLE2 6.33; Binary 6.33; Russel & Rao 6.33; Wong1 6.33; Cohen 6.77; CBI_Inc. 6.92; QE 6.92; Tarantula 6.92; Ochiai2 9.23; Ample 9.90; Wong3 10.49; Kulczynski1 12.87; Euclid 17.15; Hamming_etc. 17.15; Rogers&Tanimoto 17.15; Simple_Matching 17.15; Sokal 17.15; M1 24.05

Average (ranked): 1. Naish2 4.80; 2. Zoltar 4.81; 3. Naish1 4.89; 4. M2 5.62; 5. Pair Scoring 5.89; 6. Ochiai 6.23; 7. Anderberg 7.73; 8. Dice 7.73; 9. Jaccard 7.73; 10. Sorensen-Dice 7.73; 11. Arithmetic_Mean 7.73; 12. Geometric_Mean 7.96; 13. QE 8.05; 14. Tarantula 8.05; 15. Harmonic_Mean 8.28; 16. Rogot2 8.28; 17. Cohen 8.96; 18. CBI_Inc. 9.03; 19. AMPLE2 9.36; 20. Russel & Rao 9.36; 21. Wong1 9.36; 22. Binary 9.44; 23. Ochiai2 9.97; 24. Wong3 10.90; 25. Ample 11.42; 26. Kulczynski1 13.99; 27. Euclid 14.94; 28. Hamming_etc. 14.94; 29. Rogers&Tanimoto 14.94; 30. Simple_Matching 14.94; 31. Sokal 14.94; 32. M1 21.30


Table 6.4 Comparison of pci between existing SBFL metrics and the Pair Scoring metric.

SBFL Metric | pci for SBFL Metric | Pair Scoring pci | Frequency of Improvement (%) | Frequency of Equivalent (%) | Frequency of Decline (%) | Average Improvement | Average Decline

Naish1 5.32 6.02 0.94% 66.98% 32.08% 30.07 3.08

Naish2 5.15 6.02 0.94% 66.98% 32.08% 12.99 3.08

Jaccard 7.70 6.02 51.89% 48.11% 0.00% 3.24 0.00

Anderberg 7.70 6.02 51.89% 48.11% 0.00% 3.24 0.00

Sorensen-Dice 7.70 6.02 51.89% 48.11% 0.00% 3.24 0.00

Dice 7.70 6.02 51.89% 48.11% 0.00% 3.24 0.00

Tarantula 7.96 6.02 52.83% 47.17% 0.00% 3.66 0.00

QE 7.96 6.02 52.83% 47.17% 0.00% 3.66 0.00

CBI_Inc. 8.41 6.02 53.77% 46.23% 0.00% 4.44 0.00

Simple_Matching 16.10 6.02 91.51% 8.49% 0.00% 11.01 0.00

Sokal 16.10 6.02 91.51% 8.49% 0.00% 11.01 0.00

Rogers&Tanimoto 16.10 6.02 91.51% 8.49% 0.00% 11.01 0.00

Hamming_etc. 16.10 6.02 91.51% 8.49% 0.00% 11.01 0.00

Euclid 16.10 6.02 91.51% 8.49% 0.00% 11.01 0.00

Wong1 8.18 6.02 65.09% 0.00% 34.91% 6.25 5.48

Russel & Rao 8.18 6.02 65.09% 0.00% 34.91% 6.25 5.48

Binary 8.34 6.02 65.09% 0.00% 34.91% 6.50 5.48

Ochiai 6.43 6.02 33.02% 62.26% 4.72% 1.53 2.01

M2 5.85 6.02 3.77% 79.25% 16.98% 0.46 1.14

AMPLE2 8.18 6.02 65.09% 0.00% 34.91% 6.25 5.48

Wong3 11.16 6.02 9.43% 59.43% 31.13% 64.17 2.93

Arithmetic_Mean 7.41 6.02 46.23% 50.00% 3.77% 3.13 1.71

Cohen 8.36 6.02 52.83% 47.17% 0.00% 4.44 0.00

Kulczynski1 10.89 6.02 56.60% 43.40% 0.00% 8.60 0.00

M1 19.33 6.02 96.23% 3.77% 0.00% 13.83 0.00

Ochiai2 9.22 6.02 54.72% 45.28% 0.00% 5.84 0.00

Zoltar 5.17 6.02 1.89% 66.04% 32.08% 6.85 3.06

Ample 10.54 6.02 61.32% 38.68% 0.00% 7.37 0.00

Geometric_Mean 7.54 6.02 46.23% 53.77% 0.00% 3.29 0.00

Harmonic_Mean 7.72 6.02 45.28% 42.45% 12.26% 3.89 0.50

Rogot2 7.72 6.02 45.28% 42.45% 12.26% 3.89 0.50


Based on the results in Table 6.2 and Table 6.3, the proposed Pair Scoring metric outperforms the majority of the 31 existing SBFL metrics. Overall, when the average accuracy over the seven programs in the Siemens Test Suite is taken into account, Pair Scoring outperforms 27 out of 31 (or 84.3%) of the existing SBFL metrics. Moreover, Pair Scoring is the best performing SBFL metric for two programs, namely print_tokens and replace.

Table 6.4 presents the average pci over the 106 faulty versions for each of the existing SBFL metrics and the proposed Pair Scoring metric. The results are consistent with the per-program analysis presented in Table 6.2 and Table 6.3: the Pair Scoring metric outperforms 27 out of 31 (or 84.3%) of the existing SBFL metrics (rows shaded in grey in Table 6.4), the exceptions being {Naish1, Naish2, M2, and Zoltar}. More remarkably, it can be observed from Table 6.4 that the "frequency of decline" for Pair Scoring is zero for 18 of the 27 SBFL metrics, namely {Jaccard, Anderberg, Sorensen-Dice, Dice, Tarantula, QE, CBI_Inc., Simple_Matching, Sokal, Rogers&Tanimoto, Hamming_etc., Euclid, Cohen, Kulczynski1, M1, Ochiai2, Ample, and Geometric_Mean}. This implies that Pair Scoring is never less accurate (never has a higher pci) than these SBFL metrics in locating the faulty line of code in the faulty programs evaluated.

6.5 Conclusion

The aim of Spectrum-based Fault Localization (SBFL) is to save time and cost in the debugging process. Many SBFL metrics have been formulated to rank software code according to its likeliness to be faulty. A good SBFL metric ranks the faulty line of code high, minimizing the percentage of code that needs to be inspected by the software developer during the fault localization process. Although many SBFL metrics have been proposed, each is designed differently to rank the suspected lines of code, which gives every SBFL metric a unique capability in fault localization.

In this chapter, we proposed a new SBFL metric named Pair Scoring. This technique works by comparing the execution paths of a pair of pass and fail test cases and assigning a score to


each line of code according to its likeliness to be the faulty line of code. All possible

combinations of pass and fail test cases are paired for scoring, and the total score for each line is

used to rank its likeliness to be faulty. We evaluated the accuracy of the proposed metric on

faulty versions of seven programs in the Siemens Test Suite and compared its accuracy with

other existing SBFL metrics. Despite its simplicity, we found the proposed metric outperformed

27 of 31 existing SBFL metrics.
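As a rough illustration of the pairing idea, the sketch below scores lines over all pass/fail pairs. The exact scoring rule of the Pair Scoring metric is the one defined in Chapter 6; the simplified rule here, which rewards a line executed by the fail test case but not by the pass test case of each pair, is an assumption made for illustration only, as are the data structures and function names.

```python
from itertools import product

def pair_scores(fail_spectra, pass_spectra):
    """fail_spectra / pass_spectra: lists of sets of executed line numbers,
    one set per fail or pass test case. Returns a total score per line;
    a higher total indicates a more suspicious line."""
    scores = {}
    # Pair every fail test case with every pass test case
    for f, p in product(fail_spectra, pass_spectra):
        for line in f:
            # Non-contradicting evidence: executed on failure, not on pass
            scores[line] = scores.get(line, 0) + (1 if line not in p else 0)
    return scores
```

For example, with one fail spectrum {1, 2} and two pass spectra {2, 3} and {2}, line 1 accumulates a score of 2 while line 2 scores 0, so line 1 would be ranked first for inspection.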


7 Conclusion

Software testing and debugging are not only the most expensive but also the most time-

consuming activities in the software development life cycle [15][30][34][69][88]. If failures are

detected during the testing process, then the software debugging process has to be undertaken to

locate and correct the faulty software code that causes the failure [7][78]. The longer the software

code, the more time it may take to locate the code that causes the failure. Therefore, various

fault localization techniques have been proposed to reduce the time required to locate the faulty

line of code. One of the promising and widely studied fault localization techniques is Spectrum-

based Fault Localization (SBFL).

As introduced in Chapter 2, the SBFL technique utilizes the code execution profiles of

test cases, commonly known as “spectra”, to rank the likeliness of a particular line of code to be

faulty. Based on the intuition that faulty code is more likely to be executed by fail test cases and

not executed by pass test cases, and vice versa, various SBFL metrics (as presented in Chapter

2.3) have been proposed to compute a score for every line of code to rate its likeliness to be

faulty. The software codes are then ranked from the highest score to the lowest score. To locate

the faulty line of code, a software developer will inspect the highest ranked line of code first

followed by the lower ranked line of code until the faulty line of code is located.
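The ranking workflow described above can be sketched in code. The sketch below uses the well-known Tarantula metric as one example scoring formula; the spectrum representation (a hit list per line) and the function names are illustrative assumptions, not the implementation used in this thesis.

```python
def tarantula(ef, ep, total_fail, total_pass):
    """Tarantula suspiciousness: higher means more likely faulty.
    ef / ep: number of fail / pass test cases that executed the line."""
    if ef == 0:
        return 0.0
    fail_ratio = ef / total_fail
    pass_ratio = ep / total_pass if total_pass else 0.0
    return fail_ratio / (fail_ratio + pass_ratio)

def rank_lines(spectra, results):
    """spectra: {line_no: [1/0 hit per test case]};
    results: [True if the test case passed, False if it failed].
    Returns line numbers ordered for inspection (most suspicious first)."""
    total_pass = sum(results)
    total_fail = len(results) - total_pass
    scores = {}
    for line, hits in spectra.items():
        ef = sum(h for h, r in zip(hits, results) if not r)
        ep = sum(h for h, r in zip(hits, results) if r)
        scores[line] = tarantula(ef, ep, total_fail, total_pass)
    # The developer inspects from the highest score downward
    return sorted(scores, key=scores.get, reverse=True)
```

Here a line executed by the failing test but by fewer passing tests receives a higher score, so it is inspected earlier.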

In order to save time and cost in the software testing phase, software testers could reduce

the number of test cases by adopting various test suite minimization techniques. These savings

at the testing phase may come at the expense of the accuracy of SBFL metrics in the debugging

process. In the most extreme scenarios, the fault localization process may be started with only

one fail test case, one pass test case or no pass test case. In addition to constraints in time and

cost of testing, these scenarios could also occur in practice due to extremely high or extremely

low failure rates of the software under test.

However, limited test case execution profiles may reduce the accuracy of SBFL metrics.

In Chapter 3, we conducted experiments on faulty programs from Siemens Test Suite to

evaluate the accuracy of SBFL metrics in these extreme scenarios. We found that most SBFL


metrics became less accurate under the extreme scenarios where there was only one fail test

case, one pass test case or no pass test case. Based on the experimental results, we also identified

the best performing SBFL metrics for each of these scenarios. This can be used as guidance for

SBFL practitioners to choose the most suitable SBFL metrics to use when faced with such

extreme test case composition during the debugging process. In addition, we further discovered

the convergence in accuracy for SBFL metrics under these extreme scenarios, which implies

that a number of SBFL metrics are equally good for adoption in the extreme scenarios where

there is only one fail test case, one pass test case or no pass test case.

A more significant observation from these experiments is that certain groups of SBFL

metrics performed better under the “one pass all fail” scenario and “no pass all fail” scenario

than the “normal scenario” where there are many pass test cases. This suggests that the absence

of excessive pass test cases may reduce noisy spectra and improve the accuracy of these groups

of SBFL metrics. This motivated us to further study potential noise reduction schemes that can

filter or remove these test cases with noisy spectra with the objective to improve the accuracy of


SBFL metrics.

In Chapter 4, we proposed six new noise reduction schemes to remove test

cases which provide duplicated, ambiguous and contradicting information. We conducted

experiments on 62 faulty versions of seven programs from the Siemens Test Suite to evaluate the

resulting improvements in accuracy for over 30 existing SBFL metrics. From the experimental

results, we found that up to 27% of the test cases in the Siemens Test Suite have identical spectra.

This significantly high percentage of test cases with identical spectra is essentially noise that

may in turn deteriorate the accuracy of the SBFL metrics.
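As a minimal sketch of how such noise could be filtered, the example below keeps one representative per identical spectrum-and-verdict combination. The six actual schemes are those defined in Chapter 4; this simplified duplicate filter, and its data format, are assumptions for illustration.

```python
def drop_duplicate_spectra(test_cases):
    """test_cases: list of (spectrum, passed) pairs, where spectrum is a
    hashable tuple of per-line hit counts and passed is a boolean verdict.
    Keeps only the first test case for each distinct (spectrum, verdict)."""
    seen = set()
    kept = []
    for spectrum, passed in test_cases:
        key = (spectrum, passed)
        if key not in seen:
            seen.add(key)
            kept.append((spectrum, passed))
    return kept
```

Test cases with identical spectra contribute no new information to the ranking, so removing the duplicates shrinks the suite without changing which lines are distinguishable.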

Experiments were conducted by applying the six proposed noise reduction schemes on

programs in the Siemens Test Suite to evaluate the effects of these schemes on the accuracy of

SBFL metrics. The experiment results showed that the proposed noise reduction schemes have

successfully improved the accuracy of SBFL metrics under study. In addition, we have further

identified the best performing noise reduction scheme for each SBFL metric. These findings


provide a guide for SBFL practitioners to select the best performing noise reduction scheme for

the SBFL metrics that they use.

In view of the positive results of the noise reduction schemes, in Chapter 5, we designed

and developed an SBFL tool with built-in Noise Reduction Schemes (NRS) as a test case

pre-processor. Different from existing SBFL tools, which focus on SBFL metrics, we have

developed an SBFL tool with a novel test case pre-processor that allows users to filter out test

cases that provide contradicting, duplicated or other noisy spectra. This additional stage of

pre-processing is carried out before SBFL metrics are used to analyse the

spectrum and rank the lines of code according to their suspiciousness to be faulty. This test case

pre-processor can be used by all existing SBFL metrics proposed by other researchers. We

embedded nine Noise Reduction Schemes (NRSs) into the proposed tool: six of the NRSs were

developed in Chapter 4, two in Chapter 3, and one was adopted from a previous study on

using non-redundant test cases for SBFL in [54]. In addition, this tool is also able to

automatically recommend the best performing NRS to improve the accuracy of the SBFL metric

selected by the user based on the findings from Chapter 3 and Chapter 4.

A case study was conducted to evaluate the effectiveness of the proposed SBFL tool

with the test case pre-processor in locating the faulty line of code in seven faulty programs

sourced from the Siemens Test Suite. The results of the case study showed that the proposed

SBFL tool successfully improved the accuracy of SBFL metrics in a majority of the cases studied.

Inspired by the findings from the previous chapters on noise reduction schemes, in Chapter

6, we attempted to design and formulate a new SBFL metric by assigning a higher score to a

line of code if the spectra from a pair of pass and fail test cases provide non-contradicting

information that it is likely to be faulty, and vice-versa. This new SBFL metric is named “Pair

Scoring Metric”. This new SBFL metric works by comparing the execution paths of a pair of

pass and fail test cases and assigns a score to each line of code according to its likeliness to be the

faulty line of code. All possible combinations of pass and fail test cases are paired for scoring

and the total score for each line is used to rank its likeliness to be faulty. We evaluated the


accuracy of the proposed metric on faulty versions of seven programs in the Siemens Test Suite

and compared its accuracy with other existing SBFL metrics. Despite its simplicity, we found

that the proposed Pair Scoring metric outperformed 27 of 31 existing SBFL metrics.

In summary, this thesis made the following contributions towards improving the accuracy

of Spectrum-based Fault Localization (SBFL) techniques:

1. We found most SBFL metrics become less accurate under the extreme composition of

pass and fail test cases. The best performing SBFL metrics for the extreme scenarios of

“one fail all pass”, “one pass all fail” and “no pass all fail” test case compositions

were identified. We also discovered the convergence in the accuracy of SBFL metrics

under these extreme scenarios. More significantly, we found that a small group of

SBFL metrics are more accurate under these extreme scenarios.

2. We proposed six new noise reduction schemes to filter test cases which provide

duplicated, ambiguous and contradicting information and evaluated the resulting

accuracy improvements in SBFL metrics.

3. Based on Contribution 2, we provided a simple guide for SBFL practitioners to select

the best performing noise reduction scheme to improve the accuracy of the SBFL

metrics that they use.

4. We designed and developed an SBFL tool with a novel test case pre-processor which

allows SBFL practitioners to apply noise reduction schemes to pre-process and filter

test cases with contradicting, duplicated and ambiguous information prior to applying

SBFL.

5. We proposed a new SBFL metric named “Pair Scoring”. This technique compares the

execution paths of every possible pair of pass and fail test cases and assigns a score to

each line of code according to its likeliness to be faulty. We evaluated the accuracy of

the proposed metric and compared it with other existing SBFL metrics. Despite its

simplicity, we found the proposed metric outperformed the majority of the existing

SBFL metrics.


7.1 Threats to Validity, Limitations and Future Work

As the evaluation of the accuracy of SBFL was based on experiments conducted on

faulty versions of seven programs in the Siemens Test Suite, the findings in this thesis are

susceptible to threats to validity in the same way as other empirical studies on SBFL. Three

threats to validity have been considered and addressed in our experiments. First is the threat to

internal validity, which refers to errors or experimental bias. We have double checked the code

and implementation of experiments to ensure that there is no error in these experiments to the

best of our knowledge. The second is the threat to external validity, which refers to the

generalizability of the experimental findings to other faulty programs. To address this threat, we

have reviewed 40 recent publications in the area of SBFL in Chapter 2 (Section 2.4) and

identified the seven programs in the Siemens Test Suite as the most commonly used test

subjects in empirical studies for SBFL. We have analysed 132 faulty versions from the seven

programs in the Siemens Test Suite and included all faulty versions that meet the requirements in

our experiments. The last threat is the threat to construct validity, which refers to the suitability

of our evaluation measure for SBFL accuracy. We have used the percentage of code inspected

(pci) to locate the faulty line of code, as has been done to evaluate the accuracy of SBFL in most

past SBFL studies. We note that pci may be known as the EXAM score [56][82][86][87] or

EXPENSES [24][73] in other SBFL literature, but the formulas and methods of calculation are

the same as pci.
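For concreteness, pci can be computed as a simple percentage. This sketch assumes a 1-based rank of the faulty line within the suspiciousness ranking; the function name is illustrative.

```python
def pci(rank_of_faulty_line, total_lines):
    """Percentage of code inspected before the fault is found, assuming
    lines are examined in descending order of suspiciousness.
    rank_of_faulty_line is 1-based within the ranking."""
    return 100.0 * rank_of_faulty_line / total_lines
```

A faulty line ranked 5th in a 100-line program therefore gives a pci of 5.0, so a lower pci means a more accurate SBFL metric.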

As for the limitations of this thesis, we have focused our empirical study only on faulty

programs with a single fault in the program code. We excluded faulty programs that contain

multiple faults to allow the experiments to be conducted in a controlled way with only

one variable changing at a time. As mentioned in the threat to external validity, the testing

subjects used in this thesis are limited to the faulty versions of the seven programs from the

Siemens Test Suite.

As for future work, we plan to expand these studies to multiple-fault versions and

develop new experimental setups and methods to evaluate the accuracy of SBFL metrics for


programs with multiple faults. An interesting future research direction from this thesis is to

study the accuracy of SBFL metrics at a different level of granularity. It could be worthwhile to

collect the spectra and evaluate the improvement in the accuracy of SBFL metrics for blocks or

modules of program code rather than for lines of code. Though challenging, another potential

future research direction is to perform theoretical analysis on the impact of test case filtering on

the accuracy of SBFL metrics.


8 References

[1] R. Abreu, P. Zoeteweij, and A. van Gemund. On the Accuracy of Spectrum-based Fault

Localization. Testing: Academic and Industrial Conference – Practice and Research

Techniques, IEEE, 2007, pp.89 – 98.

[2] R. Abreu, P. Zoeteweij, and A. van Gemund. Spectrum-based Multiple Fault

Localization. IEEE/ACM International Conference on Automated Software Engineering,

2009, pp. 88 – 99.

[3] R. Abreu, W. Mayer, M. Stumptner, and A. van Gemund. Refining Spectrum-based Fault

Localization Rankings. SAC'09, ACM, 2009, pp. 409 – 414.

[4] R. Abreu, A. Gonzalez-Sanchez, and A. van Gemund. Exploiting Count Spectra for

Bayesian Fault Localization. PROMISE, ACM, 2010, pp. 1 – 10.

[5] E. Alves, M. Gligoric, V. Jagannath, and M. d'Amorim. Fault-Localization Using

Dynamic Slicing and Change Impact Analysis. IEEE, 2011, pp. 520 – 523.

[6] M. R. Anderberg. Cluster Analysis for Applications. Monographs and Textbooks on

Probability and Mathematical Statistics, 1973.

[7] S. Artzi, J. Dolby, F. Tip, and M. Pistoia. Directed Test Generation for Effective Fault

Localization. ISSTA'10, ACM, 2012, pp. 49 – 59.

[8] G. K. Baah, A. Podgurski, and M. J. Harrold. Causal Inference for Statistical Fault

Localization. ISSTA'10, ACM, 2010, pp. 73 – 83.

[9] A. Bandyopadhyay, and S. Ghosh. Proximity Based Weighting of Test Cases to Improve

Spectrum Based Fault Localization. ASE, IEEE, 2011, pp. 420 – 423.

[10] A. Bandyopadhyay. Improving Spectrum-Based Fault Localization Using Proximity-

Based Weighting of Test Cases. ASE, IEEE, 2011, pp. 660 – 664.

[11] A. Bandyopadhyay. Mitigating the Effect of Coincidental Correctness in Spectrum Based

Fault Localization. IEEE Fifth International Conference on Software Testing, Verification

and Validation, 2012, pp. 479 – 482.

[12] J. Campos, A. Riboira, A. Perez, and R. Abreu. GZoltar: An Eclipse Plug-In for Testing

and Debugging. ASE'12, ACM, 2012, pp. 378 – 381.


[13] W. Chan, S. Cheung, and K. Leung. A metamorphic testing approach for online testing of

service-oriented software applications. International Journal of Web Services Research,

2007, Vol. 4, No. 2, pp. 61 – 81.

[14] T. Y. Chen, S. C. Cheung, and S. M. Yiu. Metamorphic testing: a new approach for

generating next test cases. Technical Report HKUST-CS98-01, Dept. of Computer

Science, Hong Kong Univ. of Science and Technology, Tech. Rep. 1998.

[15] Y. Chung, C. Huang, and Y. Huang. A Study of Modified Testing-Based Fault

Localization Method. 14th Pacific Rim International Symposium on Dependable

Computing, IEEE, 2008, pp. 168 – 175.

[16] J. Cohen. A Coefficient of Agreement for Nominal Scales. Educational and Psychological

Measurement. 1960, Vol. 20, No. 1, pp. 37 – 46.

[17] B. Crothers. Toyota sued for fatal crash linked to throttle. CNET,

www.news.cnet.com/9301-13924_3-10448199-64.html, 5 February 2010.

[18] B. Crothers. Toyota software bugs unlike those in flaky PCs. CNET,

www.news.cnet.com/8301-13924_3-10454331-64.html, 18 February 2010.

[19] V. Dallmeier, C. Lindig, and A. Zeller. Lightweight Bug Localization with AMPLE. 6th

International Symposium on Automated Analysis-driven Debugging, ACM, 2005, pp. 99

– 104.

[20] B. C. Dean, W. B. Pressly, B. A. Malloy, and A. A. Whitely. A linear programming

Approach for Automated Localization of Multiple Faults. IEEE/ACM International

Conference on Automated Software Engineering, IEEE, 2009, pp. 640 – 644.

[21] H. Do, S. Elbaum, and G. Rothermel. Supporting Controlled Experimentation with Testing

Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineering,

Springer, 2005, pp. 405 – 435.

[22] B. Everitt. Graphical Techniques for Multivariate Data, Elsevier, ISBN-10: 0444194614,

1978.

[23] Y. Gao, Z. Zhang, L. Zhang, C. Gong, and Z. Zheng. A Theoretical Study: The Impact of

Cloning Failed Test Cases on the Effectiveness of Fault Localization. 13th International

Conference on Quality Software, IEEE, 2013, pp. 288 – 291.


[24] C. Gong, Z. Zheng, W. Li, and P. Hao. Effects of Class Imbalance in Test Suites: An

Empirical Study of Spectrum-Based Fault Localization. IEEE 36th International

Conference on Computer Software and Applications Workshops, 2012, pp.470 – 475.

[25] L. Gong, D. Lo, L. Jiang, and H. Zhang. Diversity Maximization Speedup for Fault

Localization. ASE'12, ACM, 2012, pp. 30 – 39.

[26] A. Gonzalez. Automatic Error Detection Techniques based on Dynamic Invariants.

Master's Thesis, Delft University of Technology, The Netherlands. 2007.

[27] A. Gonzalez-Sanchez, E. Piel, H. G. Gross, and A. van Gemund. Prioritizing Tests for

Software Fault Localization. 10th International Conference on Quality Software, IEEE,

2010, pp. 42 – 51.

[28] A. Gonzalez-Sanchez, R. Abreu, H. G. Gross, and A. van Gemund. Prioritizing Tests for

Fault Localization through Ambiguity Group Reduction. ASE, IEEE, 2011, pp. 83 – 92.

[29] A. Gonzalez-Sanchez, E. Piel, R. Abreu, H. G. Gross, and A. van Gemund. Prioritizing

tests for software fault diagnosis. Software – Practice and Experience, John Wiley &

Sons, 2011, pp. 1105 – 1129.

[30] D. Gopinath, R. N. Zaeem, and S. Khurshid. Improving the Effectiveness of Spectra-

Based Fault Localization Using Specifications. ASE‟12, ACM, 2012, pp. 40 – 49.

[31] A. Gotlieb and B. Botella. Automated Metamorphic Testing. 27th Annual International

Conference on Computer Software and Applications (COMPSAC). 2003, pp. 34 – 40.

[32] H. Guo, and H. L. Victor. Learning from Imbalanced Data Sets with Boosting and Data

Generation: The DataBoost-IM Approach. Sigkdd Explorations, Vol. 6, Issue 1, pp. 30 –

39.

[33] N. Gupta, H. He, X. Zhang, and R. Gupta. Locating Faulty Code Using Failure-Inducing

Chops. ASE'05, ACM, 2005, pp. 263 – 272.

[34] B. Hailpern and P. Santhanam. Software debugging, testing, and verification. IBM

Systems Journal, 2002, Vol. 40, No. 1.

[35] R. Hamming. Error Detecting and Error Correcting Codes. Bell System Technical

Journal. 1950, pp. 147 – 160.

[36] D. Hao, L. Zhang, Y. Pan, H. Mei, and J. Sun. A Similarity-Aware Approach to testing

Based Fault Localization. ASE'05, ACM, 2005, pp. 291 – 294.


[37] D. Hao, L. Zhang, Y. Pan, H. Mei, and J. Sun. On Similarity-awareness in testing-based

fault localization. Autom Softw Eng, Springer, 2008, pp. 207 – 249.

[38] M. J. Harrold, G. Rothermel, R. Wu, and L. Yi. An Empirical Investigation of Program

Spectra. PASTE'98, ACM, 1998, pp. 83 – 90.

[39] M. J. Harrold, G. Rothermel, K. Sayre, R. Wu, and L. Yi. An empirical investigation of

the relationship between spectra differences and regression faults. Software Testing,

Verification and Reliability, John Wiley & Sons, 2000, pp. 171 – 194.

[40] R. M. Hierons. Avoiding Coincidental Correctness in Boundary Value Analysis. ACM

Transactions on Software Engineering and Methodology, 2006, Vol. 15, No. 3, pp. 227 –

241.

[41] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the Effectiveness of

Dataflow- and Controlflow-Based Test Adequacy Criteria. IEEE, 1994, pp. 191 – 200.

[42] P. Jaccard. Etude Comparative de la Distribution Florale Dans une Portion des Alpes et

des Jura. Bull. Soc, Vaudoise Sci. Nat, 1901, pp. 547 – 579.

[43] T. Janssen, R. Abreu, and A. van Gemund. Zoltar: A toolset for automatic fault

localization. IEEE/ACM International Conference on Automated Software Engineering,

IEEE, 2009, pp. 662–664.

[44] T. Janssen, R. Abreu, and A. van Gemund. Zoltar: A Spectrum-based Fault Localization

Tool. SINTER'09, ACM, 2009, pp. 23 – 29.

[45] D. Jeffrey, N. Gupta, and R. Gupta. Fault Localization Using Value Replacement.

ISSTA'08, ACM, 2008, pp. 167 – 177.

[46] B. Jiang, Z. Zhang, W. K. Chan, and T. H. Tse. Adaptive Random Test Case

Prioritization. IEEE/ACM International Conference on Automated Software Engineering,

IEEE, 2009, pp. 233 – 244.

[47] B. Jiang and W. K. Chan. On the Integration of Test Adequacy, Test Case Prioritization

and Statistical Fault Localization. 10th International Conference on Quality Software,

IEEE, 2010, pp. 377 – 384.

[48] B. Jiang, W. K. Chan, and T. H. Tse. On Practical Adequate Test Suites for Integrated

Test Case Prioritization and Fault Localization. 11th International Conference on Quality

Software, IEEE, 2011, pp. 21 – 30.


[49] J. A. Jones, M. J. Harrold, and J. T. Stasko. Visualization for Fault Localization.

International Conference of Software Engineering Workshop on Software Visualization,

2001, pp. 71 – 75.

[50] J. A. Jones, and M. J. Harrold. Empirical Evaluation of the Tarantula Automatic Fault-

Localization Technique. ASE'05, ACM, 2005, pp. 273 – 282.

[51] E. Krause. Taxicab Geometry. Mathematics Teacher, 1973, pp. 695 – 706.

[52] T. B. Le, F. Thung, and D. Lo. Theory and Practice, Do They Match? A Case with

Spectrum-Based Fault Localization. IEEE International Conference on Software

Maintenance, 2013, pp. 380 – 383.

[53] H. J. Lee, L. Naish, and K. Ramamohanarao. Study of the relationship of bug consistency

with respect to performance of spectra metrics. 2nd IEEE International Conference on

Computer Science and Information Technology, 2009, pp. 501 – 508.

[54] H. J. Lee, L. Naish, and K. Ramamohanarao. The Effectiveness of Using Non-redundant

Test Cases with Program Spectra for Bug Localization. IEEE, 2009, pp. 127 – 134.

[55] H. J. Lee. Software Debugging Using Program Spectra. PhD Thesis, University of

Melbourne, Australia. 2011.

[56] Y. Lei, X. Mao, and T. Y. Chen. Backward-Slice-based Statistical Fault Localization

without Test Oracles. 13th International Conference on Quality Software, IEEE, 2013, pp.

212 – 221.

[57] D. Leon, W. Masri, and A. Podgurski. An Empirical Evaluation of Test Case Filtering

Techniques Based On Exercising Complex Information Flows. ICSE'05, ACM, 2005, pp.

412 – 421.

[58] B. Liblit. Cooperative Bug Isolation. PhD Thesis, University of California. 2004.

[59] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff. Statistical Debugging: A Hypothesis

Testing-Based Approach. IEEE Transactions on Software Engineering, 2006, Vol. 32,

No. 10, pp. 831 – 848.

[60] J. Liu. Bank's IT woes could be just the beginning. BBC NEWS BUSINESS,

www.bbc.co.uk/news/business-18969645, 2 August 2012.


[61] H. Lou, F.C. Kuo, and T.Y. Chen. Comparison of adaptive random testing and random

testing under various testing and debugging scenarios. Software – Practice and Experience,

John Wiley & Sons, 2012, pp. 1055 – 1074.

[62] F. Lourenco, V. Lobo, and F. Bacao. Binary-based Similarity Measures for Categorical

Data and Their Application in Self- Organizing Maps. JOCLAD, 2004.

[63] C. Ma, T. Tan, Y. Chen, and Y. Dong. An If-While-model-based performance evaluation

of ranking metrics for spectra-based fault localization. IEEE 37th Annual Computer

Software and Applications Conference, 2013, pp. 609 – 618.

[64] Y. S. Ma, J. Offutt, and Y. R. Kwon. MuJava: an automated Class mutation system.

Software Testing, Verification and Reliability, 2005, Vol. 15, No. 2, pp. 97 – 133.

[65] A. Maxwell and A. Pilliner. Deriving Coefficients of Reliability and Agreement for

Ratings. Br J Math Stat Psychol, 1968, Vol. 21, No. 1, pp. 105 – 116.

[66] L. J. Morell. A Theory of Fault-Based Testing. IEEE Transaction on Software

Engineering, 1990, Vol. 16, No. 8, pp. 844 – 857.

[67] C. Murphy, K. Shen, and G. Kaiser. Using JML runtime assertion checking to automate

metamorphic testing in applications without test oracles. 2nd International Conference on

Software Testing, Verification and Validation (ICST), 2009, pp. 436 – 445.

[68] C. Murphy. Metamorphic testing techniques to detect defects in applications without test

oracles. PhD dissertation, Columbia University, 2010.

[69] G. J. Myers. The Art of Software Testing. 2nd edn. John Wiley and Sons, Revised and

updated by T.Badgett and T. M. Thomas with C. Sandler: Hoboken, 2004.

[70] L. Naish, H.J. Lee, and K. Ramamohanarao. A Model for Spectra-based Software

Diagnosis. ACM Transactions on Software Engineering and Methodology, Vol. 20, No.

3, Article 11, August 2011.

[71] A. Ochiai. Zoogeographic Studies on the Soleoid Fishes found in Japan and its

Neighbouring Regions. Bull. Jpn. Soc. Sci. Fish, 1957, pp. 526 – 530.

[72] C. Parnin, and A. Orso. Are Automated Debugging Techniques Actually Helping

Programmers? ISSTA'11, ACM, 2011, pp. 199 – 209.


[73] P. Rao, T. Y. Chen, Z. Zheng, N. Wang, and K. Chai. Impacts of Test Suite's Class

Imbalance on Spectrum-Based Fault Localization Techniques. 13th International

Conference on Quality Software, IEEE, 2013, pp. 260 - 267.

[74] D. Rogers and T. Tanimoto. A Computer Program for Classifying Plants. Science, 1960,

Vol. 132, No. 3434, pp. 1115 – 1118.

[75] E. Rogot and I. Goldberg. A proposed Index for Measuring Agreement in Test-Retest

Studies. Journal of Chronic Diseases. 1966, Vol. 19, No. 9, pp. 991 – 1006.

[76] P. Russel and T. Rao. On Habitat and Association of Species of Anopheline Larvae in

South-Eastern Madras. Journal of the Malaria Institute of India, 1940, pp. 153 – 178.

[77] R. Santelices, J. A. Jones, Y. Yu, and M. J. Harrold. Lightweight Fault-Localization using

Multiple Coverage Types. ICSE'09, IEEE, 2009, pp. 56 – 66.

[78] N. D. Singpurwalla. Determining an Optimal Time Interval for Testing and Debugging

Software. IEEE Transactions on Software Engineering, 1991, Vol. 17, No. 4, pp. 313 –

319.

[79] R. Sokal and C. Michener. A Statistical Method for Evaluating Systematic Relationship.

Multivariate Statistical Methods, Among-Groups Covariation. 1975, pp. 1409 – 1438.

[80] T. Sorensen. A Method of Establishing Groups of Equal Amplitude in Plant Sociology

based on Similarity of Species Content and its Applications to Analyses of the Vegetation

on Danish Commons. K. danske videntsk. Selsk. 1948.

[81] S. Wang, D. Lo, L. Jiang, Lucia, and H. C. Lau. Search-Based Fault Localization. ASE,

IEEE, 2011, pp. 556 – 559.

[82] W. E. Wong, J. R. Horgan, S. London, and A. P. Mathur. Effect of Test Set Minimization

on Fault Detection Effectiveness. ICSE'95, ACM, 1995, pp. 41 – 50.

[83] W. E. Wong, Y. Qi, L. Zhao, and K. Cai. Effective Fault Localization using Code

Coverage. 31st Annual International Computer Software and Application Conference

(COMPSAC). IEEE, 2007, pp. 449 – 456.

[84] W. E. Wong, V. Debroy, and D. Xu. Towards Better Fault Localization: A Crosstab-

Based Statistical Approach. IEEE Transactions on Systems, Man, and Cybernetics – part

C: Applications and Reviews, 2012, Vol. 42, No. 3, pp. 378 – 396.


[85] X. Y. Xie, T. Y. Chen, and B. W. Xu. Isolating Suspiciousness from Spectrum-Based

Fault Localization Techniques. 10th International Conference on Quality Software, IEEE,

2010, pp.385 – 392.

[86] X. Y. Xie, W. E. Wong, T. Y. Chen, and B. W. Xu. Spectrum-Based Fault Localization:

Testing Oracles Are No Longer Mandatory. 11th International Conference On Quality

Software, IEEE, 2011, pp.1 – 10.

[87] X. Y. Xie, T. Y. Chen, F. C. Kuo, and B. W. Xu. A Theoretical Analysis of the Risk

Evaluation Formulas for Spectrum-Based Fault Localization. ACM Transactions on

Software Engineering and Methodology, ACM, 2013, Vol. 22, No. 4, Article 31.

[88] C. Yilmaz, A. Paradkar, and C. Williams. Time Will Tell: Fault Localization Using Time

Spectra. ICSE'08, ACM, 2008, pp. 81 – 90.

[89] S. Yoo, M. Harman, and D. Clark. Fault Localization Prioritization: Comparing

Information-Theoretic and Coverage-Based Approaches. ACM Transactions on Software

Engineering and Methodology, 2013, Vol. 22, No. 3, Article 19, pp. 1 – 29.

[90] Y. You, C. Huang, K. Peng, and C. Hsu. Evaluation and Analysis of Spectrum-Based

Fault Localization with Modified Similarity Coefficients for Software Debugging. IEEE

37th Annual Computer Software and Application Conference, 2013, pp. 180 – 189.

[91] Y. Yu, J. A. Jones, and M. J. Harrold. An Empirical Study of the Effects of Test-Suite

Reduction on Fault Localization. ICSE'08, ACM, 2008, pp. 201 – 210.

[92] X. Zhang, N. Gupta, and R. Gupta. Locating Faults Through Automated Predicate

Switching. ICSE'06, ACM, 2006, pp. 272 – 281.

[93] X. Zhang, Q. Gu, X. Chen, J. Qi, and D. Chen. A Study on Relative Redundancy in Test-

Suite Reduction While Retaining or Improving Fault-Localization Effectiveness.

SAC'10, ACM, 2010, pp. 2229 – 2236.

[94] L. Zhao, Z. Zhang, L. Wang, and X. Yin. PAFL: Fault Localization via Noise Reduction

on Coverage Vector. SEKE, 2011, pp. 203 – 206.

[95] Lloyds Banking Group and Co-op hit by system errors. BBC NEWS BUSINESS,

www.bbc.co.uk/news/business-19846157, 5 October 2012.


9 Appendix

Table 9.1 print_tokens pci (%) under the normal scenario and three extreme scenarios

print_tokens

SBFL Metrics NORMAL SCENARIO 1 Fail All Pass 1 Pass All Fail No Pass All Fail

Naish1 0.36 2.01 1.90 8.70

Naish2 0.36 2.01 1.90 8.70

Jaccard 1.48 2.01 1.90 8.70

Anderberg 1.48 2.01 1.90 8.70

Sorensen-Dice 1.48 2.01 1.90 8.70

Dice 1.48 2.01 1.90 8.70

Tarantula 2.55 2.01 5.09 8.70

QE 2.55 2.01 5.09 12.43

CBI_Inc. 2.55 2.01 9.46 12.43

Simple_Matching 5.68 5.86 1.91 8.70

Sokal 5.68 5.86 1.91 8.70

Rogers&Tanimoto 5.68 5.86 1.91 8.70

Hamming_etc. 5.68 5.86 1.91 8.70

Euclid 5.68 5.86 1.91 8.70

Wong1 8.70 11.13 8.70 8.70

Russel & Rao 8.70 11.13 8.70 8.70

Binary 8.70 11.13 8.70 8.70

Ochiai 0.36 2.01 1.90 8.70

M2 0.36 2.01 1.90 8.70

AMPLE2 8.70 11.13 93.43 8.70

Wong3 0.36 47.18 1.91 8.70

Arithmetic_Mean 0.41 2.01 7.77 8.70

Cohen 2.25 2.01 7.77 8.70

Kulczynski1 22.20 19.24 55.87 58.02

M1 26.94 27.12 57.22 58.02

Ochiai2 0.41 2.01 8.16 58.02

Zoltar 0.36 2.01 1.90 8.70

Ample 0.36 2.03 8.63 8.70

Geometric_Mean 0.41 2.01 7.77 8.70

Harmonic_Mean 1.01 6.24 10.94 8.70

Rogot2 1.01 6.24 10.94 8.70
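The pci (percentage of code inspected) values tabulated above come from ranking statements by suspiciousness and inspecting from the top of the ranking. The following is an illustrative Python sketch, not the thesis's implementation (function names, variable names, and the toy spectrum are our own), showing how two of the listed metrics, Ochiai and Tarantula, score a statement from its spectrum counts (ef/ep = failing/passing tests executing the statement, nf/np = failing/passing tests not executing it) and how a pci-style figure follows from the ranking:

```python
import math

def ochiai(ef, ep, nf, np_):
    """Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep)); 0 if undefined."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def tarantula(ef, ep, nf, np_):
    """Tarantula suspiciousness: %failed / (%failed + %passed); 0 if undefined."""
    total_fail, total_pass = ef + nf, ep + np_
    fr = ef / total_fail if total_fail else 0.0
    pr = ep / total_pass if total_pass else 0.0
    return fr / (fr + pr) if (fr + pr) else 0.0

def pci(scores, faulty_index):
    """Percentage of code inspected: statements are examined in descending
    score order; ties are resolved pessimistically (faulty statement last)."""
    fault_score = scores[faulty_index]
    rank = sum(1 for s in scores if s >= fault_score)  # worst-case rank
    return 100.0 * rank / len(scores)

# Toy spectrum: (ef, ep, nf, np) per statement, from 2 failing / 3 passing tests.
spectra = [(2, 0, 0, 3), (1, 2, 1, 1), (0, 3, 2, 0)]
scores = [ochiai(*s) for s in spectra]
print(pci(scores, 0))  # faulty statement ranked first, so pci is low
```

A lower pci is better: it means the developer inspects less code before reaching the fault, which is the sense in which the table values should be read.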


Table 9.2 print_tokens2 pci accuracy under three extreme scenarios

print_tokens2

SBFL Metrics | Normal Scenario | 1 Fail, All Pass | 1 Pass, All Fail | No Pass, All Fail

Naish1 1.29 6.01 4.88 10.90

Naish2 1.29 6.01 4.88 10.90

Jaccard 7.91 6.01 4.88 10.90

Anderberg 7.91 6.01 4.88 10.90

Sorensen-Dice 7.91 6.01 4.88 10.90

Dice 7.91 6.01 4.88 10.90

Tarantula 8.08 6.01 12.76 10.90

QE 8.08 6.01 12.76 16.51

CBI_Inc. 8.08 6.01 24.59 16.51

Simple_Matching 11.70 11.99 4.88 10.90

Sokal 11.70 11.99 4.88 10.90

Rogers&Tanimoto 11.70 11.99 4.88 10.90

Hamming_etc. 11.70 11.99 4.88 10.90

Euclid 11.70 11.99 4.88 10.90

Wong1 10.90 13.55 10.90 10.90

Russel & Rao 10.90 13.55 10.90 10.90

Binary 10.90 13.55 10.90 10.90

Ochiai 4.68 6.01 4.88 10.90

M2 2.80 6.01 4.88 10.90

AMPLE2 10.90 13.55 89.20 10.90

Wong3 1.37 57.47 4.88 10.90

Arithmetic_Mean 6.62 6.01 22.16 10.90

Cohen 8.08 6.01 22.16 10.90

Kulczynski1 24.30 21.18 42.11 57.43

M1 28.10 28.38 42.11 57.43

Ochiai2 7.97 6.01 22.86 57.43

Zoltar 1.29 6.01 4.88 10.90

Ample 7.87 8.65 24.37 10.90

Geometric_Mean 6.69 6.01 22.16 10.90

Harmonic_Mean 7.12 8.05 26.46 10.90

Rogot2 7.12 8.05 26.46 10.90


Table 9.3 replace pci accuracy under three extreme scenarios

replace

SBFL Metrics | Normal Scenario | 1 Fail, All Pass | 1 Pass, All Fail | No Pass, All Fail

Naish1 2.98 4.42 4.55 7.02

Naish2 2.40 4.48 3.97 6.44

Jaccard 3.89 4.42 3.97 6.44

Anderberg 3.89 4.42 3.97 6.44

Sorensen-Dice 3.89 4.42 3.97 6.44

Dice 3.89 4.42 3.97 6.44

Tarantula 4.10 4.42 9.80 6.44

QE 4.10 4.42 9.80 12.98

CBI_Inc. 4.10 4.47 19.45 12.98

Simple_Matching 15.04 15.82 3.80 6.44

Sokal 15.04 15.82 3.80 6.44

Rogers&Tanimoto 15.04 15.82 3.80 6.44

Hamming_etc. 15.04 15.82 3.80 6.44

Euclid 15.04 15.82 3.80 6.44

Wong1 6.44 9.21 6.44 6.44

Russel & Rao 6.44 9.21 6.44 6.44

Binary 7.02 9.21 7.02 7.02

Ochiai 2.66 4.42 3.97 6.44

M2 2.43 4.42 3.97 6.44

AMPLE2 6.44 9.21 90.04 6.44

Wong3 6.26 68.85 3.80 6.44

Arithmetic_Mean 2.99 4.48 16.54 6.44

Cohen 4.08 4.48 16.54 6.44

Kulczynski1 3.89 4.40 39.14 49.25

M1 15.04 15.80 42.72 49.25

Ochiai2 3.62 4.42 16.72 49.25

Zoltar 2.43 4.42 4.03 6.44

Ample 4.89 7.90 19.37 6.44

Geometric_Mean 2.92 4.48 16.54 6.44

Harmonic_Mean 3.10 6.49 18.82 6.44

Rogot2 3.10 6.49 18.82 6.44


Table 9.4 schedule pci accuracy under three extreme scenarios

schedule

SBFL Metrics | Normal Scenario | 1 Fail, All Pass | 1 Pass, All Fail | No Pass, All Fail

Naish1 1.11 2.59 6.89 9.89

Naish2 1.11 2.59 6.89 9.89

Jaccard 1.74 2.59 6.89 9.89

Anderberg 1.74 2.59 6.89 9.89

Sorensen-Dice 1.74 2.59 6.89 9.89

Dice 1.74 2.59 6.89 9.89

Tarantula 1.74 2.59 7.90 9.89

QE 1.74 2.59 7.90 11.86

CBI_Inc. 8.57 9.42 29.44 11.86

Simple_Matching 9.54 10.92 6.89 9.89

Sokal 9.54 10.92 6.89 9.89

Rogers&Tanimoto 9.54 10.92 6.89 9.89

Hamming_etc. 9.54 10.92 6.89 9.89

Euclid 9.54 10.92 6.89 9.89

Wong1 9.89 11.36 9.89 9.89

Russel & Rao 9.89 11.36 9.89 9.89

Binary 9.89 11.36 9.89 9.89

Ochiai 1.35 2.59 6.89 9.89

M2 1.11 2.59 6.89 9.89

AMPLE2 9.89 11.36 92.51 9.89

Wong3 19.28 68.18 6.89 9.89

Arithmetic_Mean 8.04 9.42 29.04 9.89

Cohen 8.57 9.42 29.04 9.89

Kulczynski1 1.74 2.59 27.35 51.82

M1 9.54 10.92 28.35 51.82

Ochiai2 12.41 13.27 18.55 51.82

Zoltar 1.11 2.59 6.89 9.89

Ample 17.68 20.30 39.95 9.89

Geometric_Mean 8.18 9.42 29.04 9.89

Harmonic_Mean 8.18 9.39 30.28 9.89

Rogot2 8.18 9.39 30.28 9.89


Table 9.5 schedule2 pci accuracy under three extreme scenarios

schedule2

SBFL Metrics | Normal Scenario | 1 Fail, All Pass | 1 Pass, All Fail | No Pass, All Fail

Naish1 17.35 23.90 15.90 15.73

Naish2 17.35 23.90 15.90 15.73

Jaccard 23.28 23.90 15.90 15.73

Anderberg 23.28 23.90 15.90 15.73

Sorensen-Dice 23.28 23.90 15.90 15.73

Dice 23.28 23.90 15.90 15.73

Tarantula 23.28 23.90 19.01 15.73

QE 23.28 23.90 19.01 20.87

CBI_Inc. 23.28 23.90 54.23 20.87

Simple_Matching 28.96 28.96 15.91 15.73

Sokal 28.96 28.96 15.91 15.73

Rogers&Tanimoto 28.96 28.96 15.91 15.73

Hamming_etc. 28.96 28.96 15.91 15.73

Euclid 28.96 28.96 15.91 15.73

Wong1 15.73 19.51 15.73 15.73

Russel & Rao 15.73 19.51 15.73 15.73

Binary 15.73 19.51 15.73 15.73

Ochiai 20.46 23.90 15.90 15.73

M2 20.03 23.90 15.90 15.73

AMPLE2 15.73 19.51 86.04 15.73

Wong3 23.87 84.00 15.91 15.73

Arithmetic_Mean 20.83 23.90 54.12 15.73

Cohen 23.28 23.90 54.13 15.73

Kulczynski1 23.28 23.90 17.80 62.85

M1 28.96 28.96 17.87 62.85

Ochiai2 26.11 23.90 57.05 62.85

Zoltar 17.35 23.90 15.90 15.73

Ample 27.84 28.53 61.81 15.73

Geometric_Mean 22.20 23.90 54.13 15.73

Harmonic_Mean 23.25 23.90 54.40 15.73

Rogot2 23.25 23.90 54.40 15.73


Table 9.6 tcas pci accuracy under three extreme scenarios

tcas

SBFL Metrics | Normal Scenario | 1 Fail, All Pass | 1 Pass, All Fail | No Pass, All Fail

Naish1 8.11 9.54 7.55 7.50

Naish2 8.11 9.54 7.55 7.50

Jaccard 9.66 9.54 7.55 7.50

Anderberg 9.66 9.54 7.55 7.50

Sorensen-Dice 9.66 9.54 7.55 7.50

Dice 9.66 9.54 7.55 7.50

Tarantula 9.69 9.54 9.23 7.50

QE 9.69 9.54 9.23 8.54

CBI_Inc. 9.69 9.54 33.82 8.54

Simple_Matching 16.48 16.55 7.55 7.50

Sokal 16.48 16.55 7.55 7.50

Rogers&Tanimoto 16.48 16.55 7.55 7.50

Hamming_etc. 16.48 16.55 7.55 7.50

Euclid 16.48 16.55 7.55 7.50

Wong1 7.50 7.99 7.50 7.50

Russel & Rao 7.50 7.99 7.50 7.50

Binary 7.50 7.99 7.50 7.50

Ochiai 9.06 9.54 7.55 7.50

M2 8.63 9.54 7.55 7.50

AMPLE2 7.50 7.99 76.73 7.50

Wong3 14.65 74.56 8.28 7.50

Arithmetic_Mean 9.46 9.54 32.55 7.50

Cohen 9.69 9.54 33.71 7.50

Kulczynski1 9.66 9.54 25.23 48.37

M1 16.48 16.55 27.13 48.37

Ochiai2 10.06 9.54 34.21 48.37

Zoltar 8.11 9.54 7.55 7.50

Ample 11.37 11.74 35.14 7.50

Geometric_Mean 9.60 9.54 33.85 7.50

Harmonic_Mean 9.44 9.45 34.10 7.50

Rogot2 9.44 9.45 34.10 7.50


Table 9.7 tot_info pci accuracy under three extreme scenarios

tot_info

SBFL Metrics | Normal Scenario | 1 Fail, All Pass | 1 Pass, All Fail | No Pass, All Fail

Naish1 2.99 8.66 5.23 6.33

Naish2 2.99 8.66 5.23 6.33

Jaccard 6.13 8.66 5.23 6.33

Anderberg 6.13 8.66 5.23 6.33

Sorensen-Dice 6.13 8.66 5.23 6.33

Dice 6.13 8.66 5.23 6.33

Tarantula 6.92 8.66 10.55 6.33

QE 6.92 8.66 10.55 13.22

CBI_Inc. 6.92 8.66 36.80 13.22

Simple_Matching 17.15 17.72 5.26 6.33

Sokal 17.15 17.72 5.26 6.33

Rogers&Tanimoto 17.15 17.72 5.26 6.33

Hamming_etc. 17.15 17.72 5.26 6.33

Euclid 17.15 17.72 5.26 6.33

Wong1 6.33 9.72 6.33 6.33

Russel & Rao 6.33 9.72 6.33 6.33

Binary 6.33 9.72 6.33 6.33

Ochiai 5.07 8.66 5.23 6.33

M2 3.99 8.66 5.23 6.33

AMPLE2 6.33 9.72 92.87 6.33

Wong3 10.49 72.37 5.34 6.33

Arithmetic_Mean 5.78 8.66 35.15 6.33

Cohen 6.77 8.66 35.15 6.33

Kulczynski1 12.87 15.17 24.86 58.14

M1 24.05 24.61 25.29 58.14

Ochiai2 9.23 8.66 37.54 58.14

Zoltar 3.02 8.66 5.23 6.33

Ample 9.90 14.48 41.68 6.33

Geometric_Mean 5.70 8.66 35.15 6.33

Harmonic_Mean 5.87 10.48 36.99 6.33

Rogot2 5.87 10.48 36.99 6.33


Table 9.8 print_tokens pci accuracy with and without NRS applied

print_tokens

SBFL Metrics No NRS NRS1 NRS2 NRS3 NRS4 NRS5 NRS6 NRS7

Naish1 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

Naish2 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

Jaccard 3.73 3.02 6.57 6.04 0.36 0.36 0.36 0.36

Anderberg 3.73 3.02 6.57 6.04 0.36 0.36 0.36 0.36

Sorensen-Dice 3.73 3.02 6.57 6.04 0.36 0.36 0.36 0.36

Dice 3.73 3.02 6.57 6.04 0.36 0.36 0.36 0.36

Tarantula 6.93 6.57 7.82 6.93 1.60 1.60 1.95 1.95

QE 6.93 6.57 7.82 6.93 1.60 1.60 1.95 1.95

CBI_Inc. 6.93 6.57 7.82 6.93 1.60 1.60 1.95 1.95

Simple_Matching 13.32 12.61 13.32 12.79 10.12 10.12 10.12 10.12

Sokal 13.32 12.61 13.32 12.79 10.12 10.12 10.12 10.12

Rogers&Tanimoto 13.32 12.61 13.32 12.79 10.12 10.12 10.12 10.12

Hamming_etc. 13.32 12.61 13.32 12.79 10.12 10.12 10.12 10.12

Euclid 13.32 12.61 13.32 12.79 10.12 10.12 10.12 10.12

Wong1 7.46 7.46 8.17 8.17 7.46 7.46 8.17 8.17

Russel & Rao 7.46 7.46 8.17 8.17 7.46 7.46 8.17 8.17

Binary 7.46 7.46 8.17 8.17 7.46 7.46 8.17 8.17

Ochiai 0.36 0.36 0.53 0.53 0.36 0.36 0.36 0.36

M2 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

AMPLE2 7.46 7.46 8.17 8.17 7.46 7.46 8.17 8.17

Wong3 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

Arithmetic_Mean 0.53 0.53 2.31 1.60 0.36 0.36 0.36 0.36

Cohen 6.04 4.09 6.93 6.57 0.36 0.36 0.53 0.53

Kulczynski1 3.73 3.02 6.57 6.04 0.36 0.36 0.36 0.36

M1 13.32 12.61 13.32 12.79 10.12 10.12 10.12 10.12

Ochiai2 0.53 0.53 0.53 0.53 0.36 0.36 0.36 0.36

Zoltar 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

Ample 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

Geometric_Mean 0.53 0.53 0.53 0.53 0.36 0.36 0.36 0.36

Harmonic_Mean 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36

Rogot2 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36


Table 9.9 print_tokens2 pci accuracy with and without NRS applied

print_tokens2

SBFL Metrics No NRS NRS1 NRS2 NRS3 NRS4 NRS5 NRS6 NRS7

Naish1 1.57 1.57 1.60 1.60 1.54 1.54 1.57 1.57

Naish2 1.57 1.57 1.60 1.60 1.54 1.54 1.57 1.57

Jaccard 10.08 10.06 10.28 10.14 6.96 6.96 7.36 7.36

Anderberg 10.08 10.06 10.28 10.14 6.96 6.96 7.36 7.36

Sorensen-Dice 10.08 10.06 10.28 10.14 6.96 6.96 7.36 7.36

Dice 10.08 10.06 10.28 10.14 6.96 6.96 7.36 7.36

Tarantula 10.31 10.31 10.90 10.36 8.26 8.20 8.66 8.68

QE 10.31 10.31 10.90 10.36 8.26 8.20 8.66 8.68

CBI_Inc. 10.31 10.31 10.90 10.36 8.26 8.20 8.66 8.68

Simple_Matching 14.96 14.29 15.08 14.57 12.87 12.84 13.12 12.87

Sokal 14.96 14.29 15.08 14.57 12.87 12.84 13.12 12.87

Rogers&Tanimoto 14.96 14.29 15.08 14.57 12.87 12.84 13.12 12.87

Hamming_etc. 14.96 14.29 15.08 14.57 12.87 12.84 13.12 12.87

Euclid 14.96 14.29 15.08 14.57 12.87 12.84 13.12 12.87

Wong1 10.16 10.16 10.24 10.24 10.16 10.16 10.24 10.24

Russel & Rao 10.16 10.16 10.24 10.24 10.16 10.16 10.24 10.24

Binary 10.16 10.16 10.24 10.24 10.16 10.16 10.24 10.24

Ochiai 5.93 5.73 6.91 6.91 3.20 3.17 4.04 3.96

M2 3.51 3.51 3.87 3.82 2.75 2.75 3.23 3.23

AMPLE2 10.16 10.16 10.24 10.24 10.16 10.16 10.24 10.24

Wong3 1.68 1.68 1.71 1.71 1.57 1.57 1.60 1.60

Arithmetic_Mean 8.43 8.43 8.74 8.62 6.57 6.57 6.80 6.80

Cohen 10.31 10.28 10.62 10.31 8.12 8.12 8.29 8.29

Kulczynski1 10.08 10.06 10.28 10.14 6.96 6.96 7.36 7.36

M1 14.96 14.29 15.08 14.57 12.87 12.84 13.12 12.87

Ochiai2 10.17 10.06 10.28 10.25 10.36 10.36 10.48 10.48

Zoltar 1.57 1.57 1.60 1.60 1.54 1.54 1.57 1.57

Ample 10.03 10.03 9.69 9.89 9.38 9.35 9.44 9.41

Geometric_Mean 8.51 8.37 8.60 8.60 6.68 6.49 6.96 6.94

Harmonic_Mean 8.93 8.93 8.93 8.93 7.44 7.44 7.73 7.84

Rogot2 8.93 8.93 8.93 8.93 7.44 7.44 7.73 7.84


Table 9.10 replace pci accuracy with and without NRS applied

replace

SBFL Metrics No NRS NRS1 NRS2 NRS3 NRS4 NRS5 NRS6 NRS7

Naish1 2.77 2.77 2.77 2.77 3.65 3.65 3.65 3.65

Naish2 2.08 2.08 2.08 2.08 2.96 2.96 2.96 2.96

Jaccard 3.45 3.36 3.89 3.81 4.65 4.65 5.05 5.05

Anderberg 3.45 3.36 3.89 3.81 4.65 4.65 5.05 5.05

Sorensen-Dice 3.45 3.36 3.89 3.81 4.65 4.65 5.05 5.05

Dice 3.45 3.36 3.89 3.81 4.65 4.65 5.05 5.05

Tarantula 3.69 3.62 4.05 3.99 4.98 4.98 5.21 5.23

QE 3.69 3.62 4.05 3.99 4.98 4.98 5.21 5.23

CBI_Inc. 3.69 3.62 4.05 3.99 4.98 4.98 5.21 5.23

Simple_Matching 14.47 14.10 14.54 14.39 17.09 16.98 17.22 17.09

Sokal 14.47 14.10 14.54 14.39 17.09 16.98 17.22 17.09

Rogers&Tanimoto 14.47 14.10 14.54 14.39 17.09 16.98 17.22 17.09

Hamming_etc. 14.47 14.10 14.54 14.39 17.09 16.98 17.22 17.09

Euclid 14.47 14.10 14.54 14.39 17.09 16.98 17.22 17.09

Wong1 7.10 7.10 7.32 7.32 7.10 7.10 7.32 7.32

Russel & Rao 7.10 7.10 7.32 7.32 7.10 7.10 7.32 7.32

Binary 7.78 7.78 7.99 7.99 7.78 7.78 7.99 7.99

Ochiai 2.16 2.18 2.33 2.23 3.33 3.33 3.35 3.35

M2 1.92 1.92 1.90 1.90 3.08 3.08 3.16 3.16

AMPLE2 7.10 7.10 7.32 7.32 7.10 7.10 7.32 7.32

Wong3 3.91 3.91 6.27 6.27 5.31 5.31 5.33 5.33

Arithmetic_Mean 2.52 2.52 2.88 2.85 3.55 3.56 3.77 3.78

Cohen 3.66 3.58 4.01 3.94 4.86 4.85 5.14 5.13

Kulczynski1 3.45 3.36 3.89 3.81 4.65 4.65 5.05 5.05

M1 14.47 14.10 14.54 14.39 17.09 16.98 17.22 17.09

Ochiai2 3.02 3.03 3.42 3.30 4.91 4.90 5.33 5.32

Zoltar 2.12 2.12 2.12 2.12 3.00 3.00 3.00 3.00

Ample 3.64 3.64 3.63 3.62 6.76 6.74 6.65 6.65

Geometric_Mean 2.39 2.39 2.54 2.52 3.51 3.51 3.57 3.57

Harmonic_Mean 2.34 2.33 2.42 2.47 3.52 3.52 3.68 3.68

Rogot2 2.34 2.33 2.42 2.47 3.52 3.52 3.68 3.68


Table 9.11 schedule pci accuracy with and without NRS applied

schedule

SBFL Metrics No NRS NRS1 NRS2 NRS3 NRS4 NRS5 NRS6 NRS7

Naish1 0.89 0.89 0.89 0.89 0.89 0.89 1.05 0.89

Naish2 0.89 0.89 0.89 0.89 0.89 0.89 1.05 0.89

Jaccard 2.19 1.86 1.05 1.05 1.54 1.54 1.30 1.13

Anderberg 2.19 1.86 1.05 1.05 1.54 1.54 1.30 1.13

Sorensen-Dice 2.19 1.86 1.05 1.05 1.54 1.54 1.30 1.13

Dice 2.19 1.86 1.05 1.05 1.54 1.54 1.30 1.13

Tarantula 2.19 2.19 1.05 1.05 1.54 1.70 1.30 1.13

QE 2.19 2.19 1.05 1.05 1.54 1.70 1.30 1.13

CBI_Inc. 2.19 2.19 1.05 1.05 1.54 1.70 1.30 1.13

Simple_Matching 4.78 4.21 4.78 4.37 6.07 6.07 6.24 6.07

Sokal 4.78 4.21 4.78 4.37 6.07 6.07 6.24 6.07

Rogers&Tanimoto 4.78 4.21 4.78 4.37 6.07 6.07 6.24 6.07

Hamming_etc. 4.78 4.21 4.78 4.37 6.07 6.07 6.24 6.07

Euclid 4.78 4.21 4.78 4.37 6.07 6.07 6.24 6.07

Wong1 12.63 12.63 12.63 12.63 12.63 12.63 12.63 12.63

Russel & Rao 12.63 12.63 12.63 12.63 12.63 12.63 12.63 12.63

Binary 12.63 12.63 12.63 12.63 12.63 12.63 12.63 12.63

Ochiai 1.46 1.54 1.05 1.05 0.89 0.89 1.22 1.05

M2 0.89 0.89 0.89 0.89 0.89 0.89 1.05 0.89

AMPLE2 12.63 12.63 12.63 12.63 12.63 12.63 12.63 12.63

Wong3 0.89 0.89 22.06 22.06 0.89 0.89 22.22 0.89

Arithmetic_Mean 1.13 1.62 1.05 1.05 1.13 1.13 1.30 1.13

Cohen 2.19 2.19 1.05 1.05 1.54 1.54 1.30 1.13

Kulczynski1 2.19 1.86 1.05 1.05 1.54 1.54 1.30 1.13

M1 4.78 4.21 4.78 4.37 6.07 6.07 6.24 6.07

Ochiai2 2.19 1.62 1.05 1.05 1.46 1.46 1.30 1.13

Zoltar 0.89 0.89 0.89 0.89 0.89 0.89 1.05 0.89

Ample 1.54 1.46 2.27 0.89 0.89 0.89 1.05 0.89

Geometric_Mean 1.46 1.54 1.05 1.05 0.89 0.89 1.22 1.13

Harmonic_Mean 1.46 1.38 0.81 0.73 0.81 0.81 0.97 0.81

Rogot2 1.46 1.38 0.81 0.73 0.81 0.81 0.97 0.81


Table 9.12 schedule2 pci accuracy with and without NRS applied

schedule2

SBFL Metrics No NRS NRS1 NRS2 NRS3 NRS4 NRS5 NRS6 NRS7

Naish1 18.30 18.30 19.40 19.40 17.53 17.53 18.63 18.63

Naish2 18.30 18.30 19.40 19.40 17.53 17.53 18.63 18.63

Jaccard 24.97 25.05 22.61 22.70 24.85 24.85 22.66 22.66

Anderberg 24.97 25.05 22.61 22.70 24.85 24.85 22.66 22.66

Sorensen-Dice 24.97 25.05 22.61 22.70 24.85 24.85 22.66 22.66

Dice 24.97 25.05 22.61 22.70 24.85 24.85 22.66 22.66

Tarantula 24.97 25.14 22.70 22.70 25.42 25.42 22.70 22.70

QE 24.97 25.14 22.70 22.70 25.42 25.42 22.70 22.70

CBI_Inc. 24.97 25.14 22.70 22.70 25.42 25.42 22.70 22.70

Simple_Matching 29.86 29.86 29.86 29.86 29.08 29.08 29.08 29.08

Sokal 29.86 29.86 29.86 29.86 29.08 29.08 29.08 29.08

Rogers&Tanimoto 29.86 29.86 29.86 29.86 29.08 29.08 29.08 29.08

Hamming_etc. 29.86 29.86 29.86 29.86 29.08 29.08 29.08 29.08

Euclid 29.86 29.86 29.86 29.86 29.08 29.08 29.08 29.08

Wong1 16.80 16.80 17.25 17.25 16.80 16.80 17.25 17.25

Russel & Rao 16.80 16.80 17.25 17.25 16.80 16.80 17.25 17.25

Binary 16.80 16.80 17.25 17.25 16.80 16.80 17.25 17.25

Ochiai 21.80 21.80 20.46 20.74 21.56 21.56 20.38 20.38

M2 21.31 21.51 20.38 20.58 20.78 20.95 20.05 20.05

AMPLE2 16.80 16.80 17.25 17.25 16.80 16.80 17.25 17.25

Wong3 18.30 18.30 26.85 26.85 17.53 17.53 25.89 25.89

Arithmetic_Mean 22.21 22.17 20.54 20.62 24.41 24.41 21.68 21.68

Cohen 24.97 25.14 22.70 22.70 25.18 25.42 22.70 22.70

Kulczynski1 24.97 25.05 22.61 22.70 24.85 24.85 22.66 22.66

M1 29.86 29.86 29.86 29.86 29.08 29.08 29.08 29.08

Ochiai2 28.15 28.15 26.72 26.72 28.23 28.23 27.38 27.38

Zoltar 18.30 18.30 19.40 19.40 17.53 17.53 18.63 18.63

Ample 28.64 28.55 28.80 28.96 27.86 28.11 28.19 28.19

Geometric_Mean 23.75 23.92 21.84 21.88 24.49 24.49 22.41 22.41

Harmonic_Mean 24.93 25.10 22.65 22.65 24.77 25.26 22.57 22.57

Rogot2 24.93 25.10 22.65 22.65 24.77 25.26 22.57 22.57


Table 9.13 tcas pci accuracy with and without NRS applied

tcas

SBFL Metrics No NRS NRS1 NRS2 NRS3 NRS4 NRS5 NRS6 NRS7

Naish1 14.45 14.45 17.92 17.92 14.45 14.45 17.92 17.92

Naish2 14.45 14.45 17.92 17.92 14.45 14.45 17.92 17.92

Jaccard 19.08 19.08 17.92 17.92 17.34 18.50 17.92 17.92

Anderberg 19.08 19.08 17.92 17.92 17.34 18.50 17.92 17.92

Sorensen-Dice 19.08 19.08 17.92 17.92 17.34 18.50 17.92 17.92

Dice 19.08 19.08 17.92 17.92 17.34 18.50 17.92 17.92

Tarantula 19.08 19.08 17.92 17.92 19.08 19.08 17.92 17.92

QE 19.08 19.08 17.92 17.92 19.08 19.08 17.92 17.92

CBI_Inc. 19.08 19.08 17.92 17.92 19.08 19.08 17.92 17.92

Simple_Matching 25.43 25.43 25.43 25.43 24.86 24.28 25.43 24.86

Sokal 25.43 25.43 25.43 25.43 24.86 24.28 25.43 24.86

Rogers&Tanimoto 25.43 25.43 25.43 25.43 24.86 24.28 25.43 24.86

Hamming_etc. 25.43 25.43 25.43 25.43 24.86 24.28 25.43 24.86

Euclid 25.43 25.43 25.43 25.43 24.86 24.28 25.43 24.86

Wong1 12.14 12.14 14.45 14.45 12.14 12.14 14.45 14.45

Russel & Rao 12.14 12.14 14.45 14.45 12.14 12.14 14.45 14.45

Binary 12.14 12.14 14.45 14.45 12.14 12.14 14.45 14.45

Ochiai 16.47 17.92 17.92 17.92 15.03 16.76 17.92 17.92

M2 15.61 15.90 17.92 17.92 14.45 14.45 17.92 17.92

AMPLE2 12.14 12.14 14.45 14.45 12.14 12.14 14.45 14.45

Wong3 14.45 14.45 65.17 65.61 15.03 15.61 84.97 84.97

Arithmetic_Mean 18.50 18.50 17.92 17.92 17.92 19.08 17.92 17.92

Cohen 19.08 19.08 17.92 17.92 19.08 19.08 17.92 17.92

Kulczynski1 19.08 19.08 17.92 17.34 17.34 18.50 17.92 17.34

M1 25.43 25.43 25.43 24.86 24.86 24.28 25.43 24.28

Ochiai2 20.81 20.81 17.92 17.92 19.65 19.08 17.92 17.92

Zoltar 14.45 15.61 17.92 17.92 14.45 15.03 17.92 17.92

Ample 21.97 23.70 22.25 23.41 18.50 19.65 20.81 20.81

Geometric_Mean 19.08 19.08 17.92 17.92 17.92 19.08 17.92 17.92

Harmonic_Mean 19.08 19.08 17.92 17.92 17.92 19.08 17.92 17.92

Rogot2 19.08 19.08 17.92 17.92 17.92 19.08 17.92 17.92


Table 9.14 tot_info pci accuracy with and without NRS applied

tot_info

SBFL Metrics No NRS NRS1 NRS2 NRS3 NRS4 NRS5 NRS6 NRS7

Naish1 3.57 3.57 3.98 3.98 3.47 3.47 3.87 3.87

Naish2 3.57 3.57 3.98 3.98 3.47 3.47 3.87 3.87

Jaccard 7.62 6.84 8.39 7.25 5.21 5.17 5.52 5.52

Anderberg 7.62 6.84 8.39 7.25 5.21 5.17 5.52 5.52

Sorensen-Dice 7.62 6.84 8.39 7.25 5.21 5.17 5.52 5.52

Dice 7.62 6.84 8.39 7.25 5.21 5.17 5.52 5.52

Tarantula 8.43 7.71 9.04 8.29 6.19 6.53 6.51 6.35

QE 8.43 7.71 9.04 8.29 6.19 6.53 6.51 6.35

CBI_Inc. 8.43 7.71 9.04 8.29 6.19 6.53 6.51 6.35

Simple_Matching 17.38 16.63 17.91 16.70 15.87 14.37 16.34 15.87

Sokal 17.38 16.63 17.91 16.70 15.87 14.37 16.34 15.87

Rogers&Tanimoto 17.38 16.63 17.91 16.70 15.87 14.37 16.34 15.87

Hamming_etc. 17.38 16.63 17.91 16.70 15.87 14.37 16.34 15.87

Euclid 17.38 16.63 17.91 16.70 15.87 14.37 16.34 15.87

Wong1 6.86 6.86 6.90 6.90 6.86 6.86 6.90 6.90

Russel & Rao 6.86 6.86 6.90 6.90 6.86 6.86 6.90 6.90

Binary 6.86 6.86 6.90 6.90 6.86 6.86 6.90 6.90

Ochiai 6.30 5.67 6.54 6.00 4.47 4.47 4.91 4.91

M2 4.84 4.70 5.38 5.08 4.28 4.28 4.80 4.80

AMPLE2 6.86 6.86 6.90 6.90 6.86 6.86 6.90 6.90

Wong3 3.57 3.57 9.06 9.06 3.47 3.47 3.87 3.87

Arithmetic_Mean 7.14 6.56 7.16 6.77 5.10 5.17 5.63 5.54

Cohen 8.29 7.57 8.95 8.11 6.12 6.09 6.32 6.23

Kulczynski1 7.62 6.84 8.39 7.25 5.21 5.17 5.52 5.52

M1 17.38 16.63 17.91 16.70 15.87 14.37 16.34 15.87

Ochiai2 11.29 10.50 11.38 10.19 10.20 9.99 9.75 9.62

Zoltar 3.61 3.61 4.06 4.06 3.47 3.47 3.94 3.94

Ample 9.03 8.25 9.06 8.11 9.38 9.73 9.20 9.10

Geometric_Mean 7.04 6.60 7.39 6.72 4.86 4.82 5.58 5.58

Harmonic_Mean 7.23 6.39 7.83 6.93 5.70 5.70 6.05 5.95

Rogot2 7.23 6.39 7.83 6.93 5.70 5.70 6.05 5.95


Table 9.15 Analysis of pci for 62 faulty versions with NRS 1

SBFL Metrics | Without NRS | NRS1 | Frequency of Improvement (%) | Frequency of Equivalent (%) | Frequency of Decline (%) | Average Improvement | Average Decline

Naish1 5.44 5.44 0.00% 100.00% 0.00% 0.00 0.00

Naish2 5.17 5.17 0.00% 100.00% 0.00% 0.00 0.00

Jaccard 8.87 8.64 19.35% 72.58% 8.06% 1.39 0.49

Anderberg 8.87 8.64 19.35% 72.58% 8.06% 1.39 0.49

Sorensen-Dice 8.87 8.64 19.35% 72.58% 8.06% 1.39 0.49

Dice 8.87 8.64 19.35% 72.58% 8.06% 1.39 0.49

Tarantula 9.22 9.05 11.29% 82.26% 6.45% 1.92 0.63

QE 9.22 9.05 11.29% 82.26% 6.45% 1.92 0.63

CBI_Inc. 9.22 9.05 11.29% 82.26% 6.45% 1.92 0.63

Simple_Matching 17.39 16.96 25.81% 74.19% 0.00% 1.67 0.00

Sokal 17.39 16.96 25.81% 74.19% 0.00% 1.67 0.00

Rogers&Tanimoto 17.39 16.96 25.81% 74.19% 0.00% 1.67 0.00

Hamming_etc. 17.39 16.96 25.81% 74.19% 0.00% 1.67 0.00

Euclid 17.39 16.96 25.81% 74.19% 0.00% 1.67 0.00

Wong1 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Russel & Rao 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Binary 9.52 9.52 0.00% 100.00% 0.00% 0.00 0.00

Ochiai 6.91 6.86 12.90% 74.19% 12.90% 1.34 0.88

M2 6.07 6.08 1.61% 93.55% 4.84% 1.97 0.93

AMPLE2 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Wong3 5.92 5.92 0.00% 100.00% 0.00% 0.00 0.00

Arithmetic_Mean 7.70 7.59 11.29% 82.26% 6.45% 1.44 0.73

Cohen 9.17 8.96 12.90% 80.65% 6.45% 1.93 0.63

Kulczynski1 8.87 8.64 19.35% 72.58% 8.06% 1.39 0.49

M1 17.39 16.96 25.81% 74.19% 0.00% 1.67 0.00

Ochiai2 10.01 9.79 12.90% 85.48% 1.61% 1.70 0.18

Zoltar 5.19 5.26 0.00% 93.55% 6.45% 0.00 1.16

Ample 9.83 9.75 12.90% 82.26% 4.84% 1.53 2.48

Geometric_Mean 7.89 7.80 9.68% 80.65% 9.68% 1.68 0.75

Harmonic_Mean 8.11 7.93 12.90% 85.48% 1.61% 1.55 1.30

Rogot2 8.11 7.93 12.90% 85.48% 1.61% 1.55 1.30
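Tables 9.15 to 9.20 summarise, per metric, how often a noise-reduction strategy (NRS) improved, left unchanged, or worsened the pci across the 62 faulty versions. The tally can be sketched as below; this is an illustrative helper of our own (the function name, tolerance, and dictionary keys are assumptions, not the thesis's code), where a lower pci counts as an improvement:

```python
def summarise_nrs(before, after, tol=1e-9):
    """Classify per-version pci changes (lower pci = better) and report
    frequencies (%) plus average magnitudes of improvement and decline."""
    n = len(before)
    gains = [b - a for b, a in zip(before, after) if b - a > tol]   # pci dropped
    losses = [a - b for b, a in zip(before, after) if a - b > tol]  # pci rose
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "improvement_freq": 100.0 * len(gains) / n,
        "equivalent_freq": 100.0 * (n - len(gains) - len(losses)) / n,
        "decline_freq": 100.0 * len(losses) / n,
        "avg_improvement": avg(gains),
        "avg_decline": avg(losses),
    }

# Example: over three versions, one improves by 1.0, one is unchanged,
# one declines by 1.0.
summary = summarise_nrs([5.0, 5.0, 5.0], [4.0, 5.0, 6.0])
print(summary)
```

Note that the average improvement and average decline are computed only over the versions in which a change occurred, which is why they can be nonzero even when most versions are classified as equivalent.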


Table 9.16 Analysis of pci for 62 faulty versions with NRS 2

SBFL Metrics | Without NRS | NRS2 | Frequency of Improvement (%) | Frequency of Equivalent (%) | Frequency of Decline (%) | Average Improvement | Average Decline

Naish1 5.44 5.90 0.00% 80.65% 19.35% 0.00 2.38

Naish2 5.17 5.63 0.00% 80.65% 19.35% 0.00 2.38

Jaccard 8.87 8.86 19.35% 54.84% 25.81% 2.40 1.76

Anderberg 8.87 8.86 19.35% 54.84% 25.81% 2.40 1.76

Sorensen-Dice 8.87 8.86 19.35% 54.84% 25.81% 2.40 1.76

Dice 8.87 8.86 19.35% 54.84% 25.81% 2.40 1.76

Tarantula 9.22 9.17 20.97% 50.00% 29.03% 2.37 1.52

QE 9.22 9.17 20.97% 50.00% 29.03% 2.37 1.52

CBI_Inc. 9.22 9.17 20.97% 50.00% 29.03% 2.37 1.52

Simple_Matching 17.39 17.55 0.00% 90.32% 9.68% 0.00 1.69

Sokal 17.39 17.55 0.00% 90.32% 9.68% 0.00 1.69

Rogers&Tanimoto 17.39 17.55 0.00% 90.32% 9.68% 0.00 1.69

Hamming_etc. 17.39 17.55 0.00% 90.32% 9.68% 0.00 1.69

Euclid 17.39 17.55 0.00% 90.32% 9.68% 0.00 1.69

Wong1 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53

Russel & Rao 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53

Binary 9.52 9.84 0.00% 79.03% 20.97% 0.00 1.53

Ochiai 6.91 7.05 16.13% 45.16% 38.71% 1.73 1.08

M2 6.07 6.26 16.13% 61.29% 22.58% 1.21 1.69

AMPLE2 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53

Wong3 5.92 13.51 0.00% 75.81% 24.19% 0.00 31.38

Arithmetic_Mean 7.70 7.66 20.97% 46.77% 32.26% 1.82 1.04

Cohen 9.17 9.08 22.58% 53.23% 24.19% 2.17 1.68

Kulczynski1 8.87 8.86 19.35% 54.84% 25.81% 2.40 1.76

M1 17.39 17.55 0.00% 90.32% 9.68% 0.00 1.69

Ochiai2 10.01 9.78 33.87% 51.61% 14.52% 1.58 2.10

Zoltar 5.19 5.66 0.00% 77.42% 22.58% 0.00 2.09

Ample 9.83 9.87 20.97% 53.23% 25.81% 1.44 1.33

Geometric_Mean 7.89 7.70 22.58% 53.23% 24.19% 1.82 0.89

Harmonic_Mean 8.11 7.88 20.97% 58.06% 20.97% 2.37 1.26

Rogot2 8.11 7.88 20.97% 58.06% 20.97% 2.37 1.26


Table 9.17 Analysis of pci for 62 faulty versions with NRS 3

SBFL Metrics | Without NRS | NRS3 | Frequency of Improvement (%) | Frequency of Equivalent (%) | Frequency of Decline (%) | Average Improvement | Average Decline

Naish1 5.44 5.90 0.00% 80.65% 19.35% 0.00 2.38

Naish2 5.17 5.63 0.00% 80.65% 19.35% 0.00 2.38

Jaccard 8.87 8.55 25.81% 53.23% 20.97% 2.39 1.43

Anderberg 8.87 8.55 25.81% 53.23% 20.97% 2.39 1.43

Sorensen-Dice 8.87 8.55 25.81% 53.23% 20.97% 2.39 1.43

Dice 8.87 8.55 25.81% 53.23% 20.97% 2.39 1.43

Tarantula 9.22 8.90 25.81% 50.00% 24.19% 2.54 1.36

QE 9.22 8.90 25.81% 50.00% 24.19% 2.54 1.36

CBI_Inc. 9.22 8.90 25.81% 50.00% 24.19% 2.54 1.36

Simple_Matching 17.39 17.13 16.13% 80.65% 3.23% 1.75 0.80

Sokal 17.39 17.13 16.13% 80.65% 3.23% 1.75 0.80

Rogers&Tanimoto 17.39 17.13 16.13% 80.65% 3.23% 1.75 0.80

Hamming_etc. 17.39 17.13 16.13% 80.65% 3.23% 1.75 0.80

Euclid 17.39 17.13 16.13% 80.65% 3.23% 1.75 0.80

Wong1 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53

Russel & Rao 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53

Binary 9.52 9.84 0.00% 79.03% 20.97% 0.00 1.53

Ochiai 6.91 6.93 20.97% 46.77% 32.26% 1.63 1.09

M2 6.07 6.21 12.90% 64.52% 22.58% 1.20 1.30

AMPLE2 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53

Wong3 5.92 13.54 0.00% 75.81% 24.19% 0.00 31.49

Arithmetic_Mean 7.70 7.55 25.81% 50.00% 24.19% 1.67 1.13

Cohen 9.17 8.83 27.42% 53.23% 19.35% 2.41 1.66

Kulczynski1 8.87 8.51 29.03% 50.00% 20.97% 2.25 1.43

M1 17.39 17.09 22.58% 74.19% 3.23% 1.41 0.80

Ochiai2 10.01 9.45 41.94% 45.16% 12.90% 1.65 1.07

Zoltar 5.19 5.66 0.00% 77.42% 22.58% 0.00 2.09

Ample 9.83 9.70 29.03% 51.61% 19.35% 1.48 1.56

Geometric_Mean 7.89 7.54 29.03% 53.23% 17.74% 1.81 1.01

Harmonic_Mean 8.11 7.69 27.42% 48.39% 24.19% 2.31 0.88

Rogot2 8.11 7.69 27.42% 48.39% 24.19% 2.31 0.88


Table 9.18 Analysis of pci for 62 faulty versions with NRS 4

SBFL Metrics | Without NRS | NRS4 | Frequency of Improvement (%) | Frequency of Equivalent (%) | Frequency of Decline (%) | Average Improvement | Average Decline

Naish1 5.44 5.67 11.29% 77.42% 11.29% 2.61 4.64

Naish2 5.17 5.40 11.29% 77.42% 11.29% 2.61 4.64

Jaccard 8.87 8.24 48.39% 29.03% 22.58% 3.22 4.14

Anderberg 8.87 8.24 48.39% 29.03% 22.58% 3.22 4.14

Sorensen-Dice 8.87 8.24 48.39% 29.03% 22.58% 3.22 4.14

Dice 8.87 8.24 48.39% 29.03% 22.58% 3.22 4.14

Tarantula 9.22 8.95 35.48% 32.26% 32.26% 3.65 3.17

QE 9.22 8.95 35.48% 32.26% 32.26% 3.65 3.17

CBI_Inc. 9.22 8.95 35.48% 32.26% 32.26% 3.65 3.17

Simple_Matching 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72

Sokal 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72

Rogers&Tanimoto 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72

Hamming_etc. 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72

Euclid 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72

Wong1 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Russel & Rao 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Binary 9.52 9.52 0.00% 100.00% 0.00% 0.00 0.00

Ochiai 6.91 6.51 43.55% 40.32% 16.13% 2.63 4.63

M2 6.07 6.18 30.65% 50.00% 19.35% 1.83 3.47

AMPLE2 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Wong3 5.92 6.38 16.13% 64.52% 19.35% 1.90 3.98

Arithmetic_Mean 7.70 7.69 43.55% 37.10% 19.35% 1.97 4.37

Cohen 9.17 8.82 38.71% 33.87% 27.42% 3.50 3.67

Kulczynski1 8.87 8.24 48.39% 29.03% 22.58% 3.22 4.14

M1 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72

Ochiai2 10.01 10.44 40.32% 35.48% 24.19% 2.41 5.82

Zoltar 5.19 5.41 12.90% 75.81% 11.29% 2.35 4.64

Ample 9.83 10.74 35.48% 27.42% 37.10% 3.07 5.39

Geometric_Mean 7.89 7.63 40.32% 40.32% 19.35% 2.64 4.16

Harmonic_Mean 8.11 7.95 43.55% 20.97% 35.48% 2.58 2.70

Rogot2 8.11 7.95 43.55% 20.97% 35.48% 2.58 2.70


Table 9.19 Analysis of pci for 62 faulty versions with NRS 5

SBFL Metrics | Without NRS | NRS5 | Frequency of Improvement (%) | Frequency of Equivalent (%) | Frequency of Decline (%) | Average Improvement | Average Decline

Naish1 5.44 5.67 11.29% 77.42% 11.29% 2.61 4.64

Naish2 5.17 5.40 11.29% 77.42% 11.29% 2.61 4.64

Jaccard 8.87 8.31 45.16% 29.03% 25.81% 3.35 3.69

Anderberg 8.87 8.31 45.16% 29.03% 25.81% 3.35 3.69

Sorensen-Dice 8.87 8.31 45.16% 29.03% 25.81% 3.35 3.69

Dice 8.87 8.31 45.16% 29.03% 25.81% 3.35 3.69

Tarantula 9.22 9.03 35.48% 30.65% 33.87% 3.49 3.08

QE 9.22 9.03 35.48% 30.65% 33.87% 3.49 3.08

CBI_Inc. 9.22 9.03 35.48% 30.65% 33.87% 3.49 3.08

Simple_Matching 17.39 17.32 41.94% 19.35% 38.71% 3.73 3.86

Sokal 17.39 17.32 41.94% 19.35% 38.71% 3.73 3.86

Rogers&Tanimoto 17.39 17.32 41.94% 19.35% 38.71% 3.73 3.86

Hamming_etc. 17.39 17.32 41.94% 19.35% 38.71% 3.73 3.86

Euclid 17.39 17.32 41.94% 19.35% 38.71% 3.73 3.86

Wong1 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Russel & Rao 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Binary 9.52 9.52 0.00% 100.00% 0.00% 0.00 0.00

Ochiai 6.91 6.62 40.32% 40.32% 19.35% 2.67 4.05

M2 6.07 6.20 30.65% 50.00% 19.35% 1.83 3.58

AMPLE2 9.24 9.24 0.00% 100.00% 0.00% 0.00 0.00

Wong3 5.92 6.42 16.13% 64.52% 19.35% 1.90 4.17

Arithmetic_Mean 7.70 7.79 38.71% 38.71% 22.58% 2.07 3.91

Cohen 9.17 8.84 38.71% 33.87% 27.42% 3.43 3.66

Kulczynski1 8.87 8.31 45.16% 29.03% 25.81% 3.35 3.69

M1 17.39 17.32 41.94% 19.35% 38.71% 3.73 3.86

Ochiai2 10.01 10.35 40.32% 35.48% 24.19% 2.63 5.82

Zoltar 5.19 5.45 12.90% 69.35% 17.74% 2.35 3.16

Ample 9.83 10.92 35.48% 27.42% 37.10% 2.80 5.60

Geometric_Mean 7.89 7.68 41.94% 33.87% 24.19% 2.54 3.51

Harmonic_Mean 8.11 8.08 43.55% 17.74% 38.71% 2.35 2.57

Rogot2 8.11 8.08 43.55% 17.74% 38.71% 2.35 2.57


Table 9.20 Analysis of pci for 62 faulty versions with NRS 6

SBFL Metrics   Without NRS   NRS6   Frequency of Improvement (%)   Frequency of Equivalent (%)   Frequency of Decline (%)   Average Improvement   Average Decline
Naish1 5.44 6.14 9.68% 61.29% 29.03% 2.12 3.11
Naish2 5.17 5.86 9.68% 61.29% 29.03% 2.12 3.11
Jaccard 8.87 8.26 46.77% 32.26% 20.97% 3.54 5.01
Anderberg 8.87 8.26 46.77% 32.26% 20.97% 3.54 5.01
Sorensen-Dice 8.87 8.26 46.77% 32.26% 20.97% 3.54 5.01
Dice 8.87 8.26 46.77% 32.26% 20.97% 3.54 5.01
Tarantula 9.22 8.73 38.71% 30.65% 30.65% 4.10 3.55
QE 9.22 8.73 38.71% 30.65% 30.65% 4.10 3.55
CBI_Inc. 9.22 8.73 38.71% 30.65% 30.65% 4.10 3.55
Simple_Matching 17.39 17.98 33.87% 25.81% 40.32% 2.88 3.88
Sokal 17.39 17.98 33.87% 25.81% 40.32% 2.88 3.88
Rogers&Tanimoto 17.39 17.98 33.87% 25.81% 40.32% 2.88 3.88
Hamming_etc. 17.39 17.98 33.87% 25.81% 40.32% 2.88 3.88
Euclid 17.39 17.98 33.87% 25.81% 40.32% 2.88 3.88
Wong1 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53
Russel & Rao 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53
Binary 9.52 9.84 0.00% 79.03% 20.97% 0.00 1.53
Ochiai 6.91 6.76 38.71% 37.10% 24.19% 2.52 3.41
M2 6.07 6.52 25.81% 50.00% 24.19% 1.67 3.66
AMPLE2 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53
Wong3 5.92 13.11 12.90% 53.23% 33.87% 1.67 21.86
Arithmetic_Mean 7.70 7.58 40.32% 35.48% 24.19% 2.36 3.43
Cohen 9.17 8.59 43.55% 32.26% 24.19% 3.74 4.35
Kulczynski1 8.87 8.26 46.77% 32.26% 20.97% 3.54 5.01
M1 17.39 17.98 33.87% 25.81% 40.32% 2.88 3.88
Ochiai2 10.01 10.29 45.16% 30.65% 24.19% 2.81 6.42
Zoltar 5.19 5.89 9.68% 59.68% 30.65% 2.12 2.97
Ample 9.83 10.86 33.87% 29.03% 37.10% 2.45 5.02
Geometric_Mean 7.89 7.60 40.32% 37.10% 22.58% 2.80 3.71
Harmonic_Mean 8.11 7.84 45.16% 25.81% 29.03% 2.68 3.25
Rogot2 8.11 7.84 45.16% 25.81% 29.03% 2.68 3.25


Table 9.21 Analysis of pci for 62 faulty versions with NRS 7

SBFL Metrics   Without NRS   NRS7   Frequency of Improvement (%)   Frequency of Equivalent (%)   Frequency of Decline (%)   Average Improvement   Average Decline
Naish1 5.44 6.13 9.68% 62.90% 27.42% 2.12 3.26
Naish2 5.17 5.85 9.68% 62.90% 27.42% 2.12 3.26
Jaccard 8.87 8.25 46.77% 32.26% 20.97% 3.56 5.01
Anderberg 8.87 8.25 46.77% 32.26% 20.97% 3.56 5.01
Sorensen-Dice 8.87 8.25 46.77% 32.26% 20.97% 3.56 5.01
Dice 8.87 8.25 46.77% 32.26% 20.97% 3.56 5.01
Tarantula 9.22 8.69 40.32% 27.42% 32.26% 4.04 3.41
QE 9.22 8.69 40.32% 27.42% 32.26% 4.04 3.41
CBI_Inc. 9.22 8.69 40.32% 27.42% 32.26% 4.04 3.41
Simple_Matching 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72
Sokal 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72
Rogers&Tanimoto 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72
Hamming_etc. 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72
Euclid 17.39 17.74 40.32% 19.35% 40.32% 2.85 3.72
Wong1 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53
Russel & Rao 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53
Binary 9.52 9.84 0.00% 79.03% 20.97% 0.00 1.53
Ochiai 6.91 6.75 38.71% 37.10% 24.19% 2.56 3.41
M2 6.07 6.51 25.81% 51.61% 22.58% 1.67 3.89
AMPLE2 9.24 9.56 0.00% 79.03% 20.97% 0.00 1.53
Wong3 5.92 12.08 12.90% 54.84% 32.26% 1.67 19.76
Arithmetic_Mean 7.70 7.56 41.94% 35.48% 22.58% 2.32 3.66
Cohen 9.17 8.56 43.55% 32.26% 24.19% 3.80 4.33
Kulczynski1 8.87 8.22 50.00% 29.03% 20.97% 3.41 5.01
M1 17.39 17.70 40.32% 19.35% 40.32% 2.94 3.72
Ochiai2 10.01 10.25 45.16% 30.65% 24.19% 2.89 6.41
Zoltar 5.19 5.89 9.68% 61.29% 29.03% 2.12 3.11
Ample 9.83 10.83 33.87% 30.65% 35.48% 2.52 5.22
Geometric_Mean 7.89 7.59 40.32% 35.48% 24.19% 2.83 3.48
Harmonic_Mean 8.11 7.83 43.55% 27.42% 29.03% 2.82 3.25
Rogot2 8.11 7.83 43.55% 27.42% 29.03% 2.82 3.25
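The comparison columns in these tables (frequency of improvement, equivalence, and decline, plus the average magnitudes of improvement and decline) can be derived from the per-version pci scores of each SBFL metric with and without NRS. The following is a minimal sketch of that tabulation, not the thesis's own tooling; the function and variable names (`compare_pci`, `pci_without`, `pci_with`) are illustrative assumptions.

```python
def compare_pci(pci_without, pci_with):
    """Tabulate how pci changes across faulty versions when NRS is applied.

    pci_without / pci_with: lists of the percentage of code inspected
    for each faulty version, without and with NRS (illustrative inputs).
    """
    # An improvement means less code had to be inspected (pci dropped);
    # a decline means more code had to be inspected (pci rose).
    improvements = [a - b for a, b in zip(pci_without, pci_with) if b < a]
    declines = [b - a for a, b in zip(pci_without, pci_with) if b > a]
    n = len(pci_without)
    return {
        "freq_improvement": 100.0 * len(improvements) / n,
        "freq_equivalent": 100.0 * (n - len(improvements) - len(declines)) / n,
        "freq_decline": 100.0 * len(declines) / n,
        "avg_improvement": sum(improvements) / len(improvements) if improvements else 0.0,
        "avg_decline": sum(declines) / len(declines) if declines else 0.0,
    }
```

For example, a metric whose pci goes from [10, 8, 5, 7] to [8, 8, 6, 7] over four versions would show 25% improvement, 50% equivalence, and 25% decline, with an average improvement of 2.0 and an average decline of 1.0.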