
Noname manuscript No. (will be inserted by the editor)

Improving Change Prediction Models with Code Smell-Related Information

Gemma Catolino · Fabio Palomba · Francesca Arcelli Fontana · Andrea De Lucia · Andy Zaidman · Filomena Ferrucci

Received: date / Accepted: date

Abstract Code smells represent sub-optimal implementation choices applied by developers when evolving software systems. The negative impact of code smells has been widely investigated in the past: besides developers' productivity and ability to comprehend source code, researchers empirically showed that the presence of code smells heavily impacts the change-proneness of the affected classes. On the basis of these findings, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, i.e., models that indicate to developers which classes are more likely to change in the future, so that they may apply preventive maintenance actions. Specifically, we exploit the so-called intensity index, a previously defined metric that captures the severity of a code smell, and evaluate its contribution when added as an additional feature in the context of three state-of-the-art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with that of an alternative technique that considers the previously defined antipattern metrics, namely a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than that of the baselines and (ii) the intensity is a more powerful metric with respect to the alternative smell-related ones. Nevertheless, we observed some complementarity between the sets of change-prone and non-change-prone classes correctly classified by the models relying on intensity and antipattern metrics: for this reason, we devise and evaluate a smell-aware combined change prediction model including product, process, developer-based, and smell-related features. We show that this model has an F-Measure that is up to 20% higher than the existing state-of-the-art models.

Gemma Catolino, Andrea De Lucia, Filomena Ferrucci
University of Salerno, Italy
E-mail: [email protected], [email protected], [email protected]

Fabio Palomba
University of Zurich, Switzerland
E-mail: palomba@ifi.uzh.ch

Francesca Arcelli Fontana
University of Milano-Bicocca, Italy
E-mail: [email protected]

Andy Zaidman
Delft University of Technology, The Netherlands
E-mail: [email protected]




Keywords Change Prediction · Code Smells · Empirical Study

1 Introduction

During software evolution, change is unavoidable. Indeed, software systems are continuously modified in order to be adapted to changing needs, improved in performance or maintainability, or fixed to remove potential bugs [58]. As a consequence, they become more complex, possibly eroding the original design with a subsequent reduction of their overall maintainability [88]. In this context, predicting the source code components having a higher likelihood to change in the future represents an important activity to allow developers to plan preventive maintenance operations such as refactoring [35] or peer code reviews [7].

For this reason, the research community proposed several approaches to allow developers to control these changes [20, 27, 28, 38, 54, 59, 60, 94, 128]. Such approaches rely on machine learning models exploiting several predictors that capture different characteristics of classes, i.e., structural, process, and developer-related features.

Despite the good performance shown by those existing models, recent studies [52, 83] explored new factors contributing to the change-proneness of classes, finding that it is strongly influenced by the presence of the so-called bad code smells [35], i.e., sub-optimal design and/or implementation choices applied by practitioners when developing a software system. Specifically, such studies showed that smelly classes are significantly more likely to be the subject of changes than classes not affected by any design problem.

In this paper, we capture such findings and empirically investigate the extent to which smell-related information can be actually useful when considered in the context of the prediction of change-prone classes: our conjecture is that the addition of a measure of code smell severity can improve the performance of existing change prediction models, as it may help in the correct assessment of the change-proneness of classes. By severity, we mean a metric that quantifies how harmful a certain code smell instance is for the design of a source code class. To test our conjecture, we (i) add the intensity index defined by Arcelli Fontana et al. [32] to three state-of-the-art change prediction models based on structural [128], process [27], and developer-related metrics [20] and (ii) evaluate, on 43 releases of 14 large systems, how much such an addition improves the prediction capabilities of the baseline models. We then compare the performance of the intensity-including change prediction models with the one achievable by exploiting alternative smell-related information such as the antipattern metrics defined by Taba et al. [110], which are able to capture historical information on code smell instances (e.g., the recurrence of a certain instance over time). The results show that the addition of the intensity index provides significant improvements in the performance of the baseline change prediction models, with an increase of 10% in terms of F-Measure on average. Furthermore, the intensity-including models work better than the models built by adding the alternative antipattern metrics, although these two types of information are complementary, i.e., they correctly capture the change-proneness of different change-prone smelly classes.

Given such a complementarity, we then further explore the possibility to improve change prediction models by devising a smell-aware combined approach that mixes together the features of the experimented models, i.e., structural, process, developer-, and smell-related information, with the aim of boosting the change-proneness prediction abilities. As a result, we discovered that such a combined model is able to improve by up to 20% the performance of the baseline approaches.

To sum up, the contributions of this paper are the following:

1. A large-scale empirical assessment of the role of the intensity index [32] when predicting change-prone classes;

2. An empirical comparison between the capabilities of the intensity index and the antipattern metrics defined by Taba et al. [110] in the context of change prediction;

3. A novel smell-aware combined change prediction model, which performs more effectively than the state of the art;

4. A replication package that includes all the raw data and working data sets of our study [19].

Structure of the paper. Section 2 discusses the related literature on change prediction models and code smells. Section 3 describes the design of the case study aimed at evaluating the performance of the models, while Section 4 reports the results achieved. Section 5 discusses the threats to the validity of our empirical study. Finally, Section 6 concludes the paper and outlines directions for future work.

2 Related Work

Change-prone classes represent source code components that, for different reasons, tend to change more often than others. This phenomenon has been widely investigated by the research community [52, 68, 26, 14, 105, 83] with the aim of studying the factors contributing to the change-proneness of classes. Among all these studies, Khomh et al. [52] showed that the presence of sub-optimal implementations in Java classes, i.e., code smells, has a strong impact on the likelihood that such classes will often be modified by developers. The results were later confirmed by several studies in the field [22, 75, 83, 107], further highlighting the relevance of code smells for change-proneness. Our work is clearly based on these findings, and aims at providing additional evidence of how code smells can be adopted in the context of prediction models aimed at identifying change-prone classes.

Keeping track of classes that are more prone to change can be relevant for two main reasons: on the one hand, to create awareness among developers about the fact that these classes tend to change frequently; on the other hand, to highlight the presence of classes that might require preventive maintenance actions (e.g., code review [7] or refactoring [35]) aimed at improving the quality of the source code. In this regard, previous researchers heavily investigated the feasibility of machine learning techniques for the identification of change-prone classes [20, 27, 28, 38, 54, 59, 60, 94, 128]. In the following, we discuss the advances achieved in the context of change prediction models. At the same time, as this paper reports on the role of code smell intensity, we also summarize the literature related to the detection and prioritization of code smells.

2.1 Change Prediction Approaches

The most relevant body of knowledge related to change prediction techniques is represented by the use of product and process metrics as independent variables able to characterize the change-proneness of software artifacts [2, 6, 17, 59, 60, 61, 118, 128]. Specifically, Romano et al. [94] relied on code metrics for predicting change-prone fat interfaces (i.e., poorly cohesive Java interfaces), while Eski et al. [28] proposed a model based on both CK and QMOOD metrics [11] to estimate change-prone classes and to determine parts of the source code that should be tested first and more deeply.

Conversely, Elish et al. [27] reported the potential usefulness of process metrics for change prediction. In particular, they defined a set of evolution metrics that describe the historical characteristics of software classes: for instance, they defined metrics like the birth date of a class or the total amount of changes applied in the past. As a result, their findings showed that a prediction model based on those evolution metrics can outperform structural-based techniques. These results were partially confirmed by Girba et al. [38], who defined a tool that suggests change-prone code elements by summarizing previous changes. In a small-scale empirical study involving two systems, they observed that previous changes can effectively predict future modifications.

More recently, Catolino et al. [20] empirically assessed the role of developer-related factors in change prediction. To this aim, they studied the performance of three developer-based prediction models relying on (i) the entropy of the development process [45], (ii) the number of developers working on a certain class [13], and (iii) the structural and semantic scattering of changes [24], showing that they can be more accurate than models based on product or process metrics. Furthermore, they also defined a combined model that considers a mixture of metrics and that was shown to be up to 22% more accurate than the previously defined ones.

Our work builds upon the findings reported above. In particular, we study to what extent the addition of information related to the presence and severity of code smells can contribute to the performance of change prediction models based on product, process, and developer-based metrics.

Another consistent part of the state of the art concerns the use of alternative methodologies to predict change-prone classes. For instance, the combination of (i) dependencies mined from UML diagrams and code metrics [43, 44, 96, 99, 100] and (ii) genetic and learning algorithms [62, 64, 89] has been proposed. Finally, some studies focused on the adoption of ensemble techniques for change prediction [18, 47, 55, 63]. In particular, Malhotra and Khanna [63] proposed a search-based solution to the problem, adopting a Particle Swarm Optimization (PSO)-based classifier [47] for predicting the change-proneness of classes. The study was conducted on five Android application packages and the results encouraged the use of the adopted solution for developing change prediction models. Kumar et al. [55] studied the correlation between 62 software metrics and the likelihood of a class to change in the future. Afterwards, they built a change prediction model considering eight different machine learning algorithms and two ensemble techniques. The results showed that, with the application of feature selection techniques, the change prediction models relying on ensemble classifiers can obtain better results. These results were partially contradicted by Catolino and Ferrucci [18], who empirically compared the performance of three ensemble techniques (i.e., Boosting, Random Forest, and Bagging) with the one of standard machine learning classifiers (e.g., Logistic Regression) on eight open source systems. The key results of the study showed how ensemble techniques in some cases perform better than standard machine learning approaches; however, the differences among them are generally small.

2.2 Code Smell Detection and Prioritization

Fowler defined “bad code smells” (shortly, “code smells” or simply “smells”) as “symptoms of the presence of poor design or implementation choices applied during the development of a software system” [35]. Starting from there, several researchers heavily investigated (i) how code smells evolve over time [84, 86, 91, 119, 120, 121], (ii) the way developers perceive them [79, 111, 126], and (iii) what is their impact on non-functional attributes of source code [1, 36, 50, 52, 77, 83, 104, 125]. All these studies came up with a shared conclusion: code smells negatively impact program comprehension, maintainability of source code, and development costs. In the scope of this paper, the most relevant empirical studies are those reported by Khomh et al. [52] and Palomba et al. [83], who explicitly investigated the impact of code smells on software change-proneness. Both studies reported that classes affected by design flaws tend to change more frequently than classes not affected by any code smell. Moreover, refactoring practices notably help in keeping under control the change-proneness of classes. These studies are clearly those that motivate our work: indeed, following the findings on the influence of code smells, we believe that the addition of information coming from the analysis of the severity of code smells can positively improve the performance of change prediction models. As explained later in Section 5, we measure the intensity rather than the simple presence/absence of smells because a severity metric can provide us with more fine-grained information on how much a design problem is “dangerous” for a certain source code class.

Starting from the findings on the negative impact of code smells on source code maintainability, the research community heavily focused on devising techniques able to automatically detect code smells. Most of these techniques rely on a two-step approach [51, 56, 65, 70, 72, 76, 117]: in the first step, a set of structural code metrics is computed and compared against predefined thresholds; in the second one, these metrics are combined using and/or operators in order to define detection rules. If the logical relationships expressed in such detection rules are violated, a code smell is identified. While these approaches already have good performance, Arcelli Fontana et al. [34] and Aniche et al. [3] proposed methods to further improve it by discarding false positive code smell instances or tailoring the thresholds of code metrics, respectively.

Besides structural analysis, the use of alternative sources of information for smell detection has been proposed. Ratiu et al. [93] and Palomba et al. [78, 80] showed how historical information can be exploited for detecting code smells. These approaches are particularly useful when dealing with design issues arising because of evolution problems (e.g., how a hierarchy evolves over time). On the other hand, Palomba et al. [82, 87] adopted Information Retrieval (IR) methods [8] to identify code smells characterized by promiscuous responsibilities (Blob classes).

Furthermore, Arcelli Fontana et al. [5, 31] and Kessentini et al. [16, 48, 49, 97] used machine learning and search-based algorithms to discover code smells, pointing out that a training set composed of one hundred instances is sufficient to reach very high values of accuracy. Nevertheless, Di Nucci et al. [25] recently showed that the performance of such techniques may vary depending on the exploited dataset.

Finally, Morales et al. [71] proposed a developer-based approach that leverages contextual information on the task a developer is currently working on to recommend which smells can be removed in the portion of source code related to the performed task.

In parallel with the definition of code smell detectors, several researchers faced the problem of prioritizing code smell instances based on their harmfulness for the overall maintainability of a software project. Vidal et al. [123] developed a semi-automated approach that recommends a ranking of code smells based on (i) past component modifications (e.g., number of changes during the system history), (ii) important modifiability scenarios, and (iii) the relevance of the kind of smell as assigned by developers. In a follow-up work, the same authors introduced new criteria for prioritizing groups of code anomalies as indicators of architectural problems in evolving systems [122].

Lanza and Marinescu [56] proposed a metric-based rule approach to detect code smells, i.e., to identify code problems called disharmonies. The classes (or methods) that contain a high number of disharmonies are considered more critical. Marinescu [66] also presented the Flaw Impact Score, i.e., a measure of the criticality of code smells that considers (i) the negative influence of a code smell on coupling, cohesion, complexity, and encapsulation; (ii) granularity, namely the type of component (method or class) that a smell affects; and (iii) severity, measured by one or more metrics analyzing the critical symptoms of the smell.

Murphy-Hill and Black [73] introduced an interactive visualization environment aimed at helping developers when assessing the harmfulness of code smell instances. The idea behind the tool is to visualize classes as petals, where an increase in code smell severity corresponds to an increase in petal size. Other studies exploited developers' knowledge in order to assign a level of severity with the aim of suggesting relevant refactoring solutions [69], while Zhao and Hayes [127] proposed a hierarchical approach to identify and prioritize refactoring operations based on the predicted improvement to the maintainability of the software.

Besides the prioritization approaches mentioned above, more recently Arcelli Fontana and Zanoni [30] proposed the use of machine learning techniques to predict code smell severity, reporting promising results. The same authors also proposed JCodeOdor [32], a code smell detector that is able to assign a level of severity by computing the so-called intensity index, i.e., the extent to which the set of structural metrics computed on smelly classes exceeds the predefined thresholds: the higher the distance between the actual and the threshold values, the higher the severity of a code smell instance. As explained later in the paper (Section 3), in the context of our study we adopt JCodeOdor since it has been previously evaluated on the dataset we exploited, reporting a high accuracy. This was therefore the best option we had to conduct our work.

3 Research Methodology

In this section, we report the definition and design of the empirical study we follow to test the contribution given by the addition of the code smell intensity index to existing change prediction models.

3.1 Research Questions

The goal of the empirical study was to evaluate the contribution of the intensity index in prediction models aimed at discovering change-prone classes, with the purpose of improving the allocation of resources in preventive maintenance tasks such as code inspection [7] or refactoring [35]. The quality focus is on the prediction performance of models that include code smell-related information when compared to state-of-the-art approaches, while the perspective is that of researchers, who want to evaluate the effectiveness of using information about code smells when identifying change-prone components. More specifically, the empirical investigation aimed at answering the following research questions:

– RQ1. To what extent does the intensity index improve the performance of existing change prediction models?

– RQ2. How does the model including the intensity index as predictor compare to a model built using antipattern metrics?

– RQ3. What is the gain provided by the intensity index to change prediction models when compared to other predictors?

– RQ4. What is the performance of a combined change prediction model that includes smell-related information?

As detailed in the next sections, the first research question (RQ1) aimed at investigating the contribution given by the intensity index within change prediction models built using different types of predictors, i.e., product, process, and developer-related metrics. In RQ2 we empirically compared models relying on two different types of smell-related information, i.e., the intensity index [32] and the antipattern metrics proposed by Taba et al. [110]. RQ3 was concerned with a fine-grained analysis aimed at measuring the actual gain provided by the addition of the intensity metric within different change prediction models. Finally, RQ4 had the goal of assessing the performance of a change prediction model built using a combination of smell-related information and other product, process, and developer-related features.

3.2 Context Selection

The context of the study was represented by a publicly available dataset coming from the Promise repository [67], which consisted of 43 releases of 14 open source software systems. Table 1 reports (i) the name of the considered projects, (ii) the number of releases for each of them, (iii) their size (min-max) in terms of minimum and maximum number of classes and KLOCs across the considered releases, (iv) the percentage (min-max) of change-prone classes (identified as explained later), and (v) the percentage (min-max) of classes affected by design problems (detected as explained later). Their selection was driven by our willingness to analyze projects having different application domains and size in order to decrease threats to external validity [37, 106]. It is important to remark that the use of the Promise dataset required some additional effort in cleaning the data it contains: indeed, as reported by Shepperd et al. [101], such a dataset may contain noise and/or erroneous entries that possibly negatively influence the results. To account for this aspect, before running our experiments we performed a data cleaning on the basis of the algorithm proposed by Shepperd et al. [101], which consists of 13 corrections


Table 1: Characteristics of the Software Projects in Our Dataset

System | Releases | Classes | KLOCs | % Change Cl. | % Smelly Cl.
Apache Ant | 5 | 83-813 | 20-204 | 24 | 11-16
Apache Camel | 4 | 315-571 | 70-108 | 25 | 9-14
Apache Forrest | 3 | 112-628 | 18-193 | 64 | 11-13
Apache Ivy | 1 | 349 | 58 | 65 | 12
Apache Log4j | 3 | 205-281 | 38-51 | 26 | 15-19
Apache Lucene | 3 | 338-2,246 | 103-466 | 26 | 10-22
Apache Pbeans | 2 | 121-509 | 13-55 | 37 | 21-25
Apache POI | 4 | 129-278 | 68-124 | 22 | 15-19
Apache Synapse | 3 | 249-317 | 117-136 | 26 | 13-17
Apache Tomcat | 1 | 858 | 301 | 76 | 4
Apache Velocity | 3 | 229-341 | 57-73 | 26 | 7-13
Apache Xalan | 4 | 909 | 428 | 25 | 12-22
Apache Xerces | 3 | 162-736 | 62-201 | 24 | 5-9
JEdit | 5 | 228-520 | 39-166 | 23 | 14-22

able to remove identical features, features with conflicting or missing values, etc. During this step, we removed 58 entries from the original dataset.

As for the code smells, our investigation considers six types of design problems, namely:

– God Class (a.k.a., Blob): a poorly cohesive class that implements different responsibilities;
– Data Class: a class whose only purpose is holding data;
– Brain Method: a large method implementing more than one function, being therefore poorly cohesive;
– Shotgun Surgery: a class where every change triggers many little changes to several other classes;
– Dispersed Coupling: a class having too many relationships with other classes of the project;
– Message Chains: a method containing a long chain of method calls.

The choice of focusing on these specific smells was driven by two main aspects: (i) on the one hand, we took into account code smells characterizing different design problems (e.g., excessive coupling vs. poorly cohesive classes/methods) and having different granularities; (ii) on the other hand, as explained in the next section, we could rely on a reliable tool to properly identify and compute their intensity in the classes of the exploited dataset.

3.3 RQ1 - The contribution of the Intensity Index

To answer our first research question, we needed to (i) identify code smells in the subject projects and compute their intensity, and (ii) select a set of existing change prediction models to which to add the information on the intensity of code smells. Furthermore, we proceeded with the training and testing of the built change prediction models. The following subsections detail the process conducted to perform such steps.


Table 2: Code Smell Detection Strategies (the complete names of the metrics are given in Table 3 and the explanation of the rules in Table 4)

Code Smell | Detection Strategy: LABEL(n) → LABEL has value n for that smell

God Class | LOCNAMM ≥ HIGH(176) ∧ WMCNAMM ≥ MEAN(22) ∧ NOMNAMM ≥ HIGH(18) ∧ TCC ≤ LOW(0.33) ∧ ATFD ≥ MEAN(6)

Data Class | WMCNAMM ≤ LOW(14) ∧ WOC ≤ LOW(0.33) ∧ NOAM ≥ MEAN(4) ∧ NOPA ≥ MEAN(3)

Brain Method | (LOC ≥ HIGH(33) ∧ CYCLO ≥ HIGH(7) ∧ MAXNESTING ≥ HIGH(6)) ∨ (NOLV ≥ MEAN(6) ∧ ATLD ≥ MEAN(5))

Shotgun Surgery | CC ≥ HIGH(5) ∧ CM ≥ HIGH(6) ∧ FANOUT ≥ LOW(3)

Dispersed Coupling | CINT ≥ HIGH(8) ∧ CDISP ≥ HIGH(0.66)

Message Chains | MaMCL ≥ MEAN(3) ∨ (NMCS ≥ MEAN(3) ∧ MeMCL ≥ LOW(2))

3.3.1 Code Smell Intensity Computation

To compute the severity of code smells in the context of our work we employed JCodeOdor [32], a tool that is able to both identify code smells and assign to them a degree of severity by computing the so-called intensity index. Such an index is represented by a real number contained in the range [1, 10]. Before reporting the detailed steps adopted by JCodeOdor to compute the intensity index, it is worth remarking that our choice to use this tool was driven by the following observations. In the first place, JCodeOdor works with all the code smells considered in our work: it is the prioritization approach in the literature that deals with the highest number of code smells (6 vs. the 5 treated by Vidal et al. [123]). Moreover, it is fully automated, meaning that it does not require any human intervention while computing the intensity of code smells. Finally, it is highly accurate: in a previous work by Palomba et al. [85] the tool was empirically assessed on the same dataset adopted in our context, showing an F-Measure of 80%. For these reasons, we believe that JCodeOdor was the best option we had to conduct our study.

From a technical point of view, given the set of classes composing a certain software system, the tool performs two basic steps to compute the intensity of code smells:

1. Detection Phase. Given a software system as input, the tool starts by detecting code smells relying on the detection strategies reported in Table 2. Basically, each strategy is represented by a logical composition of predicates, and each predicate is based on an operator that compares a metric with a threshold [56, 81]. Such detection strategies are similar to those defined by Lanza and Marinescu [56], who used the set of code metrics described in Table 3 to identify the six code smell types in our study. To ease the comprehension of the detection approach, Table 4 describes the rationale behind the use of each predicate of the detection strategies.


Table 3: Metrics used for Code Smells Detection

Short Name | Long Name | Definition
ATFD | Access To Foreign Data | The number of attributes from unrelated classes belonging to the system, accessed directly or by invoking accessor methods.
ATLD | Access To Local Data | The number of attributes declared by the current class accessed by the measured method directly or by invoking accessor methods.
CC | Changing Classes | The number of classes in which the methods that call the measured method are defined.
CDISP | Coupling Dispersion | The number of classes in which the operations called from the measured operation are defined, divided by CINT.
CINT | Coupling Intensity | The number of distinct operations called by the measured operation.
CM | Changing Methods | The number of distinct methods that call the measured method.
CYCLO | McCabe Cyclomatic Complexity | The maximum number of linearly independent paths in a method. A path is linear if there is no branch in the execution flow of the corresponding code.
FANOUT | | Number of called classes.
LOC | Lines Of Code | The number of lines of code of an operation or of a class, including blank lines and comments.
LOCNAMM | Lines of Code Without Accessor or Mutator Methods | The number of lines of code of a class, including blank lines and comments and excluding accessor and mutator methods and corresponding comments.
MaMCL | Maximum Message Chain Length | The maximum length of chained calls in a method.
MAXNESTING | Maximum Nesting Level | The maximum nesting level of control structures within an operation.
MeMCL | Mean Message Chain Length | The average length of chained calls in a method.
NMCS | Number of Message Chain Statements | The number of different chained calls in a method.
NOAM | Number Of Accessor Methods | The number of accessor (getter and setter) methods of a class.
NOLV | Number Of Local Variables | The number of local variables declared in a method. The method's parameters are considered local variables.
NOMNAMM | Number of Not Accessor or Mutator Methods | The number of methods defined locally in a class, counting public as well as private methods, excluding accessor or mutator methods.
NOPA | Number Of Public Attributes | The number of public attributes of a class.
TCC | Tight Class Cohesion | The normalized ratio between the number of methods directly connected with other methods through an instance variable and the total number of possible connections between methods. A direct connection between two methods exists if both access the same instance variable directly or indirectly through a method call. TCC takes its value in the range [0, 1].
WMCNAMM | Weighted Methods Count of Not Accessor or Mutator Methods | The sum of the complexity of the methods that are defined in the class and are not accessor or mutator methods. We compute the complexity with the Cyclomatic Complexity metric (CYCLO).
WOC | Weight Of Class | The number of "functional" (i.e., non-abstract, non-accessor, non-mutator) public methods divided by the total number of public members.


Table 4: Code Smell Detection Rationale and Details

Smell | Clause | Rationale
God Class | LOCNAMM ≥ HIGH | Too much code. We use LOCNAMM instead of LOC, because getter and setter methods are often generated by the IDE. A class that has getter and setter methods and a class that has not must have the same "probability" to be detected as God Class.
God Class | WMCNAMM ≥ MEAN | Too much work and complex. Each method has a minimum cyclomatic complexity of one, hence getter and setter methods also add cyclomatic complexity to the class. We decided to use a complexity metric that excludes them from the computation.
God Class | NOMNAMM ≥ HIGH | Implements a high number of functions. We exclude getters and setters because we consider only the methods that effectively implement functionality of the class.
God Class | TCC ≤ LOW | Functions accomplish different tasks.
God Class | ATFD ≥ MEAN | Uses many data from other classes.
Data Class | WMCNAMM ≤ LOW | Methods are not complex. Each method has a minimum cyclomatic complexity of one, hence getter and setter methods also add cyclomatic complexity to the class. We decided to use a complexity metric that excludes them from the computation.
Data Class | WOC ≤ LOW | The class offers few functionalities. This metric is computed as the number of functional (non-accessor) public methods divided by the total number of public methods. A low value for the WOC metric means that the class offers few functionalities.
Data Class | NOAM ≥ MEAN | The class has many accessor methods.
Data Class | NOPA ≥ MEAN | The class has many public attributes.
Brain Method | LOC ≥ HIGH | Too much code.
Brain Method | CYCLO ≥ HIGH | High functional complexity.
Brain Method | MAXNESTING ≥ HIGH | High functional complexity. Difficult to understand.
Brain Method | NOLV ≥ MEAN | Difficult to understand. The more local variables, the more difficult the method is to understand.
Brain Method | ATLD ≥ MEAN | Uses many of the data of the class. The more attributes of the class the method uses, the more difficult it is to understand.
Shotgun Surgery | CC ≥ HIGH | Many classes call the method.
Shotgun Surgery | CM ≥ HIGH | Many methods to change.
Shotgun Surgery | FANOUT ≥ LOW | The method is subject to being changed. If a method interacts with other classes, it is not a trivial one. We use the FANOUT metric to restrict Shotgun Surgery to those methods that are more subject to change, excluding for example most of the getter and setter methods.
Dispersed Coupling | CINT ≥ HIGH | The method calls too many other methods. With the CINT metric, we measure the number of distinct methods called from the measured method.
Dispersed Coupling | CDISP ≥ HIGH | Calls are dispersed in many classes. With the CDISP metric, we measure the dispersion of called methods: the number of classes in which the methods called from the measured method are defined, divided by CINT.
Message Chains | MaMCL ≥ MEAN | Maximum Message Chain Length. A Message Chain has a minimum length of two chained calls, because a single call is trivial. We use the MaMCL metric to find the methods that have at least one chained call with a length greater than the mean.
Message Chains | NMCS ≥ MEAN | Number of Message Chain Statements. There can be more than one Message Chain Statement, i.e., different chains of calls. The more Message Chain Statements, the more the method is interesting with respect to the Message Chains code smell.
Message Chains | MeMCL ≥ LOW | Mean Message Chain Length. We want to find non-trivial Message Chains, so we always need to check the Message Chain Statement length.


On the basis of these detection rules, a class/method of a project is marked as smelly if one of the logical propositions shown in Table 2 is true, i.e., if the actual metrics computed on the class/method exceed the threshold values defined in the detection strategy. It is worth pointing out that the thresholds used by JCodeOdor were empirically calibrated on 74 systems belonging to the Qualitas Corpus dataset [115] and are derived from the statistical distribution of the metrics contained in the dataset [32, 33].

2. Intensity Computation. If a class/method is identified by the tool as smelly, the actual value of a given metric used for the detection will exceed the threshold value, and it will correspond to a percentile value on the metric distribution placed between the threshold and the maximum observed value of the metric in the system under analysis. The placement of the actual metric value in that range represents the “exceeding amount” of a metric with respect to the defined threshold. Such “exceeding amounts” are then normalized in the range [1, 10] using a min-max normalization process [116]: specifically, this is a feature scaling technique where the values of a numeric range are reduced to a scale between 1 and 10. To compute z, i.e., the normalized value, the following formula is applied:

z = [(x − min(x)) / (max(x) − min(x))] · 10    (1)

where min and max are the minimum and maximum values observed in the distribution. This step puts the “exceeding amount” of each metric on the same scale. To have a unique value representing the intensity of the code smell affecting the class, the mean of the normalized “exceeding amounts” is computed.
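To make the two steps above more concrete, the following is a minimal sketch, in Python, of how a severity score in the spirit of the intensity index could be derived for a single God Class candidate. The detection thresholds come from Table 2, but the class under analysis, the system-wide maxima, and the way the exceeding amount is mapped onto [1, 10] are simplified, illustrative assumptions rather than JCodeOdor's actual implementation.

```python
import numpy as np

# God Class detection strategy from Table 2 (metric >= threshold),
# plus the TCC <= threshold predicate handled separately.
GOD_CLASS_THRESHOLDS = {"LOCNAMM": 176, "WMCNAMM": 22, "NOMNAMM": 18, "ATFD": 6}
TCC_THRESHOLD = 0.33

def is_god_class(metrics):
    """True if every predicate of the God Class detection strategy holds."""
    return (all(metrics[m] >= t for m, t in GOD_CLASS_THRESHOLDS.items())
            and metrics["TCC"] <= TCC_THRESHOLD)

def exceeding_amount(value, threshold, system_max):
    """Scale how far a metric exceeds its threshold into [1, 10].
    Simplified: the real index places the value on the metric distribution
    between the threshold and the maximum observed in the system."""
    if system_max <= threshold:
        return 10.0
    z = (value - threshold) / (system_max - threshold) * 10.0
    return min(10.0, max(1.0, z))

def intensity(metrics, system_max):
    """Mean of the normalized exceeding amounts; 0 for non-smelly classes."""
    if not is_god_class(metrics):
        return 0.0
    amounts = [exceeding_amount(metrics[m], t, system_max[m])
               for m, t in GOD_CLASS_THRESHOLDS.items()]
    return float(np.mean(amounts))

# Hypothetical metric values for one class and system-wide maxima.
metrics = {"LOCNAMM": 420, "WMCNAMM": 35, "NOMNAMM": 27, "ATFD": 9, "TCC": 0.20}
system_max = {"LOCNAMM": 900, "WMCNAMM": 80, "NOMNAMM": 60, "ATFD": 20}
print(round(intensity(metrics, system_max), 2))
```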

3.3.2 Selection of Basic Prediction Models

Our conjecture was concerned with the gain given by the addition of information on the intensity of code smells within existing change prediction models. To test such a conjecture, we needed to identify the state-of-the-art techniques to which to add the intensity index: we selected three models based on product, process, and developer-related metrics that were shown to be accurate in the context of change prediction [20, 24, 27, 128].

Product Metrics-based Model. The first basic baseline is represented by the change prediction model devised by Zhou et al. [128]. It is composed of a set of metrics computed on the basis of the structural properties of source code: these are cohesion (i.e., the Lack of Cohesion of Methods, LCOM), coupling (i.e., the Coupling Between Objects, CBO, and the Response for a Class, RFC), and inheritance metrics (i.e., the Depth of Inheritance Tree, DIT). To actually compute these metrics, we relied on a publicly available tool originally developed by Spinellis [108]. In the following, we refer to this model as SM, i.e., Structural Model.


Table 5: Independent variables considered by Elish et al.

Acronym | Metric
BOC | Birth of a Class
FCH | First Time Changes Introduced to a Class
FRCH | Frequency of Changes
LCH | Last Time Changes Introduced to a Class
WCD | Weighted Change Density
WFR | Weighted Frequency of Changes
TACH | Total Amount of Changes
ATAF | Aggregated Change Size Normalized by Frequency of Change
CHD | Change Density
LCA | Last Change Amount
LCD | Last Change Density
CSB | Changes since the Birth
CSBS | Changes since the Birth Normalized by Size
ACDF | Aggregated Change Density Normalized by Frequency of Change
CHO | Change Occurred

Process Metrics-based Model. In their study, Elish et al. [27] reported that process metrics can be exploited as better predictors of change-proneness with respect to structural metrics. For this reason, our second baseline was the Evolution Model (EM) proposed by Elish et al. [27]. More specifically, this model relies on the metrics shown in Table 5, which capture different aspects of the evolution of classes, e.g., the weighted frequency of changes or the first time changes were introduced. To compute these metrics, we adopted the tool that was previously developed by Catolino et al. [20]. In the following, we refer to this model as PM, i.e., Process Model.

Developer-Related Model. In our previous work [20], we demonstrated how developer-related factors can be exploited within change prediction models since they provide orthogonal information with respect to product and process metrics that takes into account how developers perform modifications and how complex the development process is. Among all the developer-based models available in the literature [13, 24, 45], in this paper we relied on the Developer Changes Based Model (DCBM) devised by Di Nucci et al. [24], as it was shown to be the most effective one in the context of change prediction. Such a model uses as predictors the so-called structural and semantic scattering of the developers that worked on a code component in a given time period α. Specifically, for each class c, the two metrics are computed as follows:

StrScatPred_{c,α} = Σ_{d ∈ developers_{c,α}} StrScat_{d,α}    (2)

SemScatPred_{c,α} = Σ_{d ∈ developers_{c,α}} SemScat_{d,α}    (3)


where developers_{c,α} represents the set of developers that worked on the class c during a certain period α, and the functions StrScat_{d,α} and SemScat_{d,α} return the structural and semantic scattering, respectively, of a developer d in the time window α. Given the set CH_{d,α} of classes changed by a developer d during a certain period α, the formula of the structural scattering of a developer is:

StrScat_{d,α} = |CH_{d,α}| × average_{∀ c_i, c_j ∈ CH_{d,α}} [dist(c_i, c_j)]    (4)

where dist is the distance, in number of packages, from class c_i to class c_j. The structural scattering is computed by applying the shortest path algorithm on the graph representing the system's package structure. The semantic scattering of a developer, instead, is based on the textual similarity of the classes changed by the developer in a certain period α and is computed as:

SemScat_{d,α} = |CH_{d,α}| × 1 / average_{∀ c_i, c_j ∈ CH_{d,α}} [sim(c_i, c_j)]    (5)

where the sim function returns the textual similarity between the classes c_i and c_j according to the measurement performed using the Vector Space Model (VSM) [9]. The metric ranges between zero (no textual similarity) and one (the textual content of the two classes is identical). In our study, we set the parameter α of the approach as the time window between two releases R − 1 and R, as done in previous work [85].
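As an illustration, the sketch below computes the two scattering measures of Eqs. (4) and (5) for the set of classes changed by one developer in a period α. It is only a rough approximation under stated assumptions: the package distance is derived from the package names instead of a shortest path over the package graph, and the textual similarity uses TF-IDF cosine similarity as a stand-in for the VSM measurement; all class names and sources are hypothetical.

```python
from itertools import combinations
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def package_distance(a, b):
    """Rough package distance: number of differing package segments
    (the paper uses a shortest path over the package structure graph)."""
    pa, pb = a.split(".")[:-1], b.split(".")[:-1]
    common = sum(1 for x, y in zip(pa, pb) if x == y)
    return (len(pa) - common) + (len(pb) - common)

def structural_scattering(changed):
    """Eq. (4): |CH| times the average package distance between changed classes."""
    pairs = list(combinations(changed, 2))
    if not pairs:
        return 0.0
    return len(changed) * float(np.mean([package_distance(a, b) for a, b in pairs]))

def semantic_scattering(changed, source_of):
    """Eq. (5): |CH| times the inverse of the average textual similarity."""
    pairs = list(combinations(changed, 2))
    if not pairs:
        return 0.0
    tfidf = TfidfVectorizer().fit_transform([source_of[c] for c in changed])
    idx = {c: i for i, c in enumerate(changed)}
    avg_sim = float(np.mean([cosine_similarity(tfidf[idx[a]], tfidf[idx[b]])[0, 0]
                             for a, b in pairs]))
    return len(changed) / max(avg_sim, 1e-6)  # guard against zero similarity

# Hypothetical classes changed by one developer in the period alpha.
changed = ["org.foo.Parser", "org.foo.util.Cache", "org.bar.Renderer"]
source_of = {c: "read token stream and build the abstract syntax tree" for c in changed}
print(structural_scattering(changed), round(semantic_scattering(changed, source_of), 2))
```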

It is important to note that all the baseline models might be affected by multi-collinearity [74], which occurs when two or more independent variables are highly correlated and can be predicted one from the other, thus possibly leading to a decrease of the prediction capabilities of the resulting model [102, 113]. For this reason, we decided to use the vif (variance inflation factor) function [74] implemented in the R car package (http://cran.r-project.org/web/packages/car/index.html) to discard non-relevant variables. Vif is based on the square of the multiple correlation coefficient resulting from regressing a predictor variable against all other predictor variables. If a variable has a strong linear relationship with at least one other variable, the correlation coefficient would be close to 1, and the VIF for that variable would be large. A VIF greater than 10 is a signal that the model has a collinearity problem.
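The same filtering can be reproduced outside R. The sketch below shows one possible equivalent in Python, assuming the predictors are available as a pandas DataFrame and using statsmodels to compute the variance inflation factors; it iteratively drops the worst predictor until every VIF is at most 10, mirroring the rule stated above.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_collinear(features: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively remove the predictor with the largest VIF above the threshold."""
    cols = list(features.columns)
    while len(cols) > 1:
        X = sm.add_constant(features[cols])
        # index 0 is the constant term, so predictor i sits at column i + 1
        vifs = {c: variance_inflation_factor(X.values, i + 1) for i, c in enumerate(cols)}
        worst, value = max(vifs.items(), key=lambda kv: kv[1])
        if value <= threshold:
            break
        cols.remove(worst)
    return features[cols]
```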

3.3.3 Dependent Variable

Our dependent variable is represented by the actual change-proneness of the classes in our dataset. To compute it, we followed the guidelines provided by Romano et al. [94], who considered a class as change-prone if, in a given time period TW, it underwent a number of changes higher than the median of the distribution of the number of changes experienced by all the classes of the system.


Table 6: Changes extracted by ChangeDistiller while computing the change-proneness. ‘X’ symbols indicate the types we considered in our study.

ChangeDistiller | Our Study
Statement-level changes
Statement Ordering Change | X
Statement Parent Change | X
Statement Insert | X
Statement Delete | X
Statement Update | X
Class-body changes
Insert attribute | X
Delete attribute | X
Declaration-part changes
Access modifier update | X
Final modifier update | X
Declaration-part changes
Increasing accessibility change | X
Decreasing accessibility change | X
Final Modified Insert | X
Final Modified Delete | X
Attribute declaration changes
Attribute type change | X
Attribute renaming change | X
Method declaration changes
Return type insert | X
Return type delete | X
Return type update | X
Method renaming | X
Parameter insert | X
Parameter delete | X
Parameter ordering change | X
Parameter renaming | X
Class declaration changes
Class renaming | X
Parent class insert | X
Parent class delete | X
Parent class update | X

In particular, for each pair of commits (c_i, c_{i+1}) of TW we ran ChangeDistiller [29], a tree differencing algorithm able to extract the fine-grained code changes between c_i and c_{i+1}. Table 6 reports the entire list of change types identified by the tool. As it is possible to observe, we considered all of them while computing the number of changes. It is worth mentioning that the tool ignores white space-related differences and documentation-related updates: in this way, it only considers the changes actually applied on the source code. More importantly, ChangeDistiller is able to identify rename refactoring operations: this means that we could handle cases where a class was modified during the change history, thus not biasing the correct counting of the number of changes. The dataset with the oracle is available in the online appendix [19].
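The labeling rule itself is simple; a minimal sketch is shown below, assuming the per-class change counts for the time window have already been extracted (e.g., with ChangeDistiller). The class names and counts are hypothetical.

```python
import statistics

def label_change_prone(change_counts: dict) -> dict:
    """A class is change-prone (1) if its number of changes in the time window
    is higher than the median over all classes of the system, else 0."""
    median = statistics.median(change_counts.values())
    return {cls: int(count > median) for cls, count in change_counts.items()}

counts = {"Parser": 42, "Lexer": 3, "AstNode": 17, "Cache": 5}
print(label_change_prone(counts))  # {'Parser': 1, 'Lexer': 0, 'AstNode': 1, 'Cache': 0}
```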

3.3.4 Experimented Machine Learning Models

Once we had defined the dependent and independent variables of interest, we could finally build machine learning models. As we were interested in understanding and measuring the contribution given by the intensity within the three baselines, we built two types of prediction models for each baseline: the first type does not include the intensity index as predictor, thus relying on the original features only; the second type includes the intensity index as an additional predictor. Using this procedure, we experimented with 6 different models, and we could control the actual amount of improvement given by the intensity index with respect to the baselines (if any). It is worth remarking that, for non-smelly classes, the intensity value is set to 0.

3.3.5 Classifier Selection

The final step of our prediction model construction methodology was concerned with the selection of the machine learning classifier best able to distinguish change-prone and non-change-prone classes. The current literature proposed several alternatives (e.g., Romano and Pinzger [94] adopted Support Vector Machines [15], while Tsantalis et al. [118] relied on Logistic Regression [57]), and thus there is not a bullet-proof solution that ensures the best overall performance. For this reason, in our work we experimented with different classifiers, i.e., ADTree [124], Decision Table Majority [53], Logistic Regression [21], Multilayer Perceptron [95], Naive Bayes [46], and Simple Logistic Regression [90]. To select the right classifier to use in our situation, we empirically compared the results achieved when applying each classifier on each experimented model on the software systems in our study. Overall, the best results were obtained using Simple Logistic Regression. In the remainder of the paper, we only report the results obtained when using this classifier, while a complete report of the performance of the other classifiers is available in our online appendix [19].

3.3.6 Validation Strategy

As validation strategy, we adopted 10-fold cross validation [109]. This methodology randomly partitions the data into 10 folds of equal size, applying stratified sampling. A single fold is used as the test set, while the remaining ones are used as the training set. The process was repeated 10 times, using each time a different fold as the test set. Then, the model performance was reported using the mean achieved over the ten runs. It is important to note that we repeated the 10-fold validation 100 times (each time with a different seed) to cope with the randomness arising from using different data splits [42].
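A possible sketch of this validation scheme with scikit-learn is given below; it assumes a feature matrix X and a binary label vector y, and uses plain logistic regression as a stand-in for Weka's Simple Logistic classifier used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def repeated_cv_f_measure(X, y, n_repeats: int = 100) -> float:
    """Mean F-Measure over repeated stratified 10-fold cross validation."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=42)
    clf = LogisticRegression(max_iter=1000)  # stand-in for Simple Logistic
    return float(np.mean(cross_val_score(clf, X, y, cv=cv, scoring="f1")))
```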

3.3.7 Evaluation Metrics

To measure and compare the performance of the models, we started by computing two well-known metrics, precision and recall [10], which are defined as follows:

precision = TP / (TP + FP)        recall = TP / (TP + FN)    (6)


where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. In the second place, to have a unique value representing the goodness of the model, we computed the F-Measure, i.e., the harmonic mean of precision and recall:

F-Measure = 2 · (precision · recall) / (precision + recall)    (7)

Moreover, we considered another indicator: the Area Under the ROC Curve (AUC-ROC) metric. This measure quantifies the overall ability of a change prediction model to discriminate between change-prone and non-change-prone classes: the closer the AUC-ROC to 1, the higher the ability of the classifier, while the closer the AUC-ROC to 0.5, the lower its accuracy. In other words, this metric quantifies how robust the model is when discriminating between the two binary classes.
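These indicators map directly onto standard library functions; the sketch below computes them with scikit-learn, assuming y_true holds the actual change-proneness labels, y_pred the binary predictions, and y_score the predicted probabilities of the change-prone class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_pred, y_score) -> dict:
    """Precision, recall, F-Measure (Eqs. 6-7) and AUC-ROC for one model."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_score),
    }
```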

In addition, we compared the performance achieved by the experimented prediction models from a statistical point of view. As we performed comparisons over multiple datasets, we employed the Scott-Knott Effect Size Difference (ESD) test [112, 114], considering the AUC-ROC that the different models obtained over the considered systems. This test represents an effect-size-aware variant of the Scott-Knott test [98]: differently from the original one, it (i) hierarchically clusters the set of treatment means into statistically distinct groups, (ii) corrects the non-normal distribution of a dataset if needed, and (iii) merges two statistically distinct groups in case their effect size, measured using Cliff's Delta (or d) [39], is negligible, so that the creation of trivial groups is avoided. To perform the test, we relied on the implementation provided by Tantithamthavorn et al. [114].

3.4 RQ2 - Comparison between Intensity Index and Antipattern Metrics

In RQ2 our goal was to compare the performance of change prediction models relying on the intensity index against the one achieved by models exploiting other existing smell-related metrics. In particular, the comparison was done considering the so-called antipattern metrics, which were defined by Taba et al. [110]: these are three metrics aimed at capturing different aspects related to the maintainability of classes affected by code smells. More specifically:

– the Average Number of Antipatterns (ANA) computes how many code smells there were in the previous releases of a class over the total number of releases. This metric is based on the assumption that classes that have been more prone to be smelly in the past are somehow more prone to be smelly in the future;

– the Antipattern Complexity Metric (ACM) computes the entropy of changes involving smelly classes. Such entropy refers to the one originally defined by Hassan [45] in the context of defect prediction. The conjecture behind its use relates to the fact that a more complex development process might lead to the introduction of code smells;


– the Antipattern Recurrence Length (ARL) measures the total number of subsequent releases in which a class has been affected by a smell. This metric relies on the same underlying conjecture as ANA, i.e., the more a class has been smelly in the past, the more it will be smelly in the future.

To compute these metrics, we employed the tool developed by Palomba et al. [85]. Then, as done in the context of RQ1, we plugged the antipattern metrics into the experimented baselines and assessed the performance of the resulting change prediction models using the same set of evaluation metrics described in Section 3.3.7, i.e., F-Measure and AUC-ROC. Finally, we statistically compared such performance with the one obtained by the models including the intensity index as predictor.

Besides the comparison in terms of evaluation metrics, we also analyzed the extent to which the two types of models are complementary with respect to the classification of change-prone classes. This was done with the aim of assessing whether the two models, relying on different smell-related information, can correctly identify the change-proneness of different classes. More formally, let m_int be the model built by plugging in the intensity index and m_ant the model built by considering the antipattern metrics; we computed the following overlap metrics on the set of smelly and change-prone instances of each system:

TP_{m_int ∩ m_ant} = |TP_{m_int} ∩ TP_{m_ant}| / |TP_{m_int} ∪ TP_{m_ant}| %    (8)

TP_{m_int \ m_ant} = |TP_{m_int} \ TP_{m_ant}| / |TP_{m_int} ∪ TP_{m_ant}| %    (9)

TP_{m_ant \ m_int} = |TP_{m_ant} \ TP_{m_int}| / |TP_{m_ant} ∪ TP_{m_int}| %    (10)

where TP_{m_int} represents the set of change-prone classes correctly classified by the prediction model m_int, while TP_{m_ant} is the set of change-prone classes correctly classified by the prediction model m_ant. The TP_{m_int ∩ m_ant} metric measures the overlap between the sets of true positives correctly identified by both models m_int and m_ant, TP_{m_int \ m_ant} measures the percentage of change-prone classes correctly classified by m_int only and missed by m_ant, and TP_{m_ant \ m_int} measures the percentage of change-prone classes correctly classified by m_ant only and missed by m_int.
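Given the sets of true positives of the two models, the three overlap metrics of Eqs. (8)-(10) reduce to simple set operations, as in the sketch below; the example sets are hypothetical.

```python
def overlap_metrics(tp_int: set, tp_ant: set) -> dict:
    """Overlap (Eq. 8) and exclusive shares (Eqs. 9-10) of the true positives
    of the intensity-including (tp_int) and antipattern (tp_ant) models, in %."""
    union = tp_int | tp_ant
    if not union:
        return {"both": 0.0, "int_only": 0.0, "ant_only": 0.0}
    return {
        "both": 100 * len(tp_int & tp_ant) / len(union),
        "int_only": 100 * len(tp_int - tp_ant) / len(union),
        "ant_only": 100 * len(tp_ant - tp_int) / len(union),
    }

print(overlap_metrics({"A", "B", "C"}, {"B", "C", "D"}))
```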

3.5 RQ3 - Gain Provided by the Intensity Index

In RQ3 we conducted a fine-grained investigation aimed at measuring how important the intensity index is with respect to the other features (i.e., product, process, developer-related, and antipattern metrics) composing the experimented models. To this aim, we used an information gain algorithm [92] to quantify the gain provided by adding the intensity index to each prediction model. In our context, this algorithm ranked the features of the models according to their ability to predict the change-proneness of classes. More specifically, let M be a change prediction model and P = {p_1, . . . , p_n} the set of predictors composing M; an information gain algorithm [92] applies the following formula to compute a measure that defines the difference in entropy from before to after the set M is split on an attribute p_i:

InfoGain(M, p_i) = H(M) − H(M | p_i)    (11)

where the function H(M) indicates the entropy of the model that includes the predictor p_i, while the function H(M | p_i) measures the entropy of the model that does not include p_i. Entropy is computed as follows:

H(M) = − Σ_{i=1}^{n} prob(p_i) · log_2 prob(p_i)    (12)

From a more practical perspective, the algorithm quantifies how much uncertainty in M was reduced after splitting M on predictor p_i. In our work, we employed the Gain Ratio Feature Evaluation algorithm [92] implemented in the Weka toolkit [41], which ranks p_1, . . . , p_n in descending order based on the contribution provided by p_i to the decisions made by M. In particular, the output of the algorithm is a ranked list in which the predictors having the higher expected reduction in entropy are placed at the top. Using this procedure, we evaluated the relevance of the predictors in the experimented change prediction models, possibly understanding whether the addition of the intensity index gives a higher contribution with respect to the structural metrics from which it is derived (i.e., the metrics used for the detection of the smells) or with respect to the other metrics contained in the models.
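To illustrate the underlying idea of Eqs. (11)-(12), the sketch below computes the entropy reduction obtained by splitting the change-proneness labels on a single discretized predictor. It is a plain information-gain computation, not the exact Gain Ratio algorithm implemented in Weka, and the toy labels and predictor values are assumptions.

```python
import numpy as np

def entropy(labels) -> float:
    """Shannon entropy of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(labels, predictor) -> float:
    """Entropy reduction obtained by splitting the labels on a discrete predictor."""
    labels, predictor = np.asarray(labels), np.asarray(predictor)
    conditional = 0.0
    for value in np.unique(predictor):
        mask = predictor == value
        conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional

# Toy example: change-proneness labels vs. a binarized "high intensity" flag.
change_prone = [1, 1, 0, 0, 1, 0]
high_intensity = [1, 1, 0, 0, 1, 1]
print(round(information_gain(change_prone, high_intensity), 3))
```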

3.6 RQ4 - Combining All Predictors and Smell-Related Information

As a final step of our study, we aimed at devising a combined model able to mix together standard change-proneness predictors (i.e., structural, process, and developer-related metrics) and smell-related information to achieve better prediction performance. To this end, we first put all the independent variables considered in the study into a single dataset. Then, we applied the variable removal procedure based on the vif function (see Section 3.3.2 for details on this technique): in this way, we removed the independent variables that do not significantly influence the performance of the combined model. Finally, we tested the ability of the newly devised model using the same procedures and metrics used in the context of RQ1, i.e., F-Measure, AUC-ROC, and Brier score, and statistically compared the performance of the experimented models by means of the Scott-Knott ESD test.
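A possible sketch of the vif-based removal step is shown below; it assumes the predictors are collected in a pandas DataFrame, and the commonly used threshold of 10 is our assumption rather than a documented choice of the study:

```python
# Illustrative sketch of VIF-based variable removal; column names and the
# threshold are assumptions, not the exact setup of the study.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_collinear(features: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the predictor with the largest VIF until all VIFs
    fall below the threshold."""
    features = features.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(features.values, i)
             for i in range(features.shape[1])],
            index=features.columns)
        worst = vifs.idxmax()
        if vifs[worst] < threshold or features.shape[1] == 1:
            return features
        features = features.drop(columns=[worst])

# Usage (hypothetical): predictors = drop_collinear(all_metrics_df)
```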


4 Analysis of the Results

Fig. 1: Overview of the value of F-measure among the models

Fig. 2: Overview of the value of AUC-ROC among the models

In this section we report and sum up the results of the presented research questions, discussing the main findings of our study.

4.1 RQ1-RQ2: The performance of the intensity-including models and their comparison with the state of the art

Before describing the results related to the contribution of the intensity index in the three prediction models considered, we report the results of the feature selection process aimed at avoiding multi-collinearity. According to the results achieved using the vif function [74], we removed FCH, LCH, WFR, ATAF, CHD, LCD, CSBS, and ACDF from the process-based model [27], while we did not remove any variables from the other baselines.


Figures 1 and 2 show the box plots reporting the distributions of F-Measure and AUC-ROC achieved by (i) the basic models that do not include any smell-related information - SM, PM, and DCBM, respectively; (ii) the models including the antipattern metrics - those having "+ Ant. Metrics" as suffix; and (iii) the models including the intensity index - those reporting "+ Int." as suffix. Note that, for the sake of readability, we only report the distribution of F-Measure rather than the distributions of precision and recall. Detailed results for those metrics are available in our online appendix [19].

In the first place, looking at Figure 1, we can observe that the basic model based on scattering metrics (i.e., DCBM) tends to perform better than the models built using structural and process metrics. Indeed, DCBM [24] has a median F-Measure 13% and 5% higher than the structural (67% vs 54%) and process (67% vs 62%) models, respectively. On the one hand, this result confirms our previous findings on the power of developer-related factors in change prediction [20]; on the other hand, we can confirm the results achieved by Di Nucci et al. [24] on the value of the scattering metrics for the prediction of problematic classes. As for the role of the intensity index, we notice that, with respect to the SM, PM, and DCBM models, the intensity of code smells provides additional useful information able to increase the ability of the models to discover change-prone code components. This is observable by looking at the performance reported in Figures 1 and 2. In the following, we further discuss our findings by reporting our results for each prediction model experimented, comparing the intensity-including ones with the state of the art.

Contribution in Structural-based Models. The addition of the intensity index within the SM model allows the model to reach a median F-Measure of 60% and an AUC-ROC of 61%. When compared against the antipattern metrics-including model, the intensity-including one still performs better (i.e., +4% in terms of median F-Measure and +7% in terms of median AUC-ROC).

Looking deeper into the results, we observed that the shapes of the box plots for the intensity-including model appear less dispersed than those of the basic one, meaning that the addition of the intensity index makes the performance of the model better and more stable. For instance, considering the Apache-ant-1.3 project, the basic structural model reached 50% precision and 56% recall (F-Measure=53%), while the model that includes the intensity index has a precision of 61% and a recall of 66% (F-Measure=63%), thus obtaining an improvement of 10%. The same happens for all the considered systems: we can therefore claim that the performance of change prediction models strongly improves when considering the intensity of code smells as an additional independent variable.
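For reference, these F-Measure values follow directly from the harmonic mean of precision and recall; for the intensity-including model on Apache-ant-1.3, for example:

\[ \text{F-Measure} = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.61 \cdot 0.66}{0.61 + 0.66} \approx 0.63 \]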

When considering the structural model that includes the antipattern metrics defined by Taba et al. [110], we notice that its performance is only slightly better than the basic model in terms of F-Measure (56% vs 54%); more interestingly, the AUC-ROC of the "SM + Ant. Metrics" model is lower than that of the basic one (53% vs 55%). From a practical perspective, these results tell us


that the inclusion of the antipattern metrics only provides a slight improvement with respect to the number of actual change-prone classes identified, but at the same time cannot provide benefits in terms of the robustness of the classifications.

In the comparison between the SM + Ant. Metrics and the SM + Int. models, we observe that the performance of the former is always lower than that achieved by the latter (considering the median of the distributions, -4% in F-Measure and -8% in AUC-ROC). This indicates that the intensity index can provide much higher benefits in change prediction than existing metrics that capture other smell-related information. Nevertheless, in some cases the antipattern metrics defined by Taba et al. [110] can give complementary information with respect to the intensity index, opening the possibility to obtain even better performance by considering both metric types. Our claim is supported by the overlap analysis shown in Table 7 and computed on the set of change-prone and smelly classes correctly classified by the two models. While 43% of the instances are correctly classified by both models, a consistent portion of instances is classified only by the SM + Int. model (35%) or only by the model using the antipattern metrics (22%). Consequently, this means that the smell-related information taken into account by the SM + Int. and SM + Ant. Metrics models is orthogonal, and the two complement each other.

The observations made above were also confirmed from a statistical point of view. Indeed, the intensity-including prediction model consistently appeared in the top Scott-Knott ESD rank in terms of AUC-ROC, meaning that its performance was statistically higher than the baselines in most of the cases (40 projects out of 43).

Contribution in Process-based Models. Also in this case the addition of the intensity index to the model defined by Elish et al. [27] improved its performance with respect to the basic model (PM). Indeed, the overall median value of F-Measure increased by 15%, i.e., the F-Measure of PM + Int. is 77% while that of PM is 62%. An interesting aspect to discuss in this case is the ability of the intensity-including model to increase both precision and recall with respect to the basic model. This is, for instance, the case of Apache Ivy 2, where PM reaches 61% precision and 49% recall; by adding the intensity index, the prediction model increases its performance to 76% (+15%) in terms of precision and 77% (+28%) in terms of recall, demonstrating that a better characterization of the classes having design problems can help in obtaining more accurate predictions.

Looking at the baseline model that includes the antipattern metrics, we notice that it provides improvements when compared to the basic one. However, such improvements are still minor in terms of F-Measure (64% vs 62%), and thus we can confirm that the addition of the metrics proposed by Taba et al. [110] does not provide a relevant boost in the performance of basic change prediction models. Similarly, the model based on such metrics is never able to outperform the intensity-including one, performing up to 13% worse. At the same time, it is worth reporting an interesting complementarity between the sets of change-prone and smelly classes correctly


classified by "PM + Int." and by "PM + Ant. Metrics" (see Table 7), i.e., the two models correctly capture the change-proneness of different code elements.

The statistical analyses confirmed the findings discussed above. Indeed, the likelihood of being ranked at the top by the Scott-Knott ESD test is always higher for the model including the intensity index. At the same time, the antipattern metrics-including model was confirmed to provide only a slight statistical benefit over the basic one (they are ranked in the same cluster in 88% of the cases).

Contribution in Developer-Related Models. Finally, the results for this type of model are similar to those discussed above. Indeed, the addition of the intensity to DCBM [24] allows the model to reach a median F-Measure of 75% and an AUC-ROC of 74%. When compared to the standard DCBM model, the intensity-including one performs better (i.e., +7% in terms of median F-Measure and +6% in terms of median AUC-ROC). For instance, in the Apache Synapse 1.2 project the "DCBM + Int." model obtains an F-Measure and an AUC-ROC 12% and 13% higher than DCBM, respectively. The result holds for all the systems in our dataset, meaning that the addition of the intensity always provides improvements with respect to the baseline.

Comparing the performance of "DCBM + Int." with the model that includes the antipattern metrics [110], we observe that the F-Measure of the former is on average 6% higher than that of the latter; the better performance of the intensity-including model is also confirmed when considering the AUC-ROC, which is 4% higher. Nevertheless, also in this case we found some complementarities in the correct predictions made by these two models (see Table 7): indeed, only 44% of the instances are correctly caught by both models, while 31% of them are only captured by "DCBM + Int." and 25% only by "DCBM + Ant. Metrics". The Scott-Knott ESD test confirmed our observations. The likelihood of the intensity-including model being ranked at the top is always higher than that of the other models. At the same time, the antipattern metrics-including models were confirmed to provide statistically better performance than the basic models in 67% of the considered systems.

Table 7: Overlap analysis between the model including the intensity index and the model including the antipattern metrics.

Models       Int. ∩ Ant. (%)   Int. \ Ant. (%)   Ant. \ Int. (%)
SM [128]     43                35                22
PM [27]      47                38                15
DCBM [24]    44                31                25


RQ1 - To what extent does the intensity index improve the performance of existing change prediction models? The addition of the intensity index [85] as a predictor of change-prone components increases the performance of the baseline change prediction models in terms of F-Measure by up to 10%.

RQ2 - How does the model including the intensity index as predictor compare to a model built using antipattern metrics? The prediction models that include the antipattern metrics [110] only perform slightly better than the basic models, while they have lower performance than the intensity-including ones. However, we observed interesting complementarities between the sets of change-prone and smelly classes correctly classified by the models that include the intensity index and the models with antipattern metrics, which highlights the possibility to achieve higher performance through a combination of smell-related information.

4.2 RQ3: Analyzing the gain provided by the intensity index with respect to other predictors

In this section we analyze the results of the Gain Ratio Feature Evaluation algorithm [92] in order to understand how important the predictors composing the different models considered in this study are, with the aim of evaluating the predictive power of the intensity index when compared to the other predictors.

Table 8 shows the gain provided by the different predictors employed in the structural metrics-based change prediction model, while Table 9 reports the results for the process-based model and Table 10 those for the DCBM model. In particular, the tables report the ranking of the predictors based on their importance within the individual models through the values of the mean and the standard deviation (computed by considering the results obtained on the single systems) of the expected reduction in entropy caused by partitioning the prediction model according to a given predictor. In addition, we also provide the likelihood of the predictor being ranked at the top by the Scott-Knott ESD test, i.e., the percentage of times a predictor was statistically better than the others. The following subsections discuss our findings considering each prediction model individually.

Gain Provided to Structural-based Models [128]. The results in Table 8 show that Coupling Between Objects (CBO) is the metric having the highest predictive power, with an average reduction of entropy of 0.66 and a standard deviation of 0.09. The Scott-Knott ESD test statistically confirmed the importance of the predictor, since the information gain given by the metric was statistically higher than that of the other metrics in 82% of the cases. It is worth noting that this result is in line with previous findings in the field [12, 85, 128] which


Table 8: Gain Provided by Each Metric To The SM Prediction Model.

Metric                            Mean   St. Dev.   SK-ESD Likelihood
CBO                               0.66   0.09       82
RFC                               0.61   0.05       77
Intensity                         0.49   0.13       75
LOC                               0.44   0.11       55
LCOM                              0.43   0.12       51
Antipattern Complexity Metric     0.42   0.12       41
Antipattern Recurrence Length     0.31   0.05       32
Average Number of Antipatterns    0.22   0.10       21
DIT                               0.13   0.02       3

showed the relevance of coupling information for the maintainability of software classes. Looking at the ranking, we also noticed that Response For a Class (RFC), Lines of Code (LOC), and Lack of Cohesion of Methods (LCOM) appear to be relevant. On the one hand, this is still in line with previous findings [12, 85, 128]. On the other hand, it is also important to note that our results seem to reconsider the role of code size in change prediction. In particular, unlike the findings by Zhou et al. [128] on the confounding effect of size, we discovered that LOC can be an important predictor to discriminate change-prone classes. This may be due to the large dataset exploited in this study, which allows a higher level of generalizability. The Scott-Knott ESD test confirmed that these metrics are among the most powerful ones.

As for the variable of interest, i.e., the intensity index, we could observe that it is the feature providing the third highest gain in terms of reduction of entropy, with a mean of 0.49 and a standard deviation of 0.13. Looking at the results of the statistical test, we observed that the intensity index is ranked at the top by the Scott-Knott ESD test in 49% of the cases, thus confirming the high predictive power of the metric. These findings lead to two main observations. In the first place, the intensity index is a relevant variable and for this reason can provide high benefits for the prediction of change-prone classes (as also observed in RQ1). Secondly, and perhaps more interestingly, the intensity index can be more powerful than the structural metrics from which it is derived: in other words, a metric mixing together different structural aspects to measure how severe a code smell is seems to be more meaningful than the individual metrics used to derive the index.

As for the antipattern metrics, we observed that all of them appear to be less relevant than the intensity index. Somehow, this confirms the results of RQ2, where we showed that adding them to change prediction models results in a limited improvement with respect to the baseline. At the same time, it is worth noting that ACM (i.e., Antipattern Complexity Metric) may sometimes provide a notable contribution. While the average gain is 0.42, the standard deviation is 0.12: this means that the entropy reduction can be up to 0.54, as in the case of Apache-synapse-2.3. Thus, this result seems to suggest that this metric has some potential for effectively predicting the change-proneness


of classes. The other two antipattern metrics, i.e., Antipattern Recurrence Length (ARL) and Average Number of Antipatterns (ANA), instead, only provide a partial contribution to the reduction of entropy of the model. Indeed, their means are 0.31 and 0.22, respectively, with a standard deviation that is never above 0.1. The Scott-Knott ESD test statistically confirmed these findings: indeed, ACM was a top predictor in 41% of the datasets, as opposed to the ARL and ANA metrics, which appeared as statistically more powerful than other metrics only in 32% and 21% of the cases, respectively. Finally, Depth of Inheritance Tree (DIT) was the least powerful metric in the ranking, and the Scott-Knott ESD test ranked it at the top in only 3% of the cases.

Table 9: Gain Provided by Each Metric To The PM Prediction Model.

Metric                            Mean   St. Dev.   SK-ESD Likelihood
BOC                               0.56   0.05       75
FRCH                              0.55   0.06       64
Intensity                         0.44   0.08       61
WCD                               0.42   0.11       55
Antipattern Complexity Metric     0.41   0.04       54
LCA                               0.33   0.07       33
CHO                               0.28   0.03       31
Antipattern Recurrence Length     0.24   0.05       25
Average Number of Antipatterns    0.09   0.03       2
CSB                               0.07   0.01       1
TACH                              0.02   0.01       1

Gain Provided to Process-based Models [27]. Regarding the process metric-based model considered in this study, the results are similar to those of the structural model. Indeed, from Table 9 it is evident that the intensity index represents a good predictor; its mean is 0.44 and it is a top predictor in 61% of the datasets. It appears to be the third most powerful feature of the model, just behind the Birth of a Class and Frequency of Changes. On the one hand, this ranking is rather expected, as the top two features are those which fundamentally characterize the notion of process-based change prediction proposed by Elish et al. [27]. On the other hand, our findings report that the intensity index can effectively complement the process metrics present in the model, i.e., a structural-based indicator seems to be orthogonal with respect to the other basic features. As for the antipattern metrics, also in this case ACM turned out to be a potentially good predictor in 54% of the datasets, while ARL and ANA are top predictors in only 25% and 2% of the datasets, respectively. At the bottom of the ranking there are other basic metrics such as Changes since the Birth and Total Amount of Changes: this confirms previous findings [20] reporting that the overall number of previous changes cannot properly model the change-proneness of classes.

Gain Provided to the Developer-related Model [24]. Looking at the ranking of the features of the DCBM model, we can still confirm the results discussed so far. Indeed, the intensity index is the second most relevant factor, just


Table 10: Gain Provided by Each Metric To The DCBM Prediction Model.

Metric                            Mean   St. Dev.   SK-ESD Likelihood
Semantic Scattering               0.76   0.07       95
Intensity                         0.74   0.05       94
Structural Scattering             0.72   0.05       91
Antipattern Complexity Metric     0.66   0.04       78
Antipattern Recurrence Length     0.31   0.02       44
Average Number of Antipatterns    0.11   0.03       21

behind the semantic scattering: its mean is 0.74, and the Scott-Knott ESD test indicated the intensity index as a top predictor in 94% of the datasets. It is interesting to note that the index provides a higher contribution than the structural scattering, meaning that the combination of metrics from which it is derived can provide a higher entropy reduction than a structural metric that computes how far apart the classes touched by developers in a certain time window are. Regarding the antipattern metrics, the results are similar to those of the other models considered; ACM provided a fairly high amount of additional information to the model (mean=0.66), being ranked as a top predictor in 78% of the datasets. Instead, the means for ARL and ANA are notably lower (0.31 and 0.11, respectively), making them the least important features.

As a more general observation, it is worth noting that the mean information gain values of both the intensity index and ACM are much higher (≈0.20 more) for this model than for the structural- and process metrics-based models. This indicates that those metrics provide much more information to this model than to the others: this can be due to the limited number of features employed by this model, which makes the additional metrics more useful for predicting the change-proneness of classes.

All in all, we can confirm again that the intensity index provides a strong information gain to all the change prediction models considered in the study, together with ACM from the group of antipattern metrics. This possibly highlights how their combination could provide further improvements in the context of change prediction.

RQ3 - What is the gain provided by the intensity index to change prediction models when compared to other predictors? The intensity index provides a notable information gain in all the considered prediction models. At the same time, a metric of the complexity of the change process involving code smells seems to provide further additional information, highlighting the possibility to obtain even better change prediction performance when mixing different smell-related information.


4.3 RQ4: The performance of the combined smell-aware change prediction model

The results achieved in the previous research questions highlight the possibility to build a combined change prediction model that takes into account smell-related information besides structural, process, and developer-related metrics. For this reason, in the context of RQ4, we assessed the feasibility of a combined solution and evaluated its performance with respect to the results achieved by the stand-alone models that only rely on structural, process, or developer-related features. As explained in Section 3, to come up with the combined model we first put together all the features of the considered models and then applied a feature selection algorithm to discard irrelevant features. Starting from an initial set of 18 metrics, this procedure discarded DIT, CSB, TACH, and ANA. Thus, the combined model comprises 14 metrics: besides most of the basic structural, process, and developer-related predictors, the model includes three smell-related metrics, i.e., (i) the intensity index, (ii) ACM, and (iii) ARL.

Fig. 3: Overview of the value of F-Measure and AUC-ROC of the CombinedModel

Figure 3 shows the boxplots reporting the distributions of F-Measure and AUC-ROC related to the smell-aware combined change prediction model. To facilitate the comparison with the models exploited in the context of RQ1 and RQ2, we also report boxplots depicting the best models coming from our previous analyses, i.e., SM + Int., PM + Int., and DCBM + Int.

Looking at the figures, it seems clear that the combined model performs better than all the baseline approaches that exploit individual feature sets. Indeed, the median F-Measure reaches 88%, meaning that it is 18%, 11%, and 13% more accurate than SM + Int., PM + Int., and DCBM + Int., respectively. These results hold when considering the AUC-ROC, where the combined model was able to boost the performance by at least 10% with respect to the basic models that include the intensity. As an example, in the Apache Xalan 2.5 project the best stand-alone model ("SM + Int." in this case) had an F-Measure close to 73%, while the mixture of features provided by the combined model allowed it to reach an F-Measure of 93%. As expected, the results are statistically significant, as the devised smell-aware change prediction model appeared in the top Scott-Knott ESD rank in 98% of the cases.

On the one hand, these strong results confirm previous findings on the importance of combining different predictors of source code maintainability [20, 23]. On the other hand, we can claim that smell-related information [85, 110] improves the capabilities of change prediction models, allowing them to perform much better than other existing models.

RQ4 - What is the performance of a combined change prediction model that includes smell-related information? The devised smell-aware change prediction model performs better than all the baseline approaches considered in the paper, with an F-Measure up to 20% higher.

5 Threats to Validity

In this section we discuss possible threats affecting our results and how we mitigated them.

5.1 Threats to Construct Validity

As for threats related to the relationship between theory and observation, they are mainly related to the independent variables used and the dataset exploited. As for the former, we selected state of the art change prediction models based on different sets of basic features, i.e., structural, process, and developer-related metrics, that capture different characteristics of source code. The selection was mainly driven by recent results [20] showing that the considered models are (i) accurate in the detection of the change-proneness of classes and (ii) complementary to each other, thus correctly identifying different sets of change-prone classes. All in all, this selection process allowed us to test the contribution of smell-related information in different contexts. As for the dataset, we relied on a publicly available source previously built [85]. Of course, we cannot exclude possible imprecisions in the computation of the dependent variable. Still in this category, we adopted JCodeOdor [32] to identify code smells and assign them a level of intensity. Our choice was driven by previous results [85] that showed, on the same dataset, that this tool has a high accuracy (i.e., F-Measure=80%). Despite this performance, the tool still identifies 154 false positives and 94 false negatives among the 43 considered systems. To make the set of code smells as close as possible to the ground truth, in our study we manually elaborated on the output of the tool by (i) setting to zero the intensity index of the false positive instances, and (ii) discarding the false negatives, i.e., the instances for which we could not assign an intensity value. Since this manual process is not always feasible, we also evaluated the effect of including false positive and false negative instances in the construction of the change prediction models. More specifically, we re-ran the analyses performed in Section 3 and validated the performance of the experimented models when including the false positive instances, using the same metrics used to assess the performance of the other prediction models (i.e., F-Measure and AUC-ROC). Our results report that these models always perform better than models that do not include any smell-related information, while they are slightly less performing (-3% in terms of median F-Measure) than those built discarding the false positive instances.
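For illustration, the manual correction described above could be implemented along the following lines; the column names and the validated sets are hypothetical placeholders, not the study's actual scripts:

```python
# Sketch of the correction described above, under assumed column names
# ('class', 'intensity') and assumed sets of manually validated false
# positives / false negatives.
import pandas as pd

def correct_smell_data(df: pd.DataFrame,
                       false_positives: set,
                       false_negatives: set) -> pd.DataFrame:
    df = df.copy()
    # (i) false positives: keep the class but neutralise its intensity
    df.loc[df["class"].isin(false_positives), "intensity"] = 0.0
    # (ii) false negatives: no intensity can be assigned, so drop them
    return df[~df["class"].isin(false_negatives)]
```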

In the second place, we evaluated the impact of including false negative instances. Their intensity index is, by definition, equal to zero: as a consequence, they were treated in the same way as non-smelly classes. The results of our analyses showed that the intensity-including models still produce better results than the baselines, boosting their median F-Measure by ≈4%. At the same time, we observed a decrease of 2% in terms of F-Measure with respect to the performance obtained by the prediction models built discarding false negatives.

As a final step, we also considered the case where both false positives and false negatives are incorporated in the experimented models. Our findings report that the Basic + Intensity models have a median F-Measure 2% lower than the models where the false positive and false negative instances were filtered out. At the same time, they were still better than the basic models (median F-Measure=+6%). Thus, we can conclude that a fully automatic code smell detection process still provides better performance than existing change prediction models. This result is, in our opinion, extremely valuable as it indicates that practitioners can adopt automatic code smell detectors without the need to manually investigate the candidates they give as output.

It is worth remarking that the choice of considering code smell severity rather than the simple presence/absence of smells was driven by the conjecture that severity gives more fine-grained information on how harmful a design problem is for a certain source code class. To verify this conjecture, we conducted a further analysis aimed at establishing the performance of the experimented models when considering a boolean value reporting the presence of code smells rather than their intensity. As expected, our findings reported that the models relying on the intensity were more powerful than those based on the boolean indication of smell presence. This further confirms the idea behind this paper, i.e., code smell intensity can improve change-proneness prediction.

Our observations might still have been threatened by the presence of a large number of code smell co-occurrences [86, 125], which might have biased the intensity level of the smelly classes in our dataset. To account for this aspect, we measured the percentage of classes in our dataset affected by more than one smell: we found that only 8% of the classes, on average, contained more than one code smell. As a consequence, we can claim that the problem of co-occurrence is rather limited in our study.

A further threat to construct validity relates to the dataset exploited in our empirical study. In this regard, we relied on publicly available oracles from the PROMISE repository [67]; however, we performed some preliminary operations to ensure data quality and robustness, by employing the data preprocessing algorithm implementing the guidelines by Shepperd et al. [103] aimed at removing noisy and/or erroneous data items.

5.2 Threats to Conclusion Validity

Threats to conclusion validity refer to the relation between treatment and outcome. When evaluating the change prediction models we computed well-established metrics such as F-Measure and AUC-ROC. Furthermore, we statistically verified the differences in the performance achieved by the different experimented models using the Scott-Knott statistical test [98] and Cliff's Delta [40] for measuring the effect size. Moreover, we analyzed to what extent the intensity index is important with respect to the other metrics by analyzing the gain provided by the addition of the severity measure to the model. Finally, to ensure that the experimented models did not suffer from multi-collinearity, we adopted the variance inflation factor function [74] to discard non-relevant variables from the considered features.

Still in this category, there is a possible threat related to the validation methodology exploited. As shown by Tantithamthavorn et al. [114], ten-fold cross validation might provide unstable results because of the effect of random splitting. To deal with this issue, we repeated the 10-fold cross validation 100 times: in this way, we drastically reduced the bias due to the validation strategy.
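As a rough illustration of this validation scheme (not the study's exact pipeline; the classifier and the synthetic data below are placeholders), repeated stratified 10-fold cross validation can be set up as follows:

```python
# Sketch of 10-fold cross validation repeated 100 times; the data and the
# classifier stand in for the per-release metric tables used in the study.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data standing in for a release's predictor matrix and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1", cv=cv)
print(f"median F1 over {len(scores)} folds: {sorted(scores)[len(scores)//2]:.2f}")
```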

5.3 Threats to External Validity

Threats in this category mainly concern the generalization of results. In this respect, we analyzed a large set of 43 releases of 14 software systems coming from different application domains and with different characteristics (size, number of classes, etc.). Another threat in this category regards the choice of the baseline models: we evaluated the contribution of smell-related information in the context of change prediction models widely used in the past [20, 27, 128] that take into account predictors of a different nature, i.e., product, process, and developer-related metrics. However, we are aware that our study is based on systems developed in Java only, and therefore future investigations aimed at corroborating our findings on a different set of systems would be worthwhile.

6 Conclusion

Change-prone classes represent code components that tend to change more often, because of their importance for the business logic of a system or because they were not properly designed by developers. To help practitioners keep maintenance and evolution activities under control, many change prediction models have been defined in the literature [20, 27, 128]. Following the findings of previous studies [52, 83] reporting the impact of code smells on the change-proneness of classes, in this paper we investigated the contribution of smell-related information to the prediction of change-prone classes. We first conducted a large empirical study on 43 releases of 14 software systems and evaluated the contribution of the intensity index proposed by Arcelli Fontana et al. [4] within existing change prediction models based on structural [128], process [27], and developer-related [24] metrics. We also compared the gain provided by the intensity index with that given by the so-called antipattern metrics [110], i.e., metrics capturing historical aspects of code smells.

The results indicated how the addition of the intensity index as a predictor of change-prone components increases the performance of baseline change prediction models by an average of 10% in terms of F-Measure (RQ1). Moreover, the intensity index can boost the performance of such models more than state of the art smell-related metrics such as those defined by Taba et al. [110], even though we observed some complementarities between the models exploiting different information on code smells (RQ2). Based on these results, we built a combined smell-aware change prediction model that takes into account product, process, developer-, and smell-related information (RQ4). The key results showed that the combined model provides a consistent boost in terms of F-Measure, of up to 20% more.

Our findings represent the main input for our future research agenda: we first aim at further testing the usefulness of the devised model in an industrial setting. Furthermore, we plan to perform a fine-grained analysis of the role that each smell type independently plays in change prediction.

References

1. Abbes M, Khomh F, Gueheneuc YG, Antoniol G (2011) An empiri-cal study of the impact of two antipatterns, blob and spaghetti code,on program comprehension. In: Software maintenance and reengineering(CSMR), 2011 15th European conference on, IEEE, pp 181–190

2. Abdi M, Lounis H, Sahraoui H (2006) Analyzing change impact in object-oriented systems. In: 32nd EUROMICRO Conference on Software Engi-neering and Advanced Applications (EUROMICRO’06), IEEE, pp 310–319

3. Aniche M, Treude C, Zaidman A, van Deursen A, Gerosa M (2016) Satt:Tailoring code metric thresholds for different software architectures. In:2016 IEEE 16th IEEE International Working Conference on Source CodeAnalysis and Manipulation (SCAM)

4. Arcelli Fontana F, Ferme V, Zanoni M, Roveda R (2015) Towards a pri-oritization of code debt: A code smell intensity index. In: Proceedings ofthe Seventh International Workshop on Managing Technical Debt (MTD


2015), IEEE, Bremen, Germany, pp 16–24, in conjunction with ICSME2015

5. Arcelli Fontana F, Mantyla MV, Zanoni M, Marino A (2016) Comparing and experimenting machine learning techniques for code smell detection. Empirical Software Engineering 21(3):1143–1191, DOI 10.1007/s10664-015-9378-4, URL http://dx.doi.org/10.1007/s10664-015-9378-4

6. Arisholm E, Briand LC, Foyen A (2004) Dynamic coupling measurementfor object-oriented software. IEEE Transactions on Software Engineering30(8):491–506

7. Bacchelli A, Bird C (2013) Expectations, outcomes, and challenges ofmodern code review. In: Proc. of the International Conference on Soft-ware Engineering (ICSE), IEEE, pp 712–721

8. Baeza-Yates R, Ribeiro-Neto B (1999) Modern Information Retrieval.Addison-Wesley

9. Baeza-Yates R, Ribeiro-Neto B, et al (1999) Modern information re-trieval, vol 463. ACM press New York

10. Baeza-Yates RA, Ribeiro-Neto B (1999) Modern Information Retrieval.Addison-Wesley Longman Publishing Co., Inc.

11. Bansiya J, Davis CG (2002) A hierarchical model for object-orienteddesign quality assessment. IEEE Trans Softw Eng 28(1):4–17

12. Basili VR, Briand LC, Melo WL (1996) A validation of object-orienteddesign metrics as quality indicators. IEEE Trans Softw Eng 22(10):751–761

13. Bell RM, Ostrand TJ, Weyuker EJ (2013) The limited impact of individ-ual developer data on software defect prediction. Empirical Softw Engg18(3):478–505

14. Bieman JM, Straw G, Wang H, Munger PW, Alexander RT (2003) Designpatterns and change proneness: an examination of five evolving systems.In: Proc. Int’l Workshop on Enterprise Networking and Computing inHealthcare Industry, pp 40–49, DOI 10.1109/METRIC.2003.1232454

15. Bottou L, Vapnik V (1992) Local learning algorithms. Neural Comput4(6):888–900

16. Boussaa M, Kessentini W, Kessentini M, Bechikh S, Ben Chikha S (2013)Competitive coevolutionary code-smells detection. In: Search Based Soft-ware Engineering, Lecture Notes in Computer Science, vol 8084, SpringerBerlin Heidelberg, pp 50–65

17. Briand LC, Wust J, Lounis H (1999) Using coupling measurement for im-pact analysis in object-oriented systems. In: Proc. Int’l Conf. on SoftwareMaintenance (ICSM), IEEE, pp 475–482

18. Catolino G, Ferrucci F (2018) Ensemble techniques for software changeprediction: A preliminary investigation. In: Machine Learning Techniquesfor Software Quality Evaluation (MaLTeSQuE), 2018 IEEE Workshop on,IEEE, pp 25–30

19. Catolino G, Palomba F, Arcelli Fontana F, De Lucia A, Ferrucci F, Zaidman A (2018) Improving change prediction models with code smell-related information - replication package - https://figshare.com/s/f536bb37f3790914a32a

20. Catolino G, Palomba F, De Lucia A, Ferrucci F, Zaidman A (2018) En-hancing change prediction models using developer-related factors. Jour-nal of Systems and Software

21. le Cessie S, van Houwelingen J (1992) Ridge estimators in logistic regres-sion. Applied Statistics 41(1):191–201

22. D’Ambros M, Bacchelli A, Lanza M (2010) On the impact of design flawson software defects. In: Quality Software (QSIC), 2010 10th InternationalConference on, pp 23–31, DOI 10.1109/QSIC.2010.58

23. DAmbros M, Lanza M, Robbes R (2012) Evaluating defect prediction ap-proaches: a benchmark and an extensive comparison. Empirical SoftwareEngineering 17(4):531–577

24. Di Nucci D, Palomba F, De Rosa G, Bavota G, Oliveto R, De Lucia A(2017) A developer centered bug prediction model. IEEE Trans on SoftwEngineering p to appear

25. Di Nucci D, Palomba F, Tamburri DA, Serebrenik A, De Lucia A (2018)Detecting code smells using machine learning techniques: are we thereyet? In: 2018 IEEE 25th International Conference on Software Analysis,Evolution and Reengineering (SANER), IEEE, pp 612–621

26. Di Penta M, Cerulo L, Gueheneuc YG, Antoniol G (2008) An empiricalstudy of the relationships between design pattern roles and class changeproneness. In: Proc. Int’l Conf. on Software Maintenance (ICSM), IEEE,pp 217–226, DOI 10.1109/ICSM.2008.4658070

27. Elish MO, Al-Rahman Al-Khiaty M (2013) A suite of metrics for quanti-fying historical changes to predict future change-prone classes in object-oriented software. Journal of Software: Evolution and Process 25(5):407–437

28. Eski S, Buzluca F (2011) An empirical study on object-oriented metricsand software evolution in order to reduce testing costs by predictingchange-prone classes. In: Proc. Int’l Conf Software Testing, Verificationand Validation Workshops (ICSTW), IEEE, pp 566–571

29. Fluri B, Wuersch M, PInzger M, Gall H (2007) Change distilling: Treedifferencing for fine-grained source code change extraction. IEEE Trans-actions on software engineering 33(11)

30. Fontana FA, Zanoni M (2017) Code smell severity classification usingmachine learning techniques. Knowledge-Based Systems 128:43–58

31. Fontana FA, Zanoni M, Marino A, Mantyla MV (2013) Code smell detec-tion: Towards a machine learning-based approach. In: Software Mainte-nance (ICSM), 2013 29th IEEE International Conference on, pp 396–399,DOI 10.1109/ICSM.2013.56

32. Fontana FA, Ferme V, Zanoni M, Roveda R (2015) Towards a priori-tization of code debt: A code smell intensity index. In: 2015 IEEE 7thInternational Workshop on Managing Technical Debt (MTD), IEEE, pp16–24


33. Fontana FA, Ferme V, Zanoni M, Yamashita A (2015) Automatic metricthresholds derivation for code smell detection. In: Proceedings of theSixth international workshop on emerging trends in software metrics,IEEE Press, pp 44–53

34. Fontana FA, Dietrich J, Walter B, Yamashita A, Zanoni M (2016) An-tipattern and code smell false positives: Preliminary conceptualizationand classification. In: 2016 IEEE 23rd International Conference on Soft-ware Analysis, Evolution, and Reengineering (SANER), vol 1, pp 609–613, DOI 10.1109/SANER.2016.84

35. Fowler M, Beck K, Brant J, Opdyke W, Roberts D (1999) Refactoring:Improving the Design of Existing Code. Addison-Wesley

36. Gatrell M, Counsell S (2015) The effect of refactoring on change and fault-proneness in commercial C# software. Science of Computer Programming 102(0):44–56, DOI http://dx.doi.org/10.1016/j.scico.2014.12.002, URL http://www.sciencedirect.com/science/article/pii/S0167642314005711

37. Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of clas-sification techniques on the performance of defect prediction models. In:Proceedings of the International Conference on Software Engineering,IEEE, pp 789–800

38. Girba T, Ducasse S, Lanza M (2004) Yesterday’s weather: Guiding earlyreverse engineering efforts by summarizing the evolution of changes. In:Proc. Int’l Conf. Softw. Maintenance (ICSM), IEEE, pp 40–49

39. Grissom RJ, Kim JJ (2005) Effect sizes for research: A broad practicalapproach, 2nd edn. Lawrence Earlbaum Associates

40. Grissom RJ, Kim JJ (2005) Effect sizes for research: A broad practicalapproach, 2nd edn. Lawrence Earlbaum Associates

41. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: An update. SIGKDD Explor Newsl 11(1):10–18, DOI 10.1145/1656274.1656278, URL http://doi.acm.org/10.1145/1656274.1656278

42. Hall T, Beecham S, Bowes D, Gray D, Counsell S (2011) Developing fault-prediction models: What the research can show industry. IEEE Software28(6):96–99

43. Han AR, Jeon SU, Bae DH, Hong JE (2008) Behavioral dependency mea-surement for change-proneness prediction in uml 2.0 design models. In:32nd Annual IEEE International Computer Software and ApplicationsConference, IEEE, pp 76–83

44. Han AR, Jeon SU, Bae DH, Hong JE (2010) Measuring behavioral de-pendency for improving change-proneness prediction in uml-based designmodels. Journal of Systems and Software 83(2):222–234

45. Hassan AE (2009) Predicting faults using the complexity of code changes.In: Int’l Conf. Software Engineering (ICSE), IEEE, pp 78–88

46. John GH, Langley P (1995) Estimating continuous distributions inbayesian classifiers. In: Eleventh Conference on Uncertainty in ArtificialIntelligence, Morgan Kaufmann, San Mateo, pp 338–345


47. Kennedy J (2011) Particle swarm optimization. In: Encyclopedia of ma-chine learning, Springer, pp 760–766

48. Kessentini M, Vaucher S, Sahraoui H (2010) Deviance from perfection isa better criterion than closeness to evil when identifying risky code. In:Proceedings of the IEEE/ACM International Conference on AutomatedSoftware Engineering, ACM, ASE ’10, pp 113–122

49. Kessentini W, Kessentini M, Sahraoui H, Bechikh S, Ouni A (2014) Acooperative parallel search-based software engineering approach for code-smells detection. IEEE Transactions on Software Engineering 40(9):841–861, DOI 10.1109/TSE.2014.2331057

50. Khomh F, Di Penta M, Gueheneuc YG (2009) An exploratory study ofthe impact of code smells on software change-proneness. In: 2009 16thWorking Conference on Reverse Engineering, IEEE, pp 75–84

51. Khomh F, Vaucher S, Gueheneuc YG, Sahraoui H (2009) A bayesianapproach for the detection of code and design smells. In: Proceedings ofthe International Conference on Quality Software (QSIC), IEEE, HongKong, China, pp 305–314

52. Khomh F, Di Penta M, Gueheneuc YG, Antoniol G (2012) An ex-ploratory study of the impact of antipatterns on class change-and fault-proneness. Empirical Softw Engg 17(3):243–275

53. Kohavi R (1995) The power of decision tables. In: 8th European Confer-ence on Machine Learning, Springer, pp 174–189

54. Kumar L, Behera RK, Rath S, Sureka A (2017) Transfer learning forcross-project change-proneness prediction in object-oriented software sys-tems: A feasibility analysis. ACM SIGSOFT Software Engineering Notes42(3):1–11

55. Kumar L, Rath SK, Sureka A (2017) Empirical analysis on effectivenessof source code metrics for predicting change-proneness. In: ISEC, pp 4–14

56. Lanza M, Marinescu R (2006) Object-Oriented Metrics in Practice: UsingSoftware Metrics to Characterize, Evaluate, and Improve the Design ofObject-Oriented Systems. Springer

57. Le Cessie S, Van Houwelingen JC (1992) Ridge estimators in logisticregression. Applied statistics pp 191–201

58. Lehman MM, Belady LA (eds) (1985) Program Evolution: Processes ofSoftware Change. Academic Press Professional, Inc.

59. Lu H, Zhou Y, Xu B, Leung H, Chen L (2012) The ability of object-oriented metrics to predict change-proneness: a meta-analysis. Empiricalsoftware engineering 17(3):200–242

60. Malhotra R, Bansal A (2015) Predicting change using software metrics:A review. In: Int’l Conf. on Reliability, Infocom Technologies and Opti-mization (ICRITO), IEEE, pp 1–6

61. Malhotra R, Khanna M (2013) Investigation of relationship betweenobject-oriented metrics and change proneness. International Journal ofMachine Learning and Cybernetics 4(4):273–286

62. Malhotra R, Khanna M (2014) A new metric for predicting softwarechange using gene expression programming. In: Proc. Int’l Workshop on


Emerging Trends in Software Metrics, ACM, pp 8–14
63. Malhotra R, Khanna M (2017) Software change prediction using voting particle swarm optimization based ensemble classifier. In: Proc. of the Genetic and Evolutionary Computation Conference Companion, ACM, pp 311–312

64. Marinescu C (2014) How good is genetic programming at predictingchanges and defects? In: Int’l Symp. on Symbolic and Numeric Algo-rithms for Scientific Computing (SYNASC), IEEE, pp 544–548

65. Marinescu R (2004) Detection strategies: Metrics-based rules for detect-ing design flaws. In: Proceedings of the International Conference on Soft-ware Maintenance (ICSM), pp 350–359

66. Marinescu R (2012) Assessing technical debt by identifying design flawsin software systems. IBM Journal of Research and Development 56(5):9–1

67. Menzies T, Caglayan B, Kocaguneli E, Krall J, Peters F, Turhan B (2012)The promise repository of empirical software engineering data

68. Miryung Kim NN Tom Zimmermann (2014) An empirical study of refac-toring challenges and benefits at Microsoft. IEEE Transactions on Soft-ware Engineering 40

69. Mkaouer MW, Kessentini M, Bechikh S, Cinneide MO (2014) A ro-bust multi-objective approach for software refactoring under uncertainty.In: International Symposium on Search Based Software Engineering,Springer, pp 168–183

70. Moha N, Gueheneuc YG, Duchien L, Meur AFL (2010) Decor: A methodfor the specification and detection of code and design smells. IEEE Trans-actions on Software Engineering 36(1):20–36

71. Morales R, Soh Z, Khomh F, Antoniol G, Chicano F (2016) On the use ofdevelopers’ context for automatic refactoring of software anti-patterns.Journal of Systems and Software (JSS)

72. Munro MJ (2005) Product metrics for automatic identification of “badsmell” design problems in java source-code. In: Proceedings of the Inter-national Software Metrics Symposium (METRICS), IEEE, p 15

73. Murphy-Hill E, Black AP (2010) An interactive ambient visualizationfor code smells. In: Proceedings of the 5th international symposium onSoftware visualization, ACM, pp 5–14

74. O’brien RM (2007) A caution regarding rules of thumb for variance in-flation factors. Quality & Quantity 41(5):673–690

75. Olbrich SM, Cruzes DS, Sjberg DIK (2010) Are all code smells harmful?a study of god classes and brain classes in the evolution of three opensource systems. In: in: Intl Conf. Softw. Maint, pp 1–10

76. Oliveto R, Khomh F, Antoniol G, Gueheneuc YG (2010) Numerical sig-natures of antipatterns: An approach based on B-splines. In: Proceedingsof the European Conference on Software Maintenance and Reengineering(CSMR), IEEE, pp 248–251

77. Palomba F, Zaidman A (2017) Does refactoring of test smells inducefixing flaky tests? In: Software Maintenance and Evolution (ICSME),2017 IEEE International Conference on, IEEE, pp 1–12


78. Palomba F, Bavota G, Di Penta M, Oliveto R, De Lucia A, PoshyvanykD (2013) Detecting bad smells in source code using change history in-formation. In: Automated software engineering (ASE), 2013 IEEE/ACM28th international conference on, IEEE, pp 268–278

79. Palomba F, Bavota G, Di Penta M, Oliveto R, De Lucia A (2014) Do theyreally smell bad? a study on developers’ perception of bad code smells. In:Software maintenance and evolution (ICSME), 2014 IEEE internationalconference on, IEEE, pp 101–110

80. Palomba F, Bavota G, Di Penta M, Oliveto R, Poshyvanyk D, De LuciaA (2015) Mining version histories for detecting code smells. IEEE Trans-actions on Software Engineering 41(5):462–489, DOI 10.1109/TSE.2014.2372760

81. Palomba F, Lucia AD, Bavota G, Oliveto R (2015) Anti-pattern de-tection: Methods, challenges, and open issues. Advances in Computers95:201–238, DOI 10.1016/B978-0-12-800160-8.00004-8

82. Palomba F, Panichella A, Zaidman A, Oliveto R, De Lucia A (2016)A textual-based technique for smell detection. In: Proceedings of the24th International Conference on Program Comprehension (ICPC 2016),IEEE, Austin, USA, p to appear

83. Palomba F, Bavota G, Di Penta M, Fasano F, Oliveto R, De Lucia A(2017) On the diffuseness and the impact on maintainability of codesmells: a large scale empirical investigation. Empirical Software Engi-neering pp 1–34

84. Palomba F, Panichella A, Zaidman A, Oliveto R, De Lucia A (2017) Thescent of a smell: An extensive comparison between textual and structuralsmells. Transactions on Software Engineering

85. Palomba F, Zanoni M, Fontana FA, De Lucia A, Oliveto R (2017) To-ward a smell-aware bug prediction model. IEEE Transactions on SoftwareEngineering

86. Palomba F, Bavota G, Di Penta M, Fasano F, Oliveto R, De Lucia A(2018) A large-scale empirical study on the lifecycle of code smell co-occurrences. Information and Software Technology 99:1–10

87. Palomba F, Zaidman A, De Lucia A (2018) Automatic Test Smell Detec-tion using Information Retrieval Techniques. In: International Conferenceon Software Maintenance and Evolution (ICSME), IEEE, p to appear

88. Parnas DL (1994) Software aging. In: Proc. of the International Confer-ence on Software Engineering (ICSE), IEEE, pp 279–287

89. Peer A, Malhotra R (2013) Application of adaptive neuro-fuzzy inferencesystem for predicting software change proneness. In: Advances in Com-puting, Communications and Informatics (ICACCI), 2013 InternationalConference on, IEEE, pp 2026–2031

90. Peng CYJ, Lee KL, Ingersoll GM (2002) An introduction to logistic re-gression analysis and reporting. The Journal of Educational Research96(1):3–14

91. Peters R, Zaidman A (2012) Evaluating the lifespan of code smells usingsoftware repository mining. In: Software Maintenance and Reengineering


(CSMR), 2012 16th European Conference on, IEEE, pp 411–416
92. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106, DOI 10.1023/A:1022643204877, URL http://dx.doi.org/10.1023/A:1022643204877

93. Ratiu D, Ducasse S, Gırba T, Marinescu R (2004) Using history informa-tion to improve design flaws detection. In: Proceedings of the EuropeanConference on Software Maintenance and Reengineering (CSMR), IEEE,pp 223–232

94. Romano D, Pinzger M (2011) Using source code metrics to predictchange-prone java interfaces. In: Proc. Int’l Conf. Software Maintenance(ICSM), IEEE, pp 303–312

95. Rosenblatt F (1961) Principles of Neurodynamics: Perceptrons and theTheory of Brain Mechanisms. Spartan Books

96. Rumbaugh J, Jacobson I, Booch G (2004) Unified Modeling LanguageReference Manual, The (2Nd Edition). Pearson Higher Education

97. Sahin D, Kessentini M, Bechikh S, Deb K (2014) Code-smell detectionas a bilevel problem. ACM Trans Softw Eng Methodol 24(1):6:1–6:44,DOI 10.1145/2675067

98. Scott AJ, Knott M (1974) A cluster analysis method for grouping meansin the analysis of variance. Biometrics 30:507–512

99. Sharafat AR, Tahvildari L (2007) A probabilistic approach to predictchanges in object-oriented software systems. In: Proc. Conf. on Softw.Maintenance and Reengineering (CSMR), IEEE, pp 27–38

100. Sharafat AR, Tahvildari L (2008) Change prediction in object-orientedsoftware systems: A probabilistic approach. Journal of Software 3(5):26–39

101. Shepperd M, Song Q, Sun Z, Mair C (2013) Data quality: Some commentson the nasa software defect datasets. IEEE Transactions on SoftwareEngineering 39(9):1208–1215

102. Shepperd M, Bowes D, Hall T (2014) Researcher bias: The use of machinelearning in software defect prediction. IEEE Transactions on SoftwareEngineering 40(6):603–616, DOI 10.1109/TSE.2014.2322358

103. Shepperd M, Bowes D, Hall T (2014) Researcher bias: The use of machinelearning in software defect prediction. IEEE Trans on Softw Engineering40(6):603–616

104. Sjoberg DI, Yamashita A, Anda BC, Mockus A, Dyba T (2013) Quanti-fying the effect of code smells on maintenance effort. IEEE Transactionson Software Engineering (8):1144–1156

105. Soetens QD, Demeyer S, Zaidman A, Perez J (2016) Change-based testselection: An empirical evaluation. Empirical Softw Engg 21(5):1990–2032

106. Song Q, Jia Z, Shepperd M, Ying S, Liu J (2011) A general softwaredefect-proneness prediction framework. IEEE Transactions on SoftwareEngineering 37(3):356–370, DOI 10.1109/TSE.2010.90

107. Spadini D, Palomba F, Zaidman A, Bruntink M, Bacchelli A (2018) OnThe Relation of Test Smells to Software Code Quality. In: International


Conference on Software Maintenance and Evolution (ICSME), IEEE, pto appear

108. Spinellis D (2005) Tool writing: A forgotten art? IEEE Software (4):9–11
109. Stone M (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological) pp 111–147

110. Taba SES, Khomh F, Zou Y, Hassan AE, Nagappan M (2013) Predictingbugs using antipatterns. In: Proceedings of the 2013 IEEE InternationalConference on Software Maintenance, IEEE Computer Society, Washing-ton, DC, USA, ICSM ’13, pp 270–279, DOI 10.1109/ICSM.2013.38, URLhttp://dx.doi.org/10.1109/ICSM.2013.38

111. Taibi D, Janes A, Lenarduzzi V (2017) How developers perceive smellsin source code: A replicated study. Information and Software Technology92:223–235

112. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models. In: Proceedings of the 38th International Conference on Software Engineering, ACM, New York, NY, USA, ICSE '16, pp 321–332, DOI 10.1145/2884781.2884857, URL http://doi.acm.org/10.1145/2884781.2884857

113. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016)Comments on researcher bias: The use of machine learning in software de-fect prediction. IEEE Transactions on Software Engineering 42(11):1092–1094, DOI 10.1109/TSE.2016.2553030

114. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2017) Anempirical comparison of model validation techniques for defect predic-tion models. IEEE Trans Softw Eng 43(1):1–18, DOI 10.1109/TSE.2016.2584050, URL https://doi.org/10.1109/TSE.2016.2584050

115. Tempero E, Anslow C, Dietrich J, Han T, Li J, Lumpe M, Melton H,Noble J (2010) The qualitas corpus: A curated collection of java code forempirical studies. In: Proc. 17th Asia Pacific Software Eng. Conf., IEEE,Sydney, Australia, pp 336–345, DOI 10.1109/APSEC.2010.46

116. Theodoridis S, Koutroumbas K (2008) Pattern recognition. IEEE Trans-actions on Neural Networks 19(2):376–376

117. Tsantalis N, Chatzigeorgiou A (2009) Identification of move methodrefactoring opportunities. IEEE Transactions on Software Engineering35(3):347–367

118. Tsantalis N, Chatzigeorgiou A, Stephanides G (2005) Predicting theprobability of change in object-oriented systems. IEEE Transactions onSoftware Engineering 31(7):601–614

119. Tufano M, Palomba F, Bavota G, Oliveto R, Di Penta M, De Lu-cia A, Poshyvanyk D (2015) When and why your code starts to smellbad. In: Proceedings of the 37th International Conference on SoftwareEngineering-Volume 1, IEEE Press, pp 403–414

120. Tufano M, Palomba F, Bavota G, Di Penta M, Oliveto R, De Lucia A,Poshyvanyk D (2016) An empirical investigation into the nature of test


smells. In: Proceedings of the 31st IEEE/ACM International Conferenceon Automated Software Engineering, pp 4–15

121. Tufano M, Palomba F, Bavota G, Oliveto R, Di Penta M, De Lucia A,Poshyvanyk D (2017) When and why your code starts to smell bad (andwhether the smells go away). IEEE Transactions on Software Engineeringp to appear.

122. Vidal S, Guimaraes E, Oizumi W, Garcia A, Pace AD, Marcos C (2016)On the criteria for prioritizing code anomalies to identify architecturalproblems. In: Proceedings of the 31st Annual ACM Symposium on Ap-plied Computing, ACM, pp 1812–1814

123. Vidal SA, Marcos C, Dıaz-Pace JA (2016) An approach to prioritize codesmells for refactoring. Automated Software Engineering 23(3):501–532

124. Y Freund LM (1999) The alternating decision tree learning algorithm.In: Proceeding of the Sixteenth International Conference on MachineLearning, pp 124–133

125. Yamashita A, Moonen L (2013) Exploring the impact of inter-smell re-lations on software maintainability: An empirical study. In: Proceedingsof the International Conference on Software Engineering (ICSE), IEEE,pp 682–691

126. Yamashita AF, Moonen L (2012) Do code smells reflect important main-tainability aspects? In: Proceedings of the International Conference onSoftware Maintenance (ICSM), IEEE, pp 306–315

127. Zhao L, Hayes JH (2011) Rank-based refactoring decision support: twostudies. Innovations in Systems and Software Engineering 7(3):171

128. Zhou Y, Leung H, Xu B (2009) Examining the potentially confound-ing effect of class size on the associations between object-oriented met-rics and change-proneness. IEEE Transactions on Software Engineering35(5):607–623