
Code Hot Spot: A Tool for Extraction and Analysis of Code Change History

Will Snipes and Brian Robinson Industrial Software Systems

ABB Corporate Research Raleigh, NC USA

Will.Snipes, Brian.P.Robinson at us.abb.com

Emerson Murphy-Hill North Carolina State University

Department of Computer Science Raleigh, NC USA

Emerson at csc.ncsu.edu

Commercial software development teams have limited time available to focus on improvements to their software. These teams need a way to quickly identify areas of the source code that would benefit from improvement, as well as quantifiable data to defend the selected improvements to management. Past research has shown that mining configuration management systems for change information can be useful in determining faulty areas of the code. We present a tool named Code Hot Spot, which mines change records out of Microsoft’s TFS configuration management system and creates a report of hot spots. Hot spots are contiguous areas of the code that have higher values of metrics that are indicators of faulty code. We present a study where we use this tool to study projects at ABB to determine areas that need improvement. The resulting data have been used to prioritize areas for additional code reviews and unit testing, as well as to identify change-prone areas in need of refactoring.

Keywords: Decision Support; Metrics; Refactoring; Verification; Quality; Configuration Management

I. INTRODUCTION

Commercial software development teams have limited time to spend on improvement activities. Consequently, these teams need to be able to identify where to focus their improvements and to justify the selected improvement areas to management. The selection of improvement areas should, ideally, be based on objective data from past and present development projects. In practice, many teams focus improvement on areas they personally believe to be most fault prone. In a previous study, we found that different stakeholders had different beliefs about which areas should be improved first [10]. These differences led to issues in selecting which improvement areas to focus on and were overcome by conducting a large-scale, manual classification of historical defects. While this approach may be very successful for large organizational improvements, the effort required to manually classify the historical defects is not feasible for development teams to use frequently. It also only focuses on defects and not on other code-level improvements, such as refactoring.

Development teams need a faster way to determine areas of source code that need improvement (i.e., are fault or change prone). Researchers have shown that data from open source configuration management systems are useful to identify source files that are fault-prone, generate models that predict defects, and assess the risk of a given change being defective [4, 7, 12, 14, 16]. Change records are created for each change in a product’s configuration management system (CMS) and, over the life of a software product, thousands of these records exist. Each of these records contains information about the product at that point in time, including who changed a module and why it was changed.

Analysis of these change records can yield information about quality issues in the software, such as hot spots. Code hot spots represent large concentrations in the source code of specific measures, such as historical defects, frequent code changes, or complexity. These hot spots may indicate areas of the source code that need improvement. For example, a hot spot of past post-release defects may represent code that is not tested adequately before release. Or, an area of code with a hot spot of frequent changes may indicate the need for refactoring.

In this paper, we introduce Code Hot Spot, a tool to mine change records out of code repositories in Team Foundation Server (TFS), a widely used commercial CMS developed by Microsoft, and to provide analysis to development teams on higher level issues and trends in the code. We use this tool to locate hot spots for specific measures in the source code of three major products at ABB and one open source product, Mozilla. We then study these hot spots to determine how well they represent issues that need improvement in the source code. Finally, we evaluate the tool to determine its usefulness to support decisions for improving areas of the product code.

The key contributions of this work are:

• A tool for mining CMS repositories, including TFS, that calculates change history metrics, and

• An evaluation of the usefulness of change history metrics for making improvement decisions in open source and closed source projects.

II. RELATED WORK

Research has shown that data extracted from change history records are useful for identifying defect-prone files or modules. Researchers in academia and industry use change history data to describe characteristics such as the relative fault-proneness of the modules.


A. Early Research

There is a long history of pursuing correlations of various data (i.e., complexity metrics and change metrics) with fault-prone files, going back to the 1990s in papers published by J. Hudepohl et al. on studies of large telecommunications software [3]. This line of research explored the application of modeling techniques and the unique challenges of predicting rare events (the occurrence of a customer-reported defect in a source module is rare) using change history and code complexity data. Our work extends this early work by adding new change history metrics and interfaces with TFS.

Pai and J. Bechta Dugan use the C-K Suite of metrics and a pure Bayesian analysis to predict fault-prone modules and fault content in modules [8]. The C-K Suite of metrics describes the inter-dependencies and architecture of the code. Our work uses change history metrics, rather than metrics that describe the code architecture, to identify fault-prone files.

Change history data can uncover dependencies between source files that are not apparent through static analysis or architecture analysis methods. A. T. T. Ying, et al. explore in [14] whether files that change together during maintenance have a dependency or relationship. This technique helps identify dependencies across languages that are not possible through static analysis. The researchers provide this input to developers by recommending other source files that may be relevant to the module currently under investigation. This is a different use of change history data than our purpose; however, it may be an area for future study.

B. Risk of a Specific Change

A more recently pursued objective is to determine the risk of a specific change or to determine the insertion point of a bug given the change history. S. Kim, J. Whitehead, and Y. Zhang perform a machine learning analysis of code deltas and change history using the code itself in delta form, complexity, change records, and defect classification [4]. They use Bayesian methods to separate prior information, like complexity, from the change itself. Their models determine whether a specific change is likely to contain an injected bug.

A. Mockus and D. M. Weiss show another example of evaluating the defect risk of a specific change in an industrial setting [7]. This paper describes a model that predicts which changes pose a risk for future defects. The key parameters in the model are the type of change (defect or new feature), developer experience, breadth of changes made (by number of affected files), and the size of the change in added, deleted, and unmodified code. The result targets defect removal efforts for frequent incremental maintenance updates or patches.

Our analysis seeks to identify files that have a high potential for defects in the future. The papers above seek to identify where a bug was inserted in the code or the risk of a specific change injecting a defect. Our approach does not consider aspects of a proposed or pending change to the software, but rather the characteristics of the file and its change history. Our approach is more static in nature, saying there is a high risk in a particular code area, rather than targeting a specific change instance.

C. Links to Developer Knowledge or Networks

A new area of research attempts to tie change history to personnel information. T. Fritz et al. [1, 2] explore developers’ experience in a Degree of Knowledge (DOK) model that expresses a developer's experience with a code element. The DOK model includes who authored (created) a code element and who viewed or edited code elements. DOK compares favorably with developers’ personal opinions of who the experts are for a particular code element.

D. Schuler and T. Zimmermann in [11] propose another method for identifying developers who are knowledgeable about an area of code. They analyze change history to determine which developers committed code and deeply inspect those changes to determine the methods that the developers either created or used. They find that writing a call to a method indicates that a developer has knowledge of the method being called, even if it is defined in a file that the developer did not modify. The advantage of this analysis is that files with few check-ins can have expertise assigned to them.

A. Meneely et al. investigate “developer networks” exploring the effects on code quality of multiple people in the organization participating in the maintenance and evolution of software files [6]. Thus far the research has provided some interesting results such as identifying certain “super” developers who, when connected with change history of a source file, significantly reduce the defect risk.

In contrast to this previous work [1, 2, 6, 11], our study seeks to define change history metrics for each file that correlate with defect risk, without considering aspects of developer collaboration or knowledge of the code. We do include a metric defined in [6] for the number of unique developers who modify a file.

D. Text Based Coupling

Studies of OO coupling and cohesion using text-based analysis of comments provide additional opportunities for correlating with defect history. B. Újházi, R. Ferenc, D. Poshyvanyk, and T. Gyimóthy analyze two coupling and cohesion metrics based on analysis of comment text and variables and compare them with many existing OO metrics [17]. A. Marcus, D. Poshyvanyk, and R. Ferenc introduce the concept of Conceptual Cohesion of Classes, using textual analysis of comments and variables to identify the connection between classes [18].

Although we are only considering a more limited set of metrics that are applicable to both OO and procedural programming styles, there may be an opportunity for future research to explore these coupling and cohesion metrics.

E. Published Tools

There have only been a few tools published that extract and analyze change history data. A tool named HATARI [12] by J. Sliwerski, T. Zimmermann, and A. Zeller assesses the risk of making changes to a slice of a program. The tool flags a change as risky based on the defect change history for each slice of code: it investigates the change history of the program slice to find previous instances of defect injections in the same code slice. The HATARI tool integrates with Eclipse and uses green and shades of red to indicate the degree of risk of changing a particular line of code based on the change history.

Another tool, ROSE [16] by T. Zimmermann et al., uses a change history to assess what other files may be coupled with the module being changed. ROSE uses change history to determine files with nearly simultaneous modifications in the change history. It then suggests the additional files to the developer as they investigate a particular file. ROSE seeks to prevent overlooking affected files when making modifications.

An Eclipse plug-in named APFEL [15] by T. Zimmermann mines CVS repositories to identify “fine-grained changes”. It tokenizes the changes made in the source code and makes comparisons for changes using these tokens. Case studies show APFEL is useful to identify crosscutting concerns where common functionality is scattered around the program, identify pairs of variables used together, and identify variable name changes.

The EMERALD tool [3] described by J. Hudepohl et al. identifies the risk of defects for a specific source file using complexity metrics in a predictive model. EMERALD analyzes the static complexity (i.e. Halstead metrics) of the code for historic releases and compares them with the defect rates for each file. The relationship, captured in a regression model, provides a risk of defect rating for existing and new files based on their measured complexity.

The Code Hot Spot tool, like EMERALD and HATARI, provides color-coded feedback to developers on defect risk. Code Hot Spot extends their capabilities to the Team Foundation Server environment. It does not analyze changes at the line level as HATARI and APFEL do for CVS and Eclipse. It does not include the analysis of files that change together that ROSE provides, but could provide this information to developers in the future. It does not identify cross-cutting concerns as APFEL does, and the data set does not include information on fine-grained changes.

III. APPROACH

Our research goal is to use change history data to assess the defect-prone source files within a product managed by Team Foundation Server. Although every configuration management system maintains change history, the format of the information provided by the CMS is not typically designed for creating change metrics. Therefore we developed a tool called Code Hot Spot to extract, analyze, and present change history metrics within the ecosystem of Microsoft’s development environment. The development of Code Hot Spot faced two key challenges. First, processing the change history data required tool support to interpret textual data and create useful metrics on change history for each file. Second, the presentation of the information must be engaging and easily interpreted by architects and developers.

Figure 1 below shows the three modules that make up Code Hot Spot: the extraction module, the analysis module, and the presentation module. Together, these modules address the challenges above.

Figure 1. Architecture of Code Hot Spot Tool

A. Data Extraction Algorithm

The extraction module, shown at top left in Figure 1, uses the TFS API to extract the change set records for an entire product. The extraction module queries change set data for each file in the product and stores them in a flat table format. A change set in TFS contains a unique record of a submission (or check-in) of one or more files to the TFS source repository. A change set record contains the following data:

• Unique identifier of the change set

• Branch and file path

• Committer: the developer making the change

• Owner: the developer responsible for the file

• Creation Date: the commit time in the CMS

• Comment: the comment text supplied by the developer describing the reason for the change

The TFS VersionControlServer class provides a rich API that includes the QueryHistory method to query the change history for an element in the CMS, including all subdirectories and files below that element. This powerful method returns change set data for all elements in the scope of the query, one record per change set. We translate the record for each change set and file into a flat format in which the file path and change set ID together uniquely identify a record. This format is passed to the analysis module to perform counting operations on the raw extracted data.
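As a rough illustration of this extraction step, the sketch below queries TFS with QueryHistory and flattens the result into one record per file and change set. It assumes the Microsoft.TeamFoundation client libraries and a reachable collection URL; the FileChangeRecord type and Extractor class are illustrative names, not part of the published tool.

```csharp
// Sketch only: assumes the TFS client assemblies (Microsoft.TeamFoundation.Client,
// Microsoft.TeamFoundation.VersionControl.Client) are referenced.
using System;
using System.Collections.Generic;
using Microsoft.TeamFoundation.Client;
using Microsoft.TeamFoundation.VersionControl.Client;

// Illustrative flat record: file path + change set id uniquely identify a row.
class FileChangeRecord
{
    public int ChangesetId;
    public string ServerPath;
    public string Committer;
    public string Owner;
    public DateTime CreationDate;
    public string Comment;
    public ChangeType ChangeType;
}

class Extractor
{
    public static IEnumerable<FileChangeRecord> ExtractHistory(Uri collectionUrl, string rootPath)
    {
        var collection = new TfsTeamProjectCollection(collectionUrl);
        var vcs = collection.GetService<VersionControlServer>();

        // QueryHistory returns one Changeset per check-in under rootPath (recursive).
        var history = vcs.QueryHistory(
            rootPath, VersionSpec.Latest, 0, RecursionType.Full,
            null, null, null, int.MaxValue,
            true /* includeChanges */, false /* slotMode */);

        foreach (Changeset cs in history)
        {
            // Flatten: emit one record per file touched by the change set.
            foreach (Change change in cs.Changes)
            {
                yield return new FileChangeRecord
                {
                    ChangesetId = cs.ChangesetId,
                    ServerPath = change.Item.ServerItem,
                    Committer = cs.Committer,
                    Owner = cs.Owner,
                    CreationDate = cs.CreationDate,
                    Comment = cs.Comment,
                    ChangeType = change.ChangeType
                };
            }
        }
    }
}
```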

For the Mozilla analysis, a customized output format from the Mercurial history command generates the history record needed by the Analysis Module. The import function of the Analysis Module then processes the Mozilla change history.

B. Calculation of Metrics

The analysis module in Figure 1 processes the raw change set history to extract metrics for each file in the system. Metrics are generated on a per-file basis so that the values for each file in the system can be compared.

We prototyped the analysis algorithm in the R [9] package for statistical computing before implementing it in C# for the presentation module. Using R allowed development of metrics in conjunction with exploring their correlation with the desired Count of Defect Changes metric. We built well-correlated metrics into the presentation module to aid developers.

Count of Changes: the number of change sets for the file. When calculating the Count of Changes for a file, several pre-processing steps filtered the expansive change history down to the relevant, meaningful changes to a file. The relevant changes include those where a person has edited the file. Many changes in the CMS were for merge or branch activities that mainly create a record of an automated operation to combine or diverge versions. Branch records are easily filtered by removing changes classified as Branch. Merge records sometimes include manual edits to the file and sometimes reference a defect ID in the change comments. When merge records included edits, they were kept for the analysis set; automated merges were discarded. The portion of merges with edits was less than 5% of the total person-made changes.

Count of Defect Changes is the number of change sets for a file that we could identify as defect fixes based on the comment text for the change set. To calculate the Count of Defect Changes metric we faced a text processing challenge where comment text must be parsed to determine whether a change was made to fix a defect. Addressing this challenge required establishing text processing rules using pattern matching. Each product has a specific defect tracking system with a unique identifier for each defect. In the comment text for the change, developers would put the Defect ID preceded by a word or symbol. Frequently that was something like “fixed #12345 …”. To further complicate matters, other numbers were present in the comment text that may have represented a TFS Changeset identifier with the same number of digits as a Defect ID. Therefore, the numbers extracted with pattern matching were matched with the list of defects from the defect tracking system. In one product, we compared the comment text with the defect title to confirm whether the change was actually for the cited defect.

In the Mozilla history, the developers consistently used the word “bug” to indicate a fix was being made and cited a defect identifier. Changes that contained the words “Merge” and “Branch” were filtered from these data. Another class of changes appeared frequently: the “backing out” of a change because it caused a failure. These changes were also filtered from these data because they are replaced by a second bug fix for the resulting failure. Overall in Mozilla, one third of the changes were filtered as Merge, Branch, or back-out.
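The comment-parsing step might look roughly like the sketch below. The regular expressions, the noise-word list, and the knownDefectIds set are illustrative assumptions; each product has its own defect tracking conventions, so the actual patterns used in the tool differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class DefectCommentParser
{
    // Illustrative patterns: "fixed #12345", "bug 12345", etc.
    static readonly Regex IdPattern = new Regex(@"(?:fix(?:ed)?|bug)\s*#?\s*(\d{4,7})",
                                                RegexOptions.IgnoreCase);
    // Merge, branch, and back-out records are treated as noise rather than defect fixes.
    static readonly Regex NoisePattern = new Regex(@"\b(merge|branch|backed\s+out|backout)\b",
                                                   RegexOptions.IgnoreCase);

    // Returns true when the change comment cites a known defect and is not a
    // merge/branch/back-out record; knownDefectIds comes from the defect tracker.
    public static bool IsDefectFix(string comment, ISet<int> knownDefectIds)
    {
        if (string.IsNullOrEmpty(comment) || NoisePattern.IsMatch(comment))
            return false;

        // Numbers that merely look like change set ids are rejected by the
        // cross-check against the defect tracking system's id list.
        return IdPattern.Matches(comment)
                        .Cast<Match>()
                        .Select(m => int.Parse(m.Groups[1].Value))
                        .Any(knownDefectIds.Contains);
    }
}
```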

Average Days Between Changes is the average number of days that elapse between the change sets of the file. It is calculated on a per-file basis as the average of the differences between the dates of consecutive changes.

The Days Since Last Change metric is the number of days between the date of the last change made to the file and the current date. The Days Since Last Defect Change is the difference between the current date and the date of the last defect change made to the file. When plotting correlations for Days Since Last Change and Days Since Last Defect Change, neither was correlated with the Count of Defect Changes or the Count of Changes. Therefore, these metrics are excluded from further discussion in the paper.

Count of Unique Committers for the file is the number of distinct developers who made changes to the file. This metric, defined in [6], was useful in that study as an indicator of change-prone files.
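Given the flattened records, the per-file metrics reduce to grouping and counting operations. The sketch below is illustrative and reuses the FileChangeRecord and DefectCommentParser names from the earlier sketches; it assumes branch records and purely automated merges have already been filtered out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative per-file metric container.
class FileMetrics
{
    public string ServerPath;
    public int CountOfChanges;
    public int CountOfDefectChanges;
    public int CountOfUniqueCommitters;
    public double AverageDaysBetweenChanges;
}

static class MetricCalculator
{
    public static IEnumerable<FileMetrics> Compute(
        IEnumerable<FileChangeRecord> records, ISet<int> knownDefectIds)
    {
        return records
            .GroupBy(r => r.ServerPath)
            .Select(g =>
            {
                var dates = g.Select(r => r.CreationDate).OrderBy(d => d).ToList();

                // Average of the gaps between consecutive changes (0 if only one change).
                double avgGap = dates.Count < 2 ? 0
                    : dates.Zip(dates.Skip(1), (a, b) => (b - a).TotalDays).Average();

                return new FileMetrics
                {
                    ServerPath = g.Key,
                    CountOfChanges = g.Count(),
                    CountOfDefectChanges = g.Count(r =>
                        DefectCommentParser.IsDefectFix(r.Comment, knownDefectIds)),
                    CountOfUniqueCommitters = g.Select(r => r.Committer).Distinct().Count(),
                    AverageDaysBetweenChanges = avgGap
                };
            });
    }
}
```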

C. Presentation of Results

The third challenge was providing an intuitive way to present the risk analysis to developers. The presentation module (the Visual Studio component in Figure 1) is a Visual Studio Add-In that delivers the risk assessment information to the developer. The presentation module allows the developer to query any file or directory in TFS and obtain the metrics and risk classification information for each file of interest. The presentation module classifies the meaning of the metrics for a particular file in the context of the system. It performs the classification according to the model developed in the analysis module. Section V describes an example classification.

We also explored visualization techniques to address the challenge of providing intuitive communication of analysis to developers. A visualization tool, Mosai Code [5], developed by researchers at Kent State University, displays several aspects of metric data simultaneously in the context of the product architecture. The example visualization in Figure 2 shows the number of changes in each file as a color in the histogram. Each file is represented by a small colored block and the organization of the code structure (by directory) is clustered into blocks separated by thin lines. The Mosai Code tool allows the developer to drill down into a directory block to see further clustering of the displayed metric. This visualization ties the metric information to the structure of the source code directory tree. Mosai Code is designed to help architects and developers visually relate to the change information underlying the color coding. When higher value colors (reds) cluster in groups, the real hot spots intuitively appear to the developer.

Figure 2. Mosai Code Visualization Tool


IV. STUDY DESIGN

We performed analysis on three ABB products and one open source product to determine which files are the most defect prone and which are the most change prone. We included metrics defined by Code Hot Spot as well as static metrics for size and complexity. The Code Hot Spot metrics were calculated on the changes since each file was inserted into the CMS. For all products, the change history for each file covers only the time since it was inserted into the current CMS. Changes that occurred prior (i.e., in CVS for Mozilla) are excluded from these data. Overall, these data represent about 2-3 years of the most recent history out of a 10+ year life for each product.

The Understand tool from Scientific Toolworks calculates the complexity metrics for the most recent release of the product in the CMS. A list of the metrics available in Understand can be found at the SCI Tools website1. Understand offers several definitions of Cyclomatic Complexity and various measures of size attributes for a file. In the Results section (Section V), we discuss analysis of metrics produced by Understand in the context of our objective.

We studied the histogram for each metric to determine its distribution. To explore related metrics, pair-wise scatter plots identify metrics that are good candidates to correlate with the Count of Changes and the Count of Defect Changes. A Spearman rank correlation determines the individual metrics with the best correlation to the Count of Defect Changes.
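Since Spearman rank correlation is simply Pearson correlation computed on ranks (with ties receiving the average of their positions), it can be computed without a statistics package. The following self-contained sketch is illustrative only; the tool itself prototyped these analyses in R.

```csharp
using System;
using System.Linq;

static class RankCorrelation
{
    // Assign 1-based ranks; tied values share the average of their positions.
    static double[] Rank(double[] x)
    {
        int n = x.Length;
        var order = Enumerable.Range(0, n).OrderBy(i => x[i]).ToArray();
        var ranks = new double[n];
        for (int i = 0; i < n; )
        {
            int j = i;
            while (j + 1 < n && x[order[j + 1]] == x[order[i]]) j++;
            double avgRank = (i + j) / 2.0 + 1;
            for (int k = i; k <= j; k++) ranks[order[k]] = avgRank;
            i = j + 1;
        }
        return ranks;
    }

    // Spearman rho = Pearson correlation of the two rank vectors.
    public static double Spearman(double[] a, double[] b)
    {
        double[] ra = Rank(a), rb = Rank(b);
        double ma = ra.Average(), mb = rb.Average();
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < ra.Length; i++)
        {
            cov += (ra[i] - ma) * (rb[i] - mb);
            va  += (ra[i] - ma) * (ra[i] - ma);
            vb  += (rb[i] - mb) * (rb[i] - mb);
        }
        return cov / Math.Sqrt(va * vb);
    }
}
```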

We identify thresholds of the correlated metrics that indicate an elevated risk of defects or an elevated number of changes. The thresholds for each metric are based on ranges of the distribution and the correlation with the Count of Defect Changes. The thresholds form a decision tree that provides a color-coded rating of each file’s risk. The threshold decision tree is evaluated for precision and recall against the Count of Defect Changes metric.

A. Limitations

a) Our tool supports only decisions about focusing verification efforts and selecting candidate areas for refactoring.

b) Data for each product do not represent the entire history of each file. At some point in time, files were migrated from another CMS to the one whose change history we analyze here. This is true for Mozilla as well where the change history represents only the changes that occurred in the Mercurial CMS (since v1.9). This may make the correlations with Count of Defect Changes less accurate because older files may have their defect history hidden.

c) Data available from this implementation are limited to the integrity of the information stored in the configuration management system. In this case, that consists of a unique identifier for each change, the timestamp of the change, the branch and file identifier that was changed, who changed it, and the comments supplied by the author.

For example, the reference of a change set to a reason for the change, such as a Defect ID, is not validated by the configuration management system. This lack of validation makes the link between defects and change sets somewhat unreliable, thus limiting the ability to identify fault-prone files.

V. RESULTS

Here we report Code Hot Spot’s analyses of the change history for three significant industrial projects at ABB, Inc. and one at Mozilla.

A. Distribution of Changes and Defect Changes

The bar chart in Figure 3 shows the distribution of files in each product by their Count of Changes. The Count of Changes on the X axis is the number of changes to a source file for the first 9 changes, with 10 and above in the last category. Note that these data include the history of the file since it was added to the CMS; therefore, the first change is the creation of the file in the CMS. The Y axis shows the percent of the total number of files that have the value shown on the X axis for Count of Changes.

The distribution of change frequency in the products is similar except for the percent of files with only 2 changes. Products A and B have many more files with only one change than Product C and Mozilla. We can see that the code base of Product A and Product B is more stable overall.

Although not shown in the figure, we were surprised by the magnitude of changes at the tail end of the distribution. In Mozilla, for instance, there are 16 files with more than 500 changes each. While the other products do not approach that range, a few files in those products did have a surprisingly high number of changes.

Figure 3. Frequency of Changes by File

1 SCI Tools website www.scitools.com


Figure 4 shows the distribution of the defect change count for each product. The Y axis is the percentage of the total number of files in each product that have the defect counts shown on the X axis. The X axis shows the category of the count of defect changes for any given file; the scale runs from 1 defect change to 9, with 10 or more in the last category. Except for the 10+ category, the percentages are not cumulative between categories. In these data, there is a significant difference between the percentage of files with 2 or more defects in Mozilla and in the other three products. Product C also has a higher percentage of files with 3 or more defect changes than the other products.

Figure 4. Frequency of Defects by File

B. Correlation Analysis

The Code Hot Spot analysis engine calculated all metrics, and the R program created a pair-wise scatter chart for all metrics. All scatter charts presented in this paper were produced by ggplot2 [13]. We discuss the metrics with moderately strong to strong correlations in this section, along with some metrics with weak correlations. The remaining metrics did not have strong correlations with the Count of Defect Changes for these products.

Figure 5 shows the correlation plot between the Average Days Between Changes and the Percent of Max Defect Changes in Files for Product A. The other products had similar plots for these variables, which are omitted for space reasons. There is a lower potential for defect fixes when the average days between changes is greater than 200. Although the correlation shown in Table I is weak and negative for most products, we consider it useful to place a limit on the number of expected defect fixes for a file once this metric is greater than 200. Other disciplines refer to this as an “escape velocity” concept: once a certain threshold of age is reached, the entity (living or not) has a longer than average life span. Perhaps software files have an escape velocity related to their Days Between Changes that, when exceeded, limits their total defect potential. An investigation of the files that change less frequently may yield useful characteristics of construction patterns that lead to more stable software.

Figure 5. Relationship with Days Between Changes for Product A

The threshold for Mozilla is even more pronounced; as shown in Figure 6, the cutoff for the average Days Between Changes could be less than 100.

Figure 6. Relationship with Average Days Between Changes for Mozilla


Figures 7 and 8 show the relationship between the Count of Unique Committers for a file and the Percent of Max Defect Changes in Files. Product C, shown in Figure 7, has a plot similar to the other products. The Spearman rank correlation shown in Table I ranges from 0.44 to 0.84 across the products, indicating that the Count of Unique Committers is weakly to moderately correlated with the defect changes for a file. For Product C, shown in Figure 7, the slope of the relationship is higher than for Mozilla, indicating that fewer developers are involved in the defect fix changes for the closed source software than for the open source software. For Mozilla, the correlation is higher, with an R-squared of 0.84. The Count of Unique Committers has a good R-squared value for all products, so it is a good candidate for inclusion in a defect risk prediction model. This is consistent with the findings by A. Meneely et al. in [6].

The effect of Unique Committers could be process related, particularly if there are separate maintenance and development teams. In Products A, B, and C there is no model of code ownership, meaning no one person is responsible for maintaining an area of the code. This promotes the idea that each additional change could have a different developer. Future work may explore whether the lack of a code ownership model leads to more defects than a model with strong single owner control of the code changes.

Figure 7. Relationship with Count of Unique Committers for Product C

Figure 8. Relationship with Count of Unique Committers for Mozilla

Next we present examples of metrics that have weak correlations with defect fix changes to files. Figure 9 below shows the percent of max defect changes as described by the Source Lines Of Code (SLOC) for the file. Figure 9 shows Product A; the other products have similar plots. Note that the correlation is around 0.35 (see Table I), so we may conclude that there is no strong relationship between the size of a file and the number of defect fixes it accrues.

There is a rather large outlier in these data, a file with more than 36000 SLOC that represents a configuration file for the GUI. Eliminating this outlier only raises the correlation slightly.

Figure 9. Percent of Defect Changes vs. SLOC (Size) for Product A


Figure 10. Percent of Defect Changes vs. SLOC (Size) for Mozilla

The correlation of SLOC for Mozilla is also very low; however, it improves if very large outliers are removed. If files with more than 10,000 SLOC are excluded from these data, the Spearman rank correlation becomes 0.6. So in some products, the size of the file may be related to the number of defect fixes it receives.

Normalizing by SLOC, however, has a negative effect on the correlations studied. When the number of defect fixes is normalized by the size of the file in Mozilla, all correlations in previous charts lose their strength.

Correlations of defect change counts with complexity metrics also have very low descriptive power. Of all the complexity metrics produced by Understand, the best correlation was around 0.30 for Product A and 0.27 for Mozilla with the metric Sum of Cyclomatic Complexity (Strict definition) for each file (see Table 1).

Figure 11. Percent of Defect Changes vs. Sum of Cyclomatic Complexity for Product A

Figure 12. Percent of Defect Changes vs. Sum of Cyclomatic Complexity for Product B

In the preceding Figures 11 and 12, we see the scatter plots for Products A and B for the Sum of Cyclomatic Complexity against the Defect Change Percent. In all the products, the pattern was similar to the correlations with SLOC. In fact, the Sum of Cyclomatic Complexity is well correlated with SLOC for each product, with Spearman rank R-squared values exceeding 0.8.

We explored correlations for other variants of Cyclomatic Complexity metric such as Average or Max of both standard and strict definitions. These were less correlated with SLOC; however, none were more strongly correlated than Sum of Cyclomatic Complexity.

TABLE I. SPEARMAN RANK CORRELATION VALUES FOR EACH PRODUCT’S PERCENT OF MAX DEFECT CHANGES IN FILES METRIC

Metric                                         Product A   Product B   Product C   Mozilla
Average Days Between Changes                      0.10       -0.27       -0.35      -0.35
Count of Unique Committers                        0.54        0.44        0.55       0.84
Source Lines Of Code (SLOC)                       0.35        0.22        0.27       0.38
Sum of Cyclomatic Complexity (Strict definition)  0.30        0.19        0.23       0.27

C. Risk Classification Model

To simplify the metrics into information developers can use to make decisions, we defined a set of thresholds that categorize the defect risk for each file. We chose a “stoplight” style risk rating of R, Y, or G (Red, Yellow, or Green), used by previous tools [12, 3], to communicate this decision information simply and directly.

The risk category is based on metrics that have strong correlations with the Count of Defect Changes for each file. In the preceding analysis, these metrics were the Number of Unique Developers and the Count of Changes. As suggested previously in the analysis, a threshold on the Average Days Between Changes is used to identify files that are not expected to accumulate a high number of defects. Classifications are evaluated using precision and recall to determine the best selection of thresholds for each product.

For Mozilla, the categorization thresholds are as follows (a code sketch of this decision rule appears after the list):

• G when Count of Changes is 1 or Average Days Between Changes > 200

• Y when Average Days Between Changes <= 200 and (Unique Developers <= 10 or Count of Changes <= 6)

• R when Average Days Between Changes < 200 and (Unique Developers > 10 or Count of Changes > 6)
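The sketch below writes the thresholds above as a short function. It is illustrative only: it reuses the FileMetrics fields from the earlier sketches (the published rule says “Unique Developers”, which we map to the Count of Unique Committers) and, where the published Yellow and Red conditions overlap, it lets Red take precedence, which is our assumption rather than something stated in the rules.

```csharp
// Stoplight classification for the Mozilla thresholds (sketch).
static string ClassifyRisk(FileMetrics m)
{
    // Green: a single change, or a long average gap between changes.
    if (m.CountOfChanges == 1 || m.AverageDaysBetweenChanges > 200)
        return "G";

    // Red: frequently changed files or files touched by many developers.
    if (m.AverageDaysBetweenChanges < 200 &&
        (m.CountOfUniqueCommitters > 10 || m.CountOfChanges > 6))
        return "R";

    // Yellow: the remaining files below the Red thresholds.
    return "Y";
}
```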

The precision and recall evaluation of this classification is shown in Table II. Thresholds for the desired values of the Count of Defect Changes are shown with each classification. The precision for Green is low; however, its recall is very high. The Green classification has many files with 1 defect change, so the precision improves to 0.8 when the desired threshold counts files with 1 defect as “Green”; in this evaluation, those files fall into the Yellow classification. The recall for Yellow and Red is low partly because of overlap between the classifications. A higher recall can be achieved if Yellow and Red are combined; however, the range of Count of Defect Changes for files classified as Yellow is much more limited than for those classified as Red. So, the classification model provides more information by separating the Yellow from the Red classified files.

TABLE II. PRECISION AND RECALL FOR MOZILLA CLASSIFICATION

Classification                 Precision   Recall
Green (0 defects)                  0.314     0.913
Yellow (>=1 and <4 defects)        0.664     0.293
Red (>=4 defects)                  0.942     0.551
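For completeness, the per-class precision and recall values in Table II can be computed as sketched below. This is a generic illustration, not code from the tool; the predicted classes come from a rule such as ClassifyRisk above, and the actual class of each file is derived from its Count of Defect Changes using the thresholds in the table.

```csharp
using System;
using System.Linq;

static class ClassificationEvaluation
{
    // Maps a file's Count of Defect Changes to its "desired" class per Table II.
    static string ActualClass(int defectChanges) =>
        defectChanges == 0 ? "G" : defectChanges < 4 ? "Y" : "R";

    // Precision = correct predictions of the class / all predictions of the class.
    // Recall    = correct predictions of the class / all files actually in the class.
    public static (double precision, double recall) Evaluate(
        string[] predicted, int[] defectChangeCounts, string cls)
    {
        var actual = defectChangeCounts.Select(ActualClass).ToArray();
        int truePos = predicted.Where((p, i) => p == cls && actual[i] == cls).Count();
        int predictedCls = predicted.Count(p => p == cls);
        int actualCls = actual.Count(a => a == cls);
        return (predictedCls == 0 ? 0 : (double)truePos / predictedCls,
                actualCls == 0 ? 0 : (double)truePos / actualCls);
    }
}
```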

Figure 13 below shows a sample report generated by Code Hot Spot’s Visual Studio report plug-in for a different ABB product using a different classification model. The sample report shows the number of changes, number of defect fixes, and number of unique developers for each file. The last column shows the colorized risk classification for the file based on the change history metrics.

When we apply the risk information in development decisions, higher risk files receive extra verification effort, such as code inspection and testing. When seeking ways to improve the software, high risk files are targeted for assessment of potential refactoring solutions. The refactoring targets restructuring the software to make it more maintainable so that defects become more apparent to developers.

Figure 13. Code Hot Spot Report

VI. CONCLUSION

The development of a tool to extract change history information from a TFS CMS succeeded in producing data that are useful for identifying hot spots in the code. Comparisons of metric distributions across multiple products show that the distributions are similar between closed source and open source products. The correlations of metrics for both closed source and open source products reveal metric indicators of hot spots that are common among the products studied. Static size and complexity metrics are not strongly correlated with the defect change counts for each file. A classification model using selected metrics from the change history shows good precision and recall for three levels of defect risk classification.

Developers can use the results in Code Hot Spot to make better decisions about handling change-prone software files. They can use the information to prioritize verification activities such as code inspection, allocating greater effort to risky files. Developers can also target code improvement efforts such as refactoring or reengineering to make the code easier to maintain.

Areas for future work include more sophisticated modeling methods to increase the accuracy of the model. We plan to research methods to assess the impact of a change that recommend related code to change or alternative implementation designs. Additional information, such as developer networks extracted from the change history, may provide further insights into the relationships with defect changes. We plan to conduct a more detailed study of the effects of the relationship between the number of unique developers and the number of defects under different code ownership models. A time-series analysis of the change history may help identify early warning signals that a file will receive a higher number of defect fixes.


ACKNOWLEDGMENTS

The authors wish to thank Dr. Vinay Augustine of ABB Corporate Research for his contributions to this work. We also wish to thank the product groups of ABB who contributed data for this study and the Industrial Software Systems research program at ABB.

REFERENCES

[1] T. Fritz, G. C. Murphy, and E. Hill, “Does a Programmer’s Activity Indicate Knowledge of Code?” in ESEC-FSE ’07: Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. New York, NY, USA: ACM, 2007, pp. 341–350.

[2] T. Fritz, J. Ou, G. C. Murphy, and E. Murphy-Hill, “A Degree-of-Knowledge Model to Capture Source Code Familiarity,” in ICSE ’10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering. New York, NY, USA: ACM, 2010, pp. 385–394.

[3] J. P. Hudepohl, S. J. Aud, T. M. Khoshgoftaar, E. B. Allen, and J. Mayrand, "Emerald: software metrics and models on the desktop," IEEE Software, vol. 13, no. 5, pp. 56-60, Sept 1996.

[4] S. Kim, J. Whitehead, and Y. Zhang, “Classifying Software Changes: Clean or Buggy?” IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.

[5] J. Maletic, D. Mosora, C. Newman, M. Collard, A. Sutton and B. Robinson. "MosaiCode: Visualizing Large Scale Software: A Tool Demonstration" in Proceedings of the International Workshop on Visualizing Software for Understanding and Analysis, 2011, To Appear.

[6] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in SIGSOFT '08/FSE-16: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. New York, NY, USA: ACM, 2008, pp. 13-23.

[7] A. Mockus and D. M. Weiss, “Predicting Risk of Software Changes,” Bell Labs Tech. J., vol. 5, no. 2, pp. 169–180, 2000.

[8] G. J. Pai and J. Bechta Dugan, “Empirical Analysis of Software Fault Content and Fault Proneness Using Bayesian Methods,” IEEE Transactions on Software Engineering, vol. 33, no. 10, pp. 675–686, October 2007.

[9] R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

[10] B. Robinson, P. Francis, and F. Ekdahl, “A Defect-Driven Process for Software Quality Improvement,” in ESEM 2008: Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, New York, NY, USA, 2008.

[11] D. Schuler and T. Zimmermann, “Mining Usage Expertise From Version Archives,” in MSR ’08: Proceedings of the 2008 international working conference on Mining software repositories. New York, NY, USA: ACM, 2008, pp. 121–124.

[12] J. Sliwerski, T. Zimmermann, and A. Zeller, “HATARI: Raising Risk Awareness,” Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering, pp. 107-110, 2005.

[13] H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

[14] A. T. T. Ying, G. C. Murphy, R. Ng, and M. C. Chu-Carroll, “Predicting Source Code Changes by Mining Change History,” IEEE Transactions on Software Engineering, vol. 30, no. 9, pp. 574–586, September 2004.

[15] T. Zimmermann, “Fine-Grained Processing of CVS Archives with APFEL,” in eclipse ’06: Proceedings of the 2006 OOPSLA workshop on eclipse technology eXchange. New York, NY, USA: ACM, 2006, pp. 16–20.

[16] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, “Mining Version Histories to Guide Software Changes,” IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 429–445, June 2005.

[17] B. Újházi, R. Ferenc, D. Poshyvanyk, and T. Gyimóthy, “New Conceptual Coupling and Cohesion Metrics for Object-Oriented Systems,” in Proceedings of the 10th IEEE Working Conference on Source Code Analysis and Manipulation (SCAM), Sept. 2010, pp. 33–42.

[18] A. Marcus, D. Poshyvanyk, and R. Ferenc, “Using the Conceptual Cohesion of Classes for Fault Prediction in Object-Oriented Systems,” IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 287–300, March-April 2008.