Understanding the Rationale for Updating a Function's Comment


Haroon Malik, Istehad Chowdhury, Hsiao-Ming Tsou, Zhen Ming Jiang, Ahmed E. Hassan
School of Computing, Queen’s University, Canada


Documentation is vital for the successful evolution of a software system


Why understand the rationale for updating a comment?


Because…

Reduce the effort needed to understand code

Reduce maintenance cost

Prevent bugs

Increase reliability


Likelihood of updating a comment

Function 1.

    function incrementValue($val) {
        return ($val + 1);
    }

Function 2.

    function processInput($val) {
        // loop 11 times.
        for ($i = 0; $i < 10; $i++) {
            // loop executes for the upper bound of J
            for ($j = 0; $j < 10; $j++) {
                $val = ($val | $i) << 2;
                $val = $val & $j << 2;
            }
        }
        return $val;
    }

Which one?


It depends!


Study Dimensions

• Modified function characteristics (8 attributes)
  – Long vs. short functions
  – Long vs. short function names
  – Well-documented functions
  – Complex vs. simple functions (# of control statements)

• Change characteristics (8 attributes)
  – Complex vs. simple change
  – Large vs. small change

• Time and code ownership characteristics (9 attributes)
  – Do habits change over time? Weekdays vs. weekends
  – Same developer that changed it last time


Comment Update?

Yes? No?

Modeled as a classification problem


Measuring Performance

                 Classified As
True Class       YES     NO
YES              a       b
NO               c       d

We measure the overall misclassification rate = (b + c) / (a + b + c + d)
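The rate above can be computed directly from the four cells of the table. The following is a minimal sketch; the function name and the example counts are hypothetical and chosen only to illustrate the arithmetic, not taken from the study.

```python
def misclassification_rate(a: int, b: int, c: int, d: int) -> float:
    """Return (b + c) / (a + b + c + d) for a 2x2 confusion matrix.

    a: true YES classified as YES
    b: true YES classified as NO
    c: true NO classified as YES
    d: true NO classified as NO
    """
    total = a + b + c + d
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (b + c) / total


# Hypothetical counts, not results from the case study.
rate = misclassification_rate(a=40, b=10, c=10, d=40)
print(rate)  # (10 + 10) / 100 = 0.2
```

Only the off-diagonal cells b and c count as errors, so the measure treats a missed comment update and a falsely predicted one as equally costly.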


Need

• Explainable model
• Resistant to noise
• Handles correlated attributes
• Minimum configuration


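Random Forests

Project comment update history → Data set → Random samples → Random trees → Vote → Prediction

Each tree casts a vote (in the slide's example: Yes, No, No, No); the majority vote is the prediction.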
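The sample, train, vote flow can be sketched in a few lines of plain Python. This is a simplified illustration, not the configuration used in the study: each "tree" here is a one-level decision stump on one randomly chosen attribute, and the two attributes and all data values are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, attr):
    """Find the threshold on one attribute that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[attr] for row in rows}):
        for above in ("YES", "NO"):
            below = "NO" if above == "YES" else "YES"
            errors = sum(
                (above if row[attr] >= threshold else below) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, attr, threshold, above, below)
    return best[1:]  # (attr, threshold, label_if_above, label_if_below)


def predict_stump(stump, row):
    attr, threshold, above, below = stump
    return above if row[attr] >= threshold else below


def train_forest(rows, labels, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        # Random sample: draw rows with replacement (a bootstrap sample).
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Each stump looks at one randomly chosen attribute.
        attr = rng.randrange(len(rows[0]))
        forest.append(train_stump(sample_rows, sample_labels, attr))
    return forest


def predict_forest(forest, row):
    # Every tree votes; the most common label is the prediction.
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical attributes: (changed dependencies, number of comments).
rows = [(1, 0), (2, 1), (1, 2), (3, 1), (8, 9), (9, 12), (10, 8), (12, 10)]
labels = ["NO", "NO", "NO", "NO", "YES", "YES", "YES", "YES"]

forest = train_forest(rows, labels, n_trees=25, rng=random.Random(7))
print(predict_forest(forest, (20, 30)))
print(predict_forest(forest, (0, 0)))
```

An odd number of trees avoids tied votes in a two-class problem. Because each tree sees a different random sample, a few badly trained trees are outvoted by the rest, which is one reason the approach tolerates noise.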


Finding Top Attributes

• Sensitivity analysis for a particular attribute
• Randomly change the value in all samples
• Re-classify and compare performance
  – Drop in performance is relative to the importance of the attribute
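The sensitivity analysis above can be sketched as follows. This is an illustration under stated assumptions: "randomly change the value" is implemented by shuffling one attribute's column across the samples, the classifier is a hypothetical fixed rule rather than a trained forest, and the data values are invented.

```python
import random


def misclassification_rate(classify, rows, labels):
    wrong = sum(classify(row) != label for row, label in zip(rows, labels))
    return wrong / len(rows)


def attribute_importance(classify, rows, labels, attr, rng, repeats=20):
    """Average rise in misclassification rate after shuffling one attribute."""
    baseline = misclassification_rate(classify, rows, labels)
    increase = 0.0
    for _ in range(repeats):
        # Randomly reassign this attribute's values across all samples.
        column = [row[attr] for row in rows]
        rng.shuffle(column)
        perturbed = [
            row[:attr] + (value,) + row[attr + 1:]
            for row, value in zip(rows, column)
        ]
        # Re-classify and compare against the unperturbed performance.
        increase += misclassification_rate(classify, perturbed, labels) - baseline
    return increase / repeats


# Hypothetical attributes: (changed dependencies, number of comments).
rows = [(1, 0), (2, 1), (1, 2), (3, 1), (8, 9), (9, 12), (10, 8), (12, 10)]
labels = ["NO", "NO", "NO", "NO", "YES", "YES", "YES", "YES"]


def classify(row):
    # Stand-in classifier that only looks at attribute 0.
    return "YES" if row[0] >= 5 else "NO"


rng = random.Random(3)
importance = [attribute_importance(classify, rows, labels, a, rng) for a in range(2)]
print(importance)
```

Shuffling attribute 1 cannot change any prediction here because the stand-in classifier ignores it, so its importance is zero; shuffling attribute 0 breaks the link between value and label and raises the error rate.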

Case Study

• Used 4 open source projects with over 39 years of development
  – PostgreSQL, FreeBSD, Gcluster and GCC

• Conducted 5 experiments
  – 1 for each dimension
  – 1 for all attributes of each project
  – 1 for total combined attributes of all projects


Exp. #1 Characteristics of changed function

• Intuition
  – Modifications to complex functions are trickier and more likely to introduce integration bugs

• Findings
  – Likelihood of comment update is higher in functions
    • With a large number of comments
    • That are complex


Exp. #2 Characteristics of the change

• Intuition
  – More extensive and complex changes will increase the probability that a comment will get updated

• Findings
  – Likelihood of comment update is higher for changes
    • That are bug fixes
    • With a large number of changed dependencies
    • Which increase the complexity of a function (control statements)


Exp. #3 Change time and code-ownership

• Intuition
  – To see if time has any impact on a developer's tendency to update a comment
  – To highlight the relation of a function with its developer

• Findings
  – Likelihood of comment update
    • Depends on weekday: developers are reluctant to update comments on certain weekdays
    • Does not depend on developer: a non-creator of the function will update comments too


Exp. #4 All attributes

• Intuition
  – To find the general trend across all attributes instead of a specific trend per dimension

• Findings
  – The top attributes are consistent across projects
  – The top attributes are from the changed function and change characteristics dimensions
    • Number of changed dependencies
    • Percentage of changed dependencies
    • Total number of comments


Exp. #5 All Projects

• Intuition
  – Determine the most influential attributes across all projects

• Added an extra attribute “Project Name”

• Findings
  – Project name did not bubble up as an important attribute


How well did we do?


Numbers Speak

• Performance of the classifier improves when data from all projects is combined. Overall misclassification rate ~ 20%

Conclusion


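Random Forests

[Diagram: random cases are drawn from the training set into n samples (1, 2, …, n); the classification algorithm builds n classifiers (1, 2, 3, …, n); each classifier labels the test set, giving n labels (L1, L2, L3, …, Ln); a vote over these labels gives the final label Lvote.]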
