SOFTWARE ANALYSIS & INTELLIGENCE LAB Mining Development Repositories to Study the Impact of Collaboration on Software Systems Nicolas Bettenburg [email protected] 1 Wednesday, 11 April, 12

Mining Development Repositories to Study the Impact of Collaboration on Software Systems


DESCRIPTION

Talk given at the 2011 ESEC/FSE


Page 1: Mining Development Repositories to Study the Impact of Collaboration on Software Systems

SOFTWARE ANALYSIS & INTELLIGENCE LAB

Mining Development Repositories to Study the Impact of Collaboration on Software Systems

Nicolas Bettenburg [email protected]

Wednesday, 11 April, 12


Software Development is a Social Activity

Source code stands in direct relation to organizational structure. [Conway:Datamation:1968]

Developers spend a large part of their workday communicating with fellow developers. [Begel:ICSE:2010]


Communication is Critical for Success

Communication is the most referenced problem in distributed development.

[Bird:ACMComm:2009][Grinter:GROUP:1999]


Research Hypothesis

“The collaboration between stakeholders impacts the code quality and the development community of a software system.”


Proposed Approach

I. Extraction of communication data

II. Study impact on software quality

III. Study impact on development community


Available Knowledge in Data

Version Control Systems, Mailing Lists, Issue Tracking Systems


Communication Data
• Source Code Comments
• Change-Log Messages
• Developer Emails & Discussions
• Support Dialogues


Communication Data Exists Mainly as Unstructured Data

Extraction and processing of unstructured data is challenging. [MUD:Workshop:2010]

Example (Eclipse #150222):

In this report, you have defined a parameter named blocksize, which is given a value of "7|D|1|D". In open script of data set, there are below lines code:

<script begin>token=Packages.java.util.StringTokenizer(params["blocksize"],"|");vec=new Packages.java.util.Vector();while(token.hasMoreTokens()){ vec.addElement(token.nextToken());}params["DateRange"]=java.lang.Integer.parseInt(vec.elementAt(0));</script end>

Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0) is "7", and then it can not be parsed to int value. In 1.0.1, the value of params["blocksize"] might be 7|D|1|D, so it can be parsed to int value of 7.


Mining Collaboration Data

[Bettenburg:ICPC:2011]

A Lightweight Approach to Uncover Technical Information in Unstructured Data

Nicolas Bettenburg, Bram Adams, Ahmed E. Hassan

Software Analysis and Intelligence Lab

Queen’s University

Kingston, Ontario, Canada

Email: {nicbet,bram,ahmed}@cs.queensu.ca

Michel Smidt

Dept. of Computer Science

University of Bremen

Bremen, Germany

Email: [email protected]

Abstract: Developer communication through email, chat, or issue report comments consists mostly of largely unstructured data, i.e., natural language text, mixed with technical information such as project-specific jargon, abbreviations, source code patches, stack traces and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide range of applications from establishing traceability links to creating project-specific vocabularies. However, the free-style delimiters between natural language and technical content make the mining of technical artifacts challenging. As a first step towards a general-purpose technique to extracting all kinds of technical information from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well-understood, fast, readily available across platforms and impartial to different kinds of technical artifacts. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical information in unstructured data.

Keywords: text mining, language analysis, unstructured data, technical information.

I. INTRODUCTION

Every software system has a unique history of design decisions, software changes, as well as development and maintenance effort. This history is captured throughout the development process in the variety of repositories used to store data during the collaborative development process. As this data contains the knowledge and rationale behind the evolution of a software system, it is valuable for many different fields, in particular program comprehension, and hence should be made available to practitioners and researchers alike.

However, much of the information surrounding the development process comes in the form of unstructured data [1], which is conceptually different from the sources of structured data that researchers have used in previous research. Structured data (e.g., source code) is well-defined and can be readily parsed and understood by computer machinery. Unstructured data (e.g., developer communication, issue reports, documentation, email or meeting notes [2]), consists of a mixture of natural language text and technical information, such as code fragments, abbreviations, references to objects in the source code, file names, logging information or project-specific terms. As such, mining unstructured data is challenging: it is meant for the exchange of information between humans, rather than automated processing using computer machinery. Figure 1 presents an example of technical information commonly found in unstructured data.

Build ID: M20070212-1330

Steps To Reproduce:
1. Create a plugin for eclipse that includes a key binding for "M1+S" (ie. Alt+S) where S is any letter that is used as a mnemonic in one of the top level menus. Since eclipse uses "S" as the mnemonic for Help > &Software Updates, "S" is sufficient.
2. Launch the plugin as part of Eclipse IDE
3. Press Alt+H to bring down the Help menu (to go along with our example in #1)

BUG: Notice "Software Updates" is missing its mnemonic.

More information: The code after "if (callback.isAcceleratorInUse(SWT.ALT | character))" inside Eclipse's MenuManager.java removes the mnemonic, but it seems like Eclipse should be checking "isAcceleratorInUse" only for top level menumanagers like File,Edit,...,Help, etc. :

/* (non-Javadoc)
 * @see org.eclipse.jface.action.IContributionItem#update(java.lang.String)
 */
public void update(String property) {
    IContributionItem items[] = getItems();
    for (int i = 0; i < items.length; i++) {
        items[i].update(property);
    }
    [...]
}

Any status on this bug?

I'd consider any contributions for M6 (API) or M7 (non-API) [...]

A 3.5 fix would be to make that behaviour optional in MenuManager with API and off by default early in 3.5, and to have the WorkbenchActionBuilder contributed MenuManagers and actionSets/editorActions contributed MenuManagers turn it on (if I can find MenuManagers in the correct place).

I'd like us to work with the SWT team to make sure we understand what the correct platform behavior is, and make sure that we aren't getting in the way of that. The current behavior (i.e. turning off mnemonics) seems odd to me, in general. If we're going to fix this, we should fix it properly.

Figure 1. Examples of technical information uncovered by a prototype implementation of the approach proposed in this paper. (Eclipse Platform Bug #208626).

Recent approaches for discovering technical information in unstructured data [3]–[5] have focussed on recognizing and extracting only particular types of technical information, such as class names [3], stack traces, or patches [5]. In order to resolve the inherent ambiguities between natural language text and technical information, these approaches are highly specialized and tailored towards their specific use cases, and limited in their scope. Furthermore, many kinds of technical information (e.g, project-specific jargon or abbreviations) cannot be extracted by any of the existing techniques.

As a first step towards a lightweight, general-purpose approach to uncovering technical information in unstructured data, this work presents an approach that makes use of state-of-the-art tools for checking and correcting the spelling and grammar of electronically written texts. Technical information is conceptually different from natural language text: it often consists of words that are not part of standard language dictionaries, violate grammatical conventions, and do not respect morphological language rules. These characteristics render modern spellcheckers ideal candidates for lightweight

• Use Spellchecking
• Empirical validation
• Improved on state of the art
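The spell-checking idea can be illustrated with a minimal sketch. This is not the paper's implementation: the word list `ENGLISH_WORDS` is a tiny stand-in for a real spell checker, and the 0.5 threshold and the function names are assumptions made for illustration. A line is labeled technical when most of its tokens are unknown to the dictionary or contain non-alphabetic characters.

```python
import re

# Stand-in dictionary (assumption): a real setup would query a spell checker.
ENGLISH_WORDS = {
    "the", "code", "after", "inside", "removes", "mnemonic", "but", "it",
    "seems", "like", "should", "be", "checking", "only", "for", "top",
    "level", "any", "status", "on", "this", "bug", "a", "is", "missing", "its",
}

TOKEN = re.compile(r"\S+")


def is_misspelled(token: str) -> bool:
    """Flag a token whose stripped, lower-cased form is not a dictionary word."""
    word = token.strip(".,:;!?\"'()").lower()
    return not word.isalpha() or word not in ENGLISH_WORDS


def technical_ratio(line: str) -> float:
    """Fraction of whitespace-separated tokens that the dictionary rejects."""
    tokens = TOKEN.findall(line)
    if not tokens:
        return 0.0
    return sum(is_misspelled(t) for t in tokens) / len(tokens)


def classify(line: str, threshold: float = 0.5) -> str:
    """Label a line 'technical' if more than `threshold` of its tokens are rejected."""
    return "technical" if technical_ratio(line) > threshold else "natural"
```

With this stand-in dictionary, `classify("Any status on this bug?")` yields `"natural"`, while `classify("items[i].update(property);")` yields `"technical"`. A real dictionary would make the natural-language side far more robust; the per-line ratio is only one possible way to draw the boundary.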


Quantify Impact on Quality: Idea

1. Compute social metrics from the extracted communication data.

2. Measure relationships between the social metrics and post-release defects.


4 Dimensions of Measures
• Discussion CONTENT
• Social STRUCTURES
• Communication DYNAMICS
• Measures of WORKFLOW


Conceptual Approach

Timeline (time axis, two 6-month windows):
• 6 months: Measure Discussion Metrics
• 6 months: Measure Post-Release Bugs

LINK USING STATISTICAL MODELS
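The two-window idea can be sketched as follows. This is a minimal illustration, not the models used in the work: it assumes the first window ends at the release date, uses message counts per component as the only discussion metric, and links the two windows with a simple least-squares line. All function names and the event format `(component, day)` are hypothetical.

```python
from datetime import timedelta

WINDOW = timedelta(days=182)  # roughly 6 months (assumption)


def count_in_window(events, start, end):
    """Count (component, day) events per component with start <= day < end."""
    counts = {}
    for component, day in events:
        if start <= day < end:
            counts[component] = counts.get(component, 0) + 1
    return counts


def least_squares(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b).

    Requires at least two distinct x values, otherwise the slope is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def link(release, messages, bugs):
    """Relate discussion volume before a release to bug counts after it."""
    before = count_in_window(messages, release - WINDOW, release)
    after = count_in_window(bugs, release, release + WINDOW)
    components = sorted(set(before) | set(after))
    xs = [before.get(c, 0) for c in components]
    ys = [after.get(c, 0) for c in components]
    return least_squares(xs, ys)
```

Calling `link(release, messages, bugs)` with a release date and two lists of `(component, date)` pairs returns the intercept and slope of the fitted line. A positive slope would only indicate an association in that data set, not a causal effect.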


Findings of our work

(1) Social metrics explain post-release defects as well as code metrics do.

(2) The explanatory power of social metrics and code metrics is cumulative when the two are combined.

(3) We identify factors that have positive and negative relationships with defects.

[ICPC 2010] (Best Paper) [JEMSE?]


Available Knowledge in Data

Code Review Systems, Mailing Lists, Issue Tracking Systems

Data on Management of Code Contributions


Contribution Management

Patch → Submission → Review → Verification → Integration → Project Repository
• Review: Feedback or OK
• Verification: Feedback or OK


Studying Impact on Community through Contribution Management

Goal: Study how contributors, reviewers, verifiers, and the software are impacted by communication (anomalies) through statistical models.

Example: Reviewers leaving the community due to lack of feedback


Available Knowledge in Data

Version Control Systems, Mailing Lists, Issue Tracking Systems

• Workflow Information
• Social Networks
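One way such a social network could be derived from issue-tracker data is sketched below. It is an assumption made for illustration, not the construction used in the talk: two people are linked whenever they commented on the same issue, and the edge weight counts the issues they share. The function names and the `(issue_id, author)` input format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(comments):
    """Build a co-discussion network from (issue_id, author) pairs.

    Returns a dict mapping a sorted (author_a, author_b) tuple to the
    number of issues on which both authors commented.
    """
    participants = defaultdict(set)
    for issue_id, author in comments:
        participants[issue_id].add(author)

    edges = defaultdict(int)
    for authors in participants.values():
        for a, b in combinations(sorted(authors), 2):
            edges[(a, b)] += 1
    return dict(edges)


def degree(edges):
    """Number of distinct discussion partners per author."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)
```

Computing the network separately for successive time windows would give a series of snapshots whose changes can then be compared.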


Evolution of Code-Knowledge Communities

[Figure: network graph whose nodes are contributor user names from the mozilla project; additional labels: UI, JavaScriptEngine, XML Parser, Internet Explorer.]


Thesis Progress

• Tools and techniques for mining communication repositories
• Empirical validation of presented tools and techniques
• Empirical validation of relationship between collaboration and software quality
• Empirical validation of relationship between collaboration and development teams


Points for Discussion

• How to evaluate code-knowledge communities (ground truth)?
• Applicability to industrial settings (almost no communication data records available)?
• Extend work to defect prediction?
• Practical implications: management, moderation, staffing, ...?
