Towards Mining Informal Online Data to Guide Component-Reuse Decisions Sanchit Karve 1 , Chris Scaffidi 2 1 McAfee Software (formerly Oregon State University) 2 Oregon State University

Towards Mining Informal Online Data to Guide Component-Reuse Decisions

Sanchit Karve1, Chris Scaffidi2

1 McAfee Software (formerly Oregon State University)
2 Oregon State University

In some sort of ideal world…

1. Software engineers could specify an application
   – In a trivially easy yet formally precise notation

2. Tools would find resources to build the application
   – Including components from repositories
      • That definitely have the needed quality attributes

3. Software engineers would assemble the application
   – And know for sure the application meets specification

Easier said than done!

In this paper, we'll focus here.

Typical component repository

Getting from where we are today to somewhere closer to ideal

Strategy: Focus software engineers' attention

• Repositories currently provide what we call "cues"
   – Informal, potentially contradictory data to aid choices
• Can we use cues to estimate general reusability?
• If so, then software engineers' attention can be invested in analyzing promising components

Relevant work to date

• Gui & Scott (task-focused, requires access to code)
   – Estimated reusability based on 8 coupling metrics
   – Quantified reusability as the time required to reuse
   – Tested with one developer and 60 components
   – Correlation of 0.14 to 0.95 with various metrics
• OSSMETER (requires access to code history)
   – Will estimate qualities based on responsiveness of developers to bug reports, frequency of releases, numbers of downloads, and similar historical data
   – No studies to date?
• Our prior work (specific to scripting code)
   – Cues and machine learning to predict script reuse
   – Used to guide which cues we focus on in our studies…

Key questions

• Overall: Can we use cues to estimate reusability?
   – Focus on the data actually provided by the repository
   – Focus on perceived general/holistic reusability
• Specific research questions:
   – Q1: Which cues, if any, are mutually correlated?
   – Q2: Which combinations of mutually-correlated cues, if any, correlate with component reusability?

Study #1

• Which cues, if any, are mutually correlated?
• Procedure:
   – Retrieved 1200 randomly-chosen C# components from an online repository (CodeProject.com)
   – Repository provided documentation, online forum comments about components, other informal data
   – Computed 24 cues (next slide), with data cleaning
   – Performed factor analysis to identify factors of cues (i.e., groups of cues) that were mutually correlated

Cues computed

Popularity and certification
(a) Average rating
(b) (Total # views) / (component age)
(c) (Total # bookmarks) / (component age)
(d) # external links that link to component’s page
(e) Certified by CodeProject as meeting standards
(f) Total # downloads

Opinions reflected in comments
(g) Total # comments
(h) # positive opinion words in comments
(i) # negative opinion words in comments
(j) # comments flagged as “bug report”
(k) # comments flagged as “rant”
(l) # comments flagged as “question”

Documentation, licensing and compatibility
(m) # code examples in documentation
(n) # paragraphs in documentation
(o) # screenshots in documentation
(p) Unrestricted software license offered
(q) # of compatible execution platforms

Component author information
(r) # previous components released by author
(s) Authors’ avg. rating on previous components
(t) # total components released by author
(u) Authors’ avg. rating on all components
(v) Component author name given
(w) Author describes self as senior developer
(x) Author identifies self as American
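As a rough illustration of how a few of these cues might be derived from scraped repository data, here is a minimal Python sketch. The record layout (`views`, `bookmarks`, `age_days`, `comments`), the age unit, and both word lists are assumptions made for the example; the slides do not say which opinion lexicon or age unit was used.

```python
import re

# Hypothetical opinion lexicons; the slides do not name the word lists used.
POSITIVE_WORDS = {"great", "excellent", "useful", "thanks", "good"}
NEGATIVE_WORDS = {"bug", "broken", "crash", "useless", "bad"}


def count_opinion_words(comments, lexicon):
    """Count tokens across all comments that appear in the given lexicon."""
    total = 0
    for comment in comments:
        tokens = re.findall(r"[a-z']+", comment.lower())
        total += sum(1 for token in tokens if token in lexicon)
    return total


def compute_cues(component):
    """Compute cues (b), (c), (g), (h), and (i) for one component record."""
    age = component["age_days"]  # assumed unit: days since publication
    comments = component["comments"]
    return {
        "b_views_per_age": component["views"] / age,
        "c_bookmarks_per_age": component["bookmarks"] / age,
        "g_total_comments": len(comments),
        "h_positive_words": count_opinion_words(comments, POSITIVE_WORDS),
        "i_negative_words": count_opinion_words(comments, NEGATIVE_WORDS),
    }


# Made-up component record, for illustration only.
example = {
    "views": 12000,
    "bookmarks": 60,
    "age_days": 400,
    "comments": ["Great article, thanks!", "Found a bug in the parser."],
}
cues = compute_cues(example)
```

Normalizing views and bookmarks by component age, as cues (b) and (c) do, keeps older components from looking more popular simply because they have been online longer.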

Data cleaning

• Removed cues (columns) and components (rows) from our data set that had no useful information
   – Removed three columns (j through l) that categorized component-consumer comments as “bug,” “rant,” or “question”; not one comment was explicitly marked
   – Removed 24 components with incomplete information
• Result: 21 cues over 1176 components
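The cleaning step can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' script: it treats a cue as uninformative when every component has the same value for it (as with a flag that was never set), and treats a component as incomplete when any remaining cue is `None`.

```python
def clean(rows, cue_names):
    """Drop uninformative cue columns, then drop rows with missing values."""
    # Assumed rule: a column carries no information if all of its
    # non-missing values are identical across components.
    informative = []
    for name in cue_names:
        values = {row[name] for row in rows if row[name] is not None}
        if len(values) > 1:
            informative.append(name)
    # Assumed rule: a component is incomplete if any kept cue is missing.
    kept_rows = [
        {name: row[name] for name in informative}
        for row in rows
        if all(row[name] is not None for name in informative)
    ]
    return informative, kept_rows


# Tiny made-up data set: cue j is never set, and one row lacks cue m.
rows = [
    {"g": 5, "j": 0, "m": 2},
    {"g": 9, "j": 0, "m": None},
    {"g": 1, "j": 0, "m": 4},
]
kept_cues, kept_rows = clean(rows, ["g", "j", "m"])
```

Applied to the study's data, the same two steps account for the reported sizes: 24 cues minus 3 unused columns leaves 21 cues, and 1200 components minus 24 incomplete ones leaves 1176.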

Exploratory factor analysis

• Purpose: identify a small number of underlying factors that summarize the covariance among different variables (cues in our case)

• Include factors if they meet a support heuristic (Cattell scree test, with optimal coordinates criterion)

• Result: 3 factors comprising 10 cues in total

Factors identified

Factor 1: Excitement
F1 = 0.76 c + 0.39 d + 0.95 g + 0.82 h
(c) (Total # bookmarks) / (component age)
(d) # external links that link to component’s page
(g) Total # comments
(h) # positive opinion words in comments

Factor 2: Experience
F2 = 0.81 r + 0.30 s + 1.03 t
(r) # previous components released by author
(s) Authors’ avg. rating on previous components
(t) # total components released by author

Factor 3: Documentation
F3 = 0.53 m + 0.64 o + 0.38 p
(m) # code examples in documentation
(n) # paragraphs in documentation
(o) # screenshots in documentation
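Each factor is a weighted sum of cues, so scoring a component is a small computation. The sketch below uses the weights exactly as printed in the three formulas (the F3 formula names cue p, so the sketch follows it). It assumes the cue values have already been put on a common scale, for example as z-scores; the slides do not state which scaling was applied.

```python
# Weights copied from the formulas for F1, F2, and F3 as printed on the slide.
FACTOR_WEIGHTS = {
    "excitement": {"c": 0.76, "d": 0.39, "g": 0.95, "h": 0.82},
    "experience": {"r": 0.81, "s": 0.30, "t": 1.03},
    "documentation": {"m": 0.53, "o": 0.64, "p": 0.38},
}


def factor_scores(cues):
    """Return the three factor scores for one component.

    `cues` maps cue letters to (assumed pre-scaled) numeric values.
    """
    return {
        factor: sum(weight * cues[letter] for letter, weight in weights.items())
        for factor, weights in FACTOR_WEIGHTS.items()
    }


# Made-up component whose scaled cue values are all 1.0.
scores = factor_scores({letter: 1.0 for letter in "cdghrstmop"})
```

With every cue set to 1.0, each score is simply the sum of that factor's weights, which makes the formulas easy to sanity-check.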

Study #2

• Which combinations of mutually-correlated cues, if any, correlate with component reusability?
• Procedure:
   – Interviewed software engineers about components they had previously used for actual everyday work
      • Quantitative rating of component reusability
      • Qualitative statements about reusability
   – Computed 3 factors for each component
   – Multiple linear regression to determine relationship between perceived reusability and factors
   – Grounded analysis of qualitative statements to check for qualitative agreement with statistical results

Data for statistical analysis

• Relevant interview questions:
   – “What’s the URL of a component that you tried to reuse?”
   – “On a scale of 1 to 5, where 1 is very hard to reuse and 5 is very easy to reuse, how would you rate the reusability of this component?”
• 11 participants, who mentioned 38 components
• We could not locate 2 components on the repository
• Leaving 36 components for which we computed factors from cues

Statistical results

• Multiple regression of perceived reusability versus factors: P<0.01, r²=0.35
• Factor-specific P values:
   – Excitement: P=0.72… not significant
   – Experience: P=0.03
   – Documentation: P=0.01
• Conclusion: about 1/3 of the overall variance in perceived reusability was explained, with Experience and Documentation as the significant factors
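For readers unfamiliar with the method, the following is a minimal pure-Python sketch of multiple linear regression by ordinary least squares, returning the coefficients and r². It illustrates the general technique only: the data are synthetic and constructed to be exactly linear, the helper names are invented for the example, and the per-factor P values reported above would additionally require t-tests that this sketch does not compute.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(factors, ratings):
    """Fit rating = b0 + b1*F1 + b2*F2 + b3*F3; return (coefficients, r2)."""
    x = [[1.0] + list(row) for row in factors]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, ratings)) for i in range(k)]
    coefs = solve(xtx, xty)  # normal equations: (X'X) b = X'y
    predicted = [sum(c * v for c, v in zip(coefs, r)) for r in x]
    mean_y = sum(ratings) / len(ratings)
    ss_res = sum((y - p) ** 2 for y, p in zip(ratings, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ratings)
    return coefs, 1.0 - ss_res / ss_tot


# Synthetic data generated from rating = 1 + 0*F1 + 0.5*F2 + 1.0*F3.
factors = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (2, 1, 0)]
ratings = [1.0, 1.0, 1.5, 2.0, 2.5, 1.5]
coefs, r2 = fit_ols(factors, ratings)
```

Because the synthetic ratings are an exact linear function of the factors, the fit recovers the generating coefficients and r² is 1; on real interview data, r² = 0.35 means the fitted model accounts for roughly a third of the variance in the ratings.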

Qualitative analysis: Reasons for reuse

• Interview question:
   – “When you downloaded the component, what led you to believe that it was reusable?”
• Reasons stated:
   – 36 components discussed
   – 63 mentions of a reason why participants chose certain components for reuse
   – Most reasons aligned well with our factors (documentation dominant)

[Figure: bar chart of reasons for reuse. Y-axis: mentions by interviewees (0 to 60). Bars: Documentation, Excitement, Experience, Other. Legend entries, with mention counts:]

• Documentation said component was extendable (1)
• A demonstration application was provided (2)
• Lists features I was interested in learning about (5)
• Screenshots were provided (9)
• Code examples were provided (11)
• Clear and detailed documentation text (13)
• Title/description similar to what I wanted (14)
• Successful reuse was reflected in comments (2)
• Good prior work by the component producer (1)
• Used an API familiar to me (1)
• No other alternative found (4)
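The mention counts in the chart legend can be tallied to check them against the 63 mentions reported earlier. In the short Python sketch below, grouping the first seven legend entries under Documentation is an assumption based on the legend order and the wording of the reasons; the extracted slides do not spell out the grouping.

```python
# Reasons and mention counts, in the order they appear in the chart legend.
REASON_MENTIONS = [
    ("Documentation said component was extendable", 1),
    ("A demonstration application was provided", 2),
    ("Lists features I was interested in learning about", 5),
    ("Screenshots were provided", 9),
    ("Code examples were provided", 11),
    ("Clear and detailed documentation text", 13),
    ("Title/description similar to what I wanted", 14),
    ("Successful reuse was reflected in comments", 2),
    ("Good prior work by the component producer", 1),
    ("Used an API familiar to me", 1),
    ("No other alternative found", 4),
]

total_mentions = sum(count for _, count in REASON_MENTIONS)

# Assumption: the first seven legend entries make up the Documentation bar.
documentation_mentions = sum(count for _, count in REASON_MENTIONS[:7])
documentation_share = documentation_mentions / total_mentions
```

Under that grouping, documentation-related reasons account for 55 of the 63 mentions, which is consistent with the "documentation dominant" observation.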

Qualitative analysis: Problems during reuse

• Interview question:
   – “What problems, if any, did you encounter while trying to reuse this component?”
• Problems stated:
   – 36 components were discussed
   – 21 components presented at least one problem
   – 36 mentions of some problem during reuse

Mentions of problems during reuse

[Figure: chart of mentions of problems during reuse. Categories: Other; Component did not work as advertised; Component too complex to understand; Not sure how to actually use the component; Not sure how the component works. Data labels: 3, 2, 7, 9, 15.]

Limitations and threats to validity

• We did not consider the source code
   – Could be a good source of information in future work
• We did not try to measure what people were feeling
   – Might affect what the "Excitement" factor means
   – But probably unimportant, as it was not a statistically significant predictor of perceived reusability anyway
• Different repositories present different cues
   – Repository-specific factor weights would need to be computed using a method similar to our own
   – We anticipate reusability could be estimated if a repository presents software engineers with cues related to documentation and experience

Conclusions with implications for repository design

• Excitement did not predict reusability
   – Perhaps the feature for entering comments and reviews should be removed from the repository
• Experience predicted reusability, but it was not mentioned much in answers about reuse
   – Perhaps experience-related cues should be displayed more succinctly and prominently
• Documentation predicted reusability, and it was mentioned a lot in answers about reuse
   – Perhaps repositories should promote components that include cues which are correlated with reusability
• Many cues were members of no factor
   – Perhaps remove these from the repository

Future research opportunities

• Expand the set of cues considered
   – Including cues based on the code
• Develop more specialized estimation models
   – For example, specialized cues could be collected and used to characterize the performance, security, usability, reliability, and maintainability of components
• Integrate the informal, cue-based approach with formal methods aimed at assuring quality attributes
   – For example, if a cue-based estimation indicates high general reusability but a low quality attribute of concern, advise formal verification of that quality

Stepping back: Implications for CBSE

• This study is a reminder, once again, that software engineers are human beings

– Specifically: An impediment to component reuse is the need for human beings to understand the components

– And specifically: An impediment to reuse is the absence of formalized information about components (plus, the difficulty of getting that information for all components)

• How can we help to overcome these impediments?

Acknowledgements and thanks

• To CBSE for this opportunity to present
• To CodeProject.com for help with studies
• To Mary Shaw at CMU for useful feedback
• To NSF for funding (1101107)