Towards Mining Informal Online Data to Guide Component-Reuse Decisions
Sanchit Karve1, Chris Scaffidi2
1 McAfee Software (formerly Oregon State University)
2 Oregon State University
In some sort of ideal world…

1. Software engineers could specify an application
   – In a trivially easy yet formally precise notation
2. Tools would find resources to build the application
   – Including components from repositories
     • That definitely have the needed quality attributes
3. Software engineers would assemble the application
   – And know for sure the application meets the specification
Easier said than done!
In this paper, we'll focus here.
Typical component repository
Getting from where we are today to somewhere closer to ideal
Strategy: Focus software engineers' attention
• Repositories currently provide what we call "cues"
  – Informal, potentially contradictory data to aid choices
• Can we use cues to estimate general reusability?
• If so, then software engineers' attention can be invested into analyzing promising components
Relevant work to date
• Gui & Scott (task-focused, requires access to code)
  – Estimated reusability based on 8 coupling metrics
  – Quantified reusability as the time required to reuse
  – Tested with one developer and 60 components
  – Correlation of 0.14 to 0.95 with various metrics

• OSSMETER (requires access to code history)
  – Will estimate qualities based on responsiveness of developers to bug reports, frequency of releases, number of downloads, and similar historical data
  – No studies to date?

• Our prior work (specific to scripting code)
  – Cues and machine learning to predict script reuse
  – Used to guide which cues we focused on in our studies…
Key questions
• Overall: Can we use cues to estimate reusability?
  – Focus on the data actually provided by the repository
  – Focus on perceived general/holistic reusability

• Specific research questions:
  – Q1: Which cues, if any, are mutually correlated?
  – Q2: Which combinations of mutually-correlated cues, if any, correlate with component reusability?
Study #1
• Which cues, if any, are mutually correlated?
• Procedure:
  – Retrieved 1200 randomly-chosen C# components from an online repository (CodeProject.com)
  – Repository provided documentation, online forum comments about components, and other informal data
  – Computed 24 cues (next slide), with data cleaning
  – Performed factor analysis to identify factors of cues (i.e., groups of cues) that were mutually correlated
Cues computed
Popularity and certification
(a) Average rating
(b) (Total # views) / (component age)
(c) (Total # bookmarks) / (component age)
(d) # external links that link to component’s page
(e) Certified by CodeProject as meeting standards
(f) Total # downloads

Opinions reflected in comments
(g) Total # comments
(h) # positive opinion words in comments
(i) # negative opinion words in comments
(j) # comments flagged as “bug report”
(k) # comments flagged as “rant”
(l) # comments flagged as “question”

Documentation, licensing and compatibility
(m) # code examples in documentation
(n) # paragraphs in documentation
(o) # screenshots in documentation
(p) Unrestricted software license offered
(q) # compatible execution platforms

Component author information
(r) # previous components released by author
(s) Author’s avg. rating on previous components
(t) # total components released by author
(u) Author’s avg. rating on all components
(v) Component author name given
(w) Author describes self as senior developer
(x) Author identifies self as American
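As an illustration of how cues such as (b), (h), and (i) could be computed from scraped repository data, here is a minimal Python sketch. The opinion-word lists, field values, and parsing rules are illustrative assumptions, not the implementation the authors used.

```python
# Hypothetical opinion-word lists; a real study would use a curated lexicon.
POSITIVE = {"great", "excellent", "useful", "thanks", "works"}
NEGATIVE = {"broken", "crash", "bug", "useless", "fails"}

def opinion_counts(comments):
    """Cues (h) and (i): count positive and negative opinion words in comments."""
    pos = neg = 0
    for comment in comments:
        for word in comment.lower().split():
            w = word.strip(".,!?-")
            if w in POSITIVE:
                pos += 1
            elif w in NEGATIVE:
                neg += 1
    return pos, neg

def views_per_age(total_views, age_days):
    """Cue (b): total views normalized by component age."""
    return total_views / age_days

# Made-up example data for one component's comment thread.
comments = ["Great component, works perfectly!", "Crashes on startup - bug?"]
print(opinion_counts(comments))   # → (2, 1)
print(views_per_age(4260, 213))   # → 20.0
```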
Data cleaning
• Removed cues (columns) and components (rows) from our data set that had no useful information
  – Dropped three columns (j through l) that categorized component-consumer comments as “bug,” “rant,” or “question”: not one comment was explicitly marked
  – Removed 24 components with incomplete information
• Result: 21 cues over 1176 components
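The cleaning step described above (drop uninformative cue columns, drop components with missing values) can be sketched as follows. This is a pure-Python stand-in with made-up cue names; the slide does not describe the actual tooling used.

```python
def clean(rows, cue_names):
    """rows: list of dicts mapping cue name -> value (None = missing)."""
    # Drop rows (components) that have any missing cue value.
    complete = [r for r in rows if all(r[c] is not None for c in cue_names)]
    # Drop columns (cues) where every remaining value is zero,
    # i.e. the cue carries no information (like columns j-l in the study).
    kept = [c for c in cue_names if any(r[c] != 0 for r in complete)]
    return [{c: r[c] for c in kept} for r in complete], kept

# Toy data set: the "bug_flags" cue was never marked, and one row is incomplete.
rows = [
    {"rating": 4, "bug_flags": 0, "views": 120},
    {"rating": None, "bug_flags": 0, "views": 80},
    {"rating": 5, "bug_flags": 0, "views": 300},
]
cleaned, kept = clean(rows, ["rating", "bug_flags", "views"])
print(kept)          # → ['rating', 'views']
print(len(cleaned))  # → 2
```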
Exploratory factor analysis
• Purpose: identify a small number of underlying factors that summarize the covariance among different variables (cues in our case)
• Retained factors that met a support heuristic (Cattell scree test, with the optimal-coordinates criterion)
• Result: 3 groups of 10 cues (total)
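The retention step can be illustrated by examining the eigenvalues of the cue correlation matrix. The sketch below uses synthetic data and a simple eigenvalue-greater-than-one cutoff as a stand-in for the Cattell scree test with optimal coordinates that the study actually applied.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 1176 components x 6 cues, where cues 0-2 share one latent factor.
latent = rng.normal(size=(1176, 1))
cues = np.hstack([latent + 0.3 * rng.normal(size=(1176, 3)),
                  rng.normal(size=(1176, 3))])

# Eigenvalues of the correlation matrix, largest first; a sharp drop-off
# ("scree") separates real factors from noise.
corr = np.corrcoef(cues, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser-style cutoff as a simplified retention rule.
n_factors = int(np.sum(eigvals > 1.0))
print(eigvals.round(2))
print(n_factors)
```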
Factors identified
Factor 1: Excitement
F1 = 0.76c + 0.39d + 0.95g + 0.82h
(c) (Total # bookmarks) / (component age)
(d) # external links that link to component’s page
(g) Total # comments
(h) # positive opinion words in comments

Factor 2: Experience
F2 = 0.81r + 0.30s + 1.03t
(r) # previous components released by author
(s) Author’s avg. rating on previous components
(t) # total components released by author

Factor 3: Documentation
F3 = 0.53m + 0.64o + 0.38p
(m) # code examples in documentation
(o) # screenshots in documentation
(p) Unrestricted software license offered
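The factor scores are weighted sums of cue values, using the loadings reported on this slide. The cue values below are made up for illustration, and the sketch glosses over any standardization of cues that would precede weighting in practice.

```python
def excitement(bookmarks_per_age, ext_links, n_comments, pos_words):
    """F1, using the slide's loadings for cues c, d, g, h."""
    return (0.76 * bookmarks_per_age + 0.39 * ext_links
            + 0.95 * n_comments + 0.82 * pos_words)

def experience(prev_components, prev_avg_rating, total_components):
    """F2, using the slide's loadings for cues r, s, t."""
    return 0.81 * prev_components + 0.30 * prev_avg_rating + 1.03 * total_components

def documentation(m, o, p):
    """F3, using the slide's loadings; m, o, p are the slide's cue letters."""
    return 0.53 * m + 0.64 * o + 0.38 * p

# Hypothetical cue values for one component.
f1 = excitement(0.2, 3, 12, 7)
f2 = experience(4, 4.5, 6)
f3 = documentation(5, 2, 1)
print(round(f1, 2), round(f2, 2), round(f3, 2))  # → 18.46 10.77 4.31
```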
Study #2
• Which combinations of mutually-correlated cues, if any, correlate with component reusability?
• Procedure:
  – Interviewed software engineers about components they had previously used for actual everyday work
    • Quantitative rating of component reusability
    • Qualitative statements about reusability
  – Computed 3 factors for each component
  – Multiple linear regression to determine the relationship between perceived reusability and factors
  – Grounded analysis of qualitative statements to check for qualitative agreement with statistical results
Data for statistical analysis
• Relevant interview questions:
  – “What’s the URL of a component that you tried to reuse?”
  – “On a scale of 1 to 5, where 1 is very hard to reuse and 5 is very easy to reuse, how would you rate the reusability of this component?”

• 11 participants, who mentioned 38 components
• We could not locate 2 components on the repository
• Leaving 36 components for which we computed factors from cues
Statistical results
• Multiple regression of perceived reusability versus factors: P < 0.01, r² = 0.35
• Factor-specific P values:– Excitement: P=0.72… not significant– Experience: P=0.03– Documentation: P=0.01
• Conclusion: 1/3 of overall variance was explained by the Experience and Documentation factors
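The regression step can be sketched with ordinary least squares: perceived reusability (1-5) regressed on the three factor scores. The data below are synthetic (the study used the 36 interview-rated components), so the coefficients and r² here illustrate the method only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 36
factors = rng.normal(size=(n, 3))        # F1, F2, F3 per component (synthetic)
true_beta = np.array([0.0, 0.5, 0.7])    # Excitement weak, others predictive
reusability = 3 + factors @ true_beta + 0.5 * rng.normal(size=n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), factors])
beta, *_ = np.linalg.lstsq(X, reusability, rcond=None)

# Coefficient of determination (r²) from residual and total sums of squares.
pred = X @ beta
ss_res = np.sum((reusability - pred) ** 2)
ss_tot = np.sum((reusability - reusability.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(beta.round(2), round(r2, 2))
```

A full analysis would also report per-coefficient P values (e.g. via statsmodels' OLS), which is how the slide separates the significant factors from Excitement.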
Qualitative analysis: Reasons for reuse
• Interview question:
  – “When you downloaded the component, what led you to believe that it was reusable?”

• Reasons stated:
  – 36 components discussed
  – 63 mentions of some reason why participants chose certain components for reuse
  – Most reasons aligned well with our factors (documentation dominant)
[Chart: mentions by interviewees of reasons for reuse, grouped by factor (y-axis: 0–60)]

Documentation:
– Documentation said component was extendable (1)
– A demonstration application was provided (2)
– Lists features I was interested in learning about (5)
– Screenshots were provided (9)
– Code examples were provided (11)
– Clear and detailed documentation text (13)
– Title/description similar to what I wanted (14)

Excitement:
– Successful reuse was reflected in comments (2)

Experience:
– Good prior work by the component producer (1)

Other:
– Used an API familiar to me (1)
– No other alternative found (4)
Qualitative analysis: Problems during reuse
• Interview question:
  – “What problems, if any, did you encounter while trying to reuse this component?”

• Problems stated:
  – 36 components were discussed
  – 21 components presented at least one problem
  – 36 mentions of some problem during reuse
Mentions of problems during reuse
[Chart: breakdown of the 36 mentions of problems during reuse]
– Not sure how the component works (15)
– Not sure how to actually use the component (9)
– Component too complex to understand (7)
– Component did not work as advertised (2)
– Other (3)
Limitations and threats to validity
• We did not consider the source code
  – Could be a good source of information in future work

• We did not try to measure what people were feeling
  – Might affect what the "Excitement" factor means
  – But probably unimportant, as it was not a statistically significant predictor of perceived reusability anyway

• Different repositories present different cues
  – Repository-specific factor weights would need to be computed using a method similar to our own
  – We anticipate reusability could be estimated if a repository presents software engineers with cues related to documentation and experience
Conclusions with implications for repository design
• Excitement did not predict reusability
  – Perhaps the feature for entering comments and reviews should be removed from the repository

• Experience predicted reusability, but it was not mentioned much in answers about reuse
  – Perhaps experience-related cues should be displayed more succinctly and prominently

• Documentation predicted reusability, and it was mentioned a lot in answers about reuse
  – Perhaps repositories should promote components that include cues which are correlated with reusability

• Many cues were members of no factor
  – Perhaps remove these from the repository
Future research opportunities
• Expand the set of cues considered
  – Including cues based on the code

• Develop more specialized estimation models
  – For example, specialized cues could be collected and used to characterize the performance, security, usability, reliability, and maintainability of components

• Integrate the informal, cue-based approach with formal methods aimed at assuring quality attributes
  – For example, if a cue-based estimate indicates high general reusability but a low quality attribute of concern, advise formal verification of that quality
Stepping back: Implications for CBSE
• This study is a reminder, once again, that software engineers are human beings
– Specifically: An impediment to component reuse is the need for human beings to understand the components
– And specifically: An impediment to reuse is the absence of formalized information about components (plus, the difficulty of getting that information for all components)
• How can we help to overcome these impediments?
Acknowledgements and thanks
• To CBSE for this opportunity to present
• To CodeProject.com for help with studies
• To Mary Shaw at CMU for useful feedback
• To NSF for funding (1101107)