1
Perspectives from “PROMISE”: lessons learned, issues raised
[email protected]
• 2002-2004: SE research chair, NASA
• 2004
– “Let's start a repo for SE data”
– Data mining in SE
– Upload data used in papers
• 2013
– 10 PROMISE conferences
– 100s of data sets
– 1000s of papers
– Numerous journal special issues
– Me: 1st and 3rd most cited papers in top SE journals (since 2007, 2009)
• 2014: PROMISE repo moving to NC State: tera-PROMISE
Tim Menzies with Burak Turhan, Ayse Basar Bener, Justin S. Di Stefano: On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14(5): 540-578 (2009)
Tim Menzies, Jeremy Greenwald, Art Frank: Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Trans. Software Eng. 33(1): 2-13 (2007)
http://promisedata.googlecode.com
2
Issues raised in my work on PROMISE: (note: more than “just” SE)
• Learning for novel problems
– Transfer learning from prior experience
• Locality
– Trust (and when to stop trusting a model)
– Anomaly detectors
– Incremental repair
• Privacy (by exploiting locality)
• Goal-oriented analysis (next gen multi-objective optimizers)
• Human skill amplifiers
– Data analysis patterns
– Anti-patterns in data science
3
Q: Are there general results in SE?
A: Hell, yeah (Turkish toasters to NASA rockets)
4
Lesson: reflect LESS on raw dimensions and more on underlying “shape” of the data
Transfer learning:
Across time
Across space
Tim Menzies with Ekrem Kocaguneli, Emilia Mendes: Transfer learning in effort estimation. Empirical Software Engineering (2014) 1-31
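The cross-company results above hinge on relevancy filtering: before learning from other projects' data, keep only the cross-company rows that lie nearest your own local data. A minimal sketch of that idea (the name `nn_filter`, Euclidean distance, and k=10 are illustrative assumptions, not the papers' exact settings):

```python
import numpy as np

def nn_filter(cross, local, k=10):
    """For each local row, keep its k nearest cross-company rows.

    Sketch of nearest-neighbor relevancy filtering: 'cross' and 'local'
    are numeric feature matrices (rows = modules, columns = metrics).
    """
    keep = set()
    for row in local:
        dists = np.linalg.norm(cross - row, axis=1)  # distance to every cross row
        keep.update(np.argsort(dists)[:k])           # indices of the k closest
    return cross[sorted(keep)]

# Tiny demo on synthetic data: 100 cross-company rows, 5 local rows
rng = np.random.default_rng(0)
cross = rng.normal(0, 1, (100, 3))
local = rng.normal(0, 0.1, (5, 3))
subset = nn_filter(cross, local, k=10)
print(subset.shape[0] <= 50)  # at most 5*10 distinct rows survive the filter
```

A learner trained on `subset` then sees only cross-company examples relevant to the local context.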
5
PROMISE, lessons learned:
Need to study the analysts more
• Lesson from ten years of PROMISE:
– Different teams
– Similar data
– Similar learners
– Wildly varying results
• Variance less due to:
– Choice of learner
– Choice of data set
– Choice of selected features
• Variance most due to teams
• “It doesn’t matter what you do but it does matter who does it.”
Martin Shepperd, David Bowes, Tracy Hall: Researcher Bias: The Use of Machine Learning in Software Defect Prediction. IEEE Trans. Software Eng. 40(6), June 2014, p. 603
6
Maybe there is more to data science than “I applied a learner to some data”.
For example: here’s the code I wrote for my latest paper.
7
Data analysis patterns
• Recommender systems for data mining scripts
• Method 1:
– Text mining on documentation
– Lightweight parsing of the scripts
– Raise a red flag if clusters of terms in the parse differ from the documentation
• Method 2: track script executions
– Learn “good” and “bad” combinations of tools
– “Good” = high accuracy, less CPU, less memory, less power
– Combinations = Markov chains of “this used before that”
• Case studies:
– ABB, software engineering analytics
– HPCC ECL, funded by LexisNexis
– Climate modeling in “R” (with Nagiza Samatova)
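Method 2's “Markov chains of this-used-before-that” can be sketched very simply: count tool-to-tool transitions in known-good runs, then score a new pipeline by the product of its transition probabilities. A toy illustration (tool names and the scoring rule are assumptions for the sketch, not the deployed system):

```python
from collections import defaultdict

def transition_probs(runs):
    """Estimate first-order Markov transition probabilities from tool sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for run in runs:
        for a, b in zip(run, run[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def score(pipeline, probs):
    """Product of transition probabilities; 0 if a step pair was never seen."""
    p = 1.0
    for a, b in zip(pipeline, pipeline[1:]):
        p *= probs.get(a, {}).get(b, 0.0)
    return p

# "Good" executions observed so far
good_runs = [["load", "clean", "learn", "report"],
             ["load", "clean", "select", "learn"]]
probs = transition_probs(good_runs)
print(score(["load", "clean", "learn"], probs))  # seen in good runs -> 0.5
print(score(["learn", "load"], probs))           # never seen -> 0.0
```

Low-scoring pipelines are candidates for a recommender's red flag, just as Method 1 flags scripts whose parse disagrees with their documentation.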
8
Using “underlying shape” to implement privacy
• Goal:
– Share, but do not reveal, while maintaining signal
• CLIFF & MORPH [Menzies & Peters ’13]:
– Replace N rows with M prototypes (M << N)
• The N – M dropped individuals are now fully private
– Mutate prototypes up to, but not over, the mid-point between prototypes
• The remaining M individuals are now obfuscated
[Figure: data before and after obfuscation]
One of the few known privacy algorithms that does not harm data mining
Tim Menzies, Fayola Peters, Liang Gong, Hongyu Zhang: Balancing Privacy and Utility in Cross-Company Defect Prediction. IEEE Trans. Software Eng. 39(8): 1054-1068 (2013)
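The two bullets above can be sketched in a few lines: keep M prototype rows, then nudge each one toward (but never past the midpoint of) its nearest neighbor. Note the hedge: real CLIFF selects prototypes by a power-based ranking, whereas this sketch just samples them at random, so it only illustrates the shape of the algorithm:

```python
import numpy as np

def cliff_morph(rows, m=10, seed=0):
    """Illustrative sketch of the CLIFF & MORPH idea.

    CLIFF step (simplified to random sampling here): keep m prototypes.
    MORPH step: move each prototype a random fraction r < 0.5 of the way
    toward its nearest other prototype, so it never crosses the midpoint.
    """
    rng = np.random.default_rng(seed)
    protos = rows[rng.choice(len(rows), size=m, replace=False)]
    morphed = protos.copy()
    for i, p in enumerate(protos):
        d = np.linalg.norm(protos - p, axis=1)
        d[i] = np.inf                       # ignore distance to self
        q = protos[np.argmin(d)]            # nearest other prototype
        r = rng.uniform(0.15, 0.35)         # stays short of the 0.5 midpoint
        morphed[i] = p + r * (q - p)
    return morphed

data = np.random.default_rng(1).normal(size=(200, 4))
out = cliff_morph(data, m=10)
print(out.shape)  # (10, 4): 190 rows fully private, 10 obfuscated
```

Because every released row has been both subsampled and perturbed, no original individual appears verbatim, yet the cluster structure (the “underlying shape”) that defect predictors rely on is largely preserved.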
9
Using “underlying shape” for (1) trust, (2) anomaly detectors, (3) incremental repair
A: Clustering
Tim Menzies, Andrew Butcher, David R. Cok, Andrian Marcus, Lucas Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann: Local versus Global Lessons for Defect Prediction and Effort Estimation. IEEE Trans. Software Eng. 39(6): 822-834 (2013)
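Clustering gives a cheap trust test: if a new example falls far from every cluster of the training data, the model is extrapolating and should not be trusted there. A minimal sketch (plain k-means stands in here for whatever clusterer is actually used; the threshold is an illustrative assumption):

```python
import numpy as np

def fit_centroids(rows, k=3, iters=10, seed=0):
    """Plain k-means, as a stand-in clusterer for this sketch."""
    rng = np.random.default_rng(seed)
    cents = rows[rng.choice(len(rows), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(rows[:, None] - cents[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                cents[j] = rows[labels == j].mean(axis=0)
    return cents

def anomalous(row, cents, threshold):
    """Flag a row as outside training experience if it is far from
    every cluster centroid -- i.e., stop trusting the model there."""
    return np.min(np.linalg.norm(cents - row, axis=1)) > threshold

train = np.random.default_rng(2).normal(0, 1, (300, 2))
cents = fit_centroids(train, k=3)
print(anomalous(np.array([50.0, 50.0]), cents, threshold=5.0))  # far away -> True
print(anomalous(train[0], cents, threshold=5.0))                # in-distribution -> False
```

The same machinery supports incremental repair: when too many incoming rows trip the anomaly flag, only the affected clusters' local models need retraining, not the global model.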
10
Using “underlying shape” for next gen multi-objective optimizers?
A: (1) Cluster, then (2) envy, then (3) contrast
acap = n
| sced = n
| | stor = xh: _7
| | stor = n
| | | cplx = h
| | | | pcap = h: _10
| | | | pcap = n: _13
| | | cplx = n: _7
| | stor = vh: _11
| | stor = h: _11
| sced = l
| | $kloc <= 16.3: _9
| | $kloc > 16.3: _8
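The “(1) cluster, then (2) envy, then (3) contrast” loop can be sketched as follows. This is a toy stand-in, not the published algorithm: cluster labels are given as input, “envy” picks the nearest cluster with a better (lower) median score, and the “contrast” is just the single feature whose medians differ most between the two clusters:

```python
import numpy as np

def cluster_envy_contrast(X, y, labels):
    """For each cluster: find a nearby better-scoring cluster to envy,
    then report the feature that most separates the two (the contrast)."""
    out = {}
    ids = np.unique(labels)
    cents = {c: X[labels == c].mean(axis=0) for c in ids}
    for c in ids:
        here = np.median(y[labels == c])
        # candidate clusters to envy: closer-by centroids with lower median score
        better = [(np.linalg.norm(cents[c] - cents[o]), o) for o in ids
                  if o != c and np.median(y[labels == o]) < here]
        if not better:
            continue  # already best locally: nothing to envy
        _, envied = min(better)
        delta = np.abs(np.median(X[labels == c], axis=0)
                       - np.median(X[labels == envied], axis=0))
        out[c] = (envied, int(np.argmax(delta)))  # (whom to envy, feature to change)
    return out

# Two synthetic clusters; cluster 1 has the better (lower) score
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.hstack([np.full(50, 10.0), np.full(50, 2.0)])
labels = np.array([0] * 50 + [1] * 50)
print(cluster_envy_contrast(X, y, labels))  # cluster 0 envies cluster 1
```

Trees like the one above then summarize such contrasts: small, readable rules describing what to change to move from an envious cluster toward an envied one.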
11
BTW: “shape-based” optimization is much faster than standard MOEAs
Software process modeling
• Solid = last slide; dashed = NSGA-II [Deb02]
• Also, we find tiny trees summarizing the trade space
Cockpit software design
• Red line = using trees to find optimizations