
Tim Menzies, directions in Data Science


Page 1: Tim Menzies, directions in Data Science


Perspectives from “PROMISE”: lessons learned, issues raised

[email protected]

• 2002-2004: SE research chair, NASA
• 2004
– “Let’s start a repo for SE data”
– Data mining in SE
– Upload data used in papers
• 2013
– 10 PROMISE conferences
– 100s of data sets
– 1000s of papers
– Numerous journal special issues
– Me: 1st and 3rd most cited papers in top SE journals (since 2007, 2009)
• 2014: PROMISE repo moving to NC State: tera-PROMISE

Tim Menzies with Burak Turhan, Ayse Basar Bener, Justin S. Di Stefano: On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14(5): 540-578 (2009)

Tim Menzies, Jeremy Greenwald, Art Frank: Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Trans. Software Eng. 33(1): 2-13 (2007)

http://promisedata.googlecode.com

Page 2: Tim Menzies, directions in Data Science


Issues raised in my work on PROMISE (note: more than “just” SE)

• Learning for novel problems
– Transfer learning from prior experience

• Locality
– Trust (and when to stop trusting a model)
– Anomaly detectors
– Incremental repair

• Privacy (by exploiting locality)

• Goal-oriented analysis (next gen multi-objective optimizers)

• Human skill amplifiers
– Data analysis patterns
– Anti-patterns in data science

Page 3: Tim Menzies, directions in Data Science


Q: Are there general results in SE?
A: Hell, yeah (Turkish toasters to NASA rockets)

Tim Menzies with Burak Turhan, Ayse Basar Bener, Justin S. Di Stefano: On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14(5): 540-578 (2009)

Page 4: Tim Menzies, directions in Data Science


Lesson: reflect LESS on raw dimensions and more on the underlying “shape” of the data

Transfer learning:

Across time

Across space

Tim Menzies with Ekrem Kocaguneli, Emilia Mendes: Transfer Learning in Effort Estimation. Empirical Software Engineering (2014) 1-31
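One concrete reading of “transfer learning across space” is a relevancy filter over cross-project data: rather than training on every foreign row, keep only the rows that sit near the local project in feature space. Below is a minimal sketch of that idea, not the exact method from the papers cited on these slides; the k=10 neighborhood size and the z-scoring step are illustrative assumptions.

import numpy as np

def relevancy_filter(local_X, foreign_X, foreign_y, k=10):
    # Keep only the foreign rows that are among the k nearest neighbors
    # (Euclidean distance on z-scored features) of at least one local row.
    # local_X, foreign_X: 2-D numpy arrays with the same columns;
    # foreign_y: labels (e.g. defective / not) for the foreign rows.
    mu = foreign_X.mean(axis=0)
    sigma = foreign_X.std(axis=0) + 1e-12
    fX = (foreign_X - mu) / sigma
    lX = (local_X - mu) / sigma
    keep = set()
    for row in lX:
        dists = np.linalg.norm(fX - row, axis=1)
        keep.update(np.argsort(dists)[:k].tolist())
    idx = sorted(keep)
    return foreign_X[idx], foreign_y[idx]

# Usage: train any defect predictor on the filtered rows instead of on
# all cross-company data, then test it on the local project.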

Page 5: Tim Menzies, directions in Data Science


PROMISE, lessons learned: Need to study the analysts more

• Lesson from ten years of PROMISE:
– Different teams
– Similar data
– Similar learners
– Wildly varying results

• Variance less due to
– Choice of learner
– Choice of data set
– Choice of selected features

• Variance most due to teams

• “It doesn’t matter what you do but it does matter who does it.”

Martin Shepperd, David Bowes, and Tracy Hall: Researcher Bias: The Use of Machine Learning in Software Defect Prediction. IEEE Trans. Software Eng. 40(6), June 2014, p. 603

Page 6: Tim Menzies, directions in Data Science


Maybe there is more to data science than “I applied a learner to some data”.

For example: here’s the code I wrote in my latest paper

Page 7: Tim Menzies, directions in Data Science


Data analysis patterns

• Recommender systems for data mining scripts
• Method 1:
– Text mining on documentation
– Lightweight parsing of the scripts
– Raise a red flag if clusters of terms in the parse differ from the documentation
• Method 2: track script executions (see the sketch after this list)
– Learn “good” and “bad” combinations of tools
– “Good” = high accuracy, less CPU, less memory, less power
– Combinations = Markov chains of “this used before that”
• Case studies:
– ABB, software engineering analytics
– HPCC ECL, funded by LexisNexis
– Climate modeling in “R” (with Nagiza Samatova)
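As a concrete illustration of Method 2, the sketch below treats logged tool sequences as a first-order Markov chain: count which tool follows which in the “good” runs, then red-flag transitions in a new script that good runs rarely made. All specifics here (the toy tool names, the 0.05 threshold) are illustrative assumptions, not the deployed recommender.

from collections import Counter, defaultdict

def transition_probs(runs):
    # runs: list of tool-name sequences from logged "good" executions.
    # Returns P(next_tool | tool) as nested dicts (a first-order Markov chain).
    counts = defaultdict(Counter)
    for run in runs:
        for a, b in zip(run, run[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

def red_flags(script, good_probs, threshold=0.05):
    # Flag tool pairs in a new script that were rare (or never seen)
    # in the good runs; the 0.05 cut-off is an illustrative choice.
    return [(a, b) for a, b in zip(script, script[1:])
            if good_probs.get(a, {}).get(b, 0.0) < threshold]

good_runs = [["load", "clean", "cluster", "report"],
             ["load", "clean", "learn", "report"]]
print(red_flags(["load", "cluster", "clean"], transition_probs(good_runs)))
# -> [('load', 'cluster'), ('cluster', 'clean')]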

Page 8: Tim Menzies, directions in Data Science


Using “underlying shape” to implement privacy

• Goal:
– Share, but do not reveal, while maintaining signal
• CLIFF & MORPH [Menzies & Peters ’13]:
– Replace N rows with M prototypes (M << N); the N – M dropped individuals are now fully private
– Mutate the prototypes up to, but not over, the mid-point between prototypes; the remaining M individuals are now obfuscated

[Figure: data before vs. after obfuscation]
One of the few known privacy algorithms that does not harm data mining

Tim Menzies, Fayola Peters, Liang Gong, Hongyu Zhang: Balancing Privacy and Utility in Cross-Company Defect Prediction. IEEE Trans. Software Eng. 39(8): 1054-1068 (2013)
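A rough sketch of the MORPH-style mutation step described above: push each kept prototype a random fraction of the way toward its nearest neighbor of a different class, so it never reaches the mid-point. This is my paraphrase of the idea, not the authors’ released code; the 0.15-0.35 mutation range is an illustrative assumption.

import numpy as np

def morph(X, y, lo=0.15, hi=0.35, seed=1):
    # Obfuscate each kept prototype by moving it a random fraction
    # (between lo and hi, so always short of the mid-point) toward its
    # nearest neighbor with a different class label.
    # X: 2-D numpy array of numeric attributes; y: numpy array of labels
    # containing at least two classes.
    rng = np.random.default_rng(seed)
    out = X.astype(float).copy()
    for i, (row, label) in enumerate(zip(X, y)):
        unlike = X[y != label]
        nearest = unlike[np.argmin(np.linalg.norm(unlike - row, axis=1))]
        r = rng.uniform(lo, hi)
        out[i] = row + r * (nearest - row)
    return out

# Usage: apply to the M prototypes kept by the CLIFF-style pruning step,
# then share the mutated rows instead of the originals.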

Page 9: Tim Menzies, directions in Data Science


Using “underlying shape” for (1) trust, (2) anomaly detectors, (3) incremental repair
A: Clustering

Tim Menzies, Andrew Butcher, David R. Cok, Andrian Marcus, Lucas Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann: Local versus Global Lessons for Defect Prediction and Effort Estimation. IEEE Trans. Software Eng. 39(6): 822-834 (2013)
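One way to read “clustering for trust and anomaly detection”: cluster the training data, record how far typical training rows sit from their nearest centroid, and refuse to trust a model’s prediction for new rows that fall outside that radius. A minimal sketch; the choice of k-means, k=8, and the 95th-percentile cut-off are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def fit_trust_region(train_X, k=8, percentile=95):
    # Cluster the training data and record a "normal" radius: the 95th
    # percentile of each training row's distance to its nearest centroid.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_X)
    nearest = km.transform(train_X).min(axis=1)
    return km, np.percentile(nearest, percentile)

def out_of_scope(km, radius, new_X):
    # True for rows that fall outside every cluster's normal radius,
    # i.e. rows where a model trained on train_X should not be trusted
    # and which are worth reporting as anomalies.
    return km.transform(new_X).min(axis=1) > radius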

Page 10: Tim Menzies, directions in Data Science


Using “underlying shape” for next gen multi-objective optimizers?

A: (1) Cluster, then (2) envy, then (3) contrast

acap = n
| sced = n
| | stor = xh: _7
| | stor = n
| | | cplx = h
| | | | pcap = h: _10
| | | | pcap = n: _13
| | | cplx = n: _7
| | stor = vh: _11
| | stor = h: _11
| sced = l
| | $kloc <= 16.3: _9
| | $kloc > 16.3: _8

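One possible reading of the cluster/envy/contrast recipe behind trees like the one above: cluster the rows, let each cluster “envy” the nearby cluster with the best (lowest) median score, and learn a tiny tree contrasting the two, whose branches suggest what to change. Everything below (k-means with k=6, scoring by a single numeric column, depth-3 trees) is an illustrative assumption, not the method from the cited papers.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def envy_contrast(X, score, k=6, depth=3):
    # (1) Cluster the rows. (2) Each cluster "envies" the nearest other
    # cluster with a better (lower) median score. (3) Learn a tiny tree
    # contrasting the two clusters; its branches suggest what to change.
    # X: 2-D numpy array of features; score: numpy array, lower is better.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    medians = np.array([np.median(score[labels == c]) for c in range(k)])
    rules = {}
    for c in range(k):
        order = np.argsort(np.linalg.norm(centers - centers[c], axis=1))
        better = [o for o in order if o != c and medians[o] < medians[c]]
        if not better:
            continue  # this cluster envies no one
        envied = better[0]
        rows = np.isin(labels, [c, envied])
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X[rows], (labels[rows] == envied).astype(int))
        rules[c] = export_text(tree)  # branches ending in class 1 = "how to move"
    return rules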

Page 11: Tim Menzies, directions in Data Science


BTW: “Shape-based” optimization is much faster than standard MOEAs

Software process modeling:
• Solid = last slide; dashed = NSGA-II [Deb02]
• Also, we find tiny trees summarizing the trade space

Cockpit software design:
• Red line = using trees to find optimizations