
Tim Menzies, directions in Data Science


Page 1: Tim Menzies, directions in Data Science


Perspectives from “PROMISE”: lessons learned, issues raised

[email protected]

• 2002-2004: SE research chair, NASA
• 2004
– “Let’s start a repo for SE data”
– Data mining in SE
– Upload data used in papers
• 2013
– 10 PROMISE conferences
– 100s of data sets
– 1000s of papers
– Numerous journal special issues
– Me: 1st and 3rd most cited papers in top SE journals (since 2007, 2009)
• 2014: PROMISE repo moving to NC State: tera-PROMISE

Tim Menzies with Burak Turhan, Ayse Basar Bener, Justin S. Di Stefano: On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14(5): 540-578 (2009)

Tim Menzies, Jeremy Greenwald, Art Frank: Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Trans. Software Eng. 33(1): 2-13 (2007)

http://promisedata.googlecode.com

Page 2: Tim Menzies, directions in Data Science


Issues raised in my work on PROMISE (note: more than “just” SE)

• Learning for novel problems
– Transfer learning from prior experience

• Locality
– Trust (and when to stop trusting a model)
– Anomaly detectors
– Incremental repair

• Privacy (by exploiting locality)

• Goal-oriented analysis (next gen multi-objective optimizers)

• Human skill amplifiers
– Data analysis patterns
– Anti-patterns in data science

Page 3: Tim Menzies, directions in Data Science


Q: Are there general results in SE?
A: Hell, yeah (Turkish toasters to NASA rockets)

Tim Menzies with Burak Turhan, Ayse Basar Bener, Justin S. Di Stefano: On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14(5): 540-578 (2009)

Page 4: Tim Menzies, directions in Data Science


Lesson: reflect LESS on raw dimensions and more on the underlying “shape” of the data

Transfer learning:

Across time

Across space

Tim Menzies with Ekrem Kocaguneli, Emilia Mendes: Transfer Learning in Effort Estimation. Empirical Software Engineering (2014) 1-31
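One concrete reading of “transfer learning across space” is a relevancy filter over cross-project data: rather than training on every foreign row, keep only the rows that sit near the local project in feature space. Below is a minimal sketch of that idea, not the exact method from the papers cited on these slides; the k=10 neighborhood size and the z-scoring step are illustrative assumptions.

import numpy as np

def relevancy_filter(local_X, foreign_X, foreign_y, k=10):
    # Keep only the foreign rows that are among the k nearest neighbors
    # (Euclidean distance on z-scored features) of at least one local row.
    # local_X, foreign_X: 2-D numpy arrays with the same columns;
    # foreign_y: labels (e.g. defective / not) for the foreign rows.
    mu = foreign_X.mean(axis=0)
    sigma = foreign_X.std(axis=0) + 1e-12
    fX = (foreign_X - mu) / sigma
    lX = (local_X - mu) / sigma
    keep = set()
    for row in lX:
        dists = np.linalg.norm(fX - row, axis=1)
        keep.update(np.argsort(dists)[:k].tolist())
    idx = sorted(keep)
    return foreign_X[idx], foreign_y[idx]

# Usage: train any defect predictor on the filtered rows instead of on
# all cross-company data, then test it on the local project.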

Page 5: Tim Menzies, directions in Data Science


PROMISE, lessons learned: Need to study the analysts more

• Lesson from ten years of PROMISE:
– Different teams
– Similar data
– Similar learners
– Wildly varying results

• Variance less due to
– Choice of learner
– Choice of data set
– Choice of selected features

• Variance most due to teams

• “It doesn’t matter what you do but it does matter who does it.”

Martin Shepperd, David Bowes, and Tracy Hall: Researcher Bias: The Use of Machine Learning in Software Defect Prediction. IEEE Trans. Software Eng. 40(6), June 2014, p. 603

Page 6: Tim Menzies, directions in Data Science


Maybe there is more to data science than “I applied a learner to some data”.

For example: here’s the code I wrote in my latest paper

Page 7: Tim Menzies, directions in Data Science


Data analysis patterns

• Recommender systems for data mining scripts
• Method 1:
– Text mining on documentation
– Lightweight parsing of the scripts
– Raise a red flag if clusters of terms in the parse differ from the documentation
• Method 2: track script executions (see the sketch after this list)
– Learn “good” and “bad” combinations of tools
– “Good” = high accuracy, less CPU, less memory, less power
– Combinations = Markov chains of “this used before that”
• Case studies:
– ABB, software engineering analytics
– HPCC ECL, funded by LexisNexis
– Climate modeling in “R” (with Nagiza Samatova)
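As a concrete illustration of Method 2, the sketch below treats logged tool sequences as a first-order Markov chain: count which tool follows which in the “good” runs, then red-flag transitions in a new script that good runs rarely made. All specifics here (the toy tool names, the 0.05 threshold) are illustrative assumptions, not the deployed recommender.

from collections import Counter, defaultdict

def transition_probs(runs):
    # runs: list of tool-name sequences from logged "good" executions.
    # Returns P(next_tool | tool) as nested dicts (a first-order Markov chain).
    counts = defaultdict(Counter)
    for run in runs:
        for a, b in zip(run, run[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

def red_flags(script, good_probs, threshold=0.05):
    # Flag tool pairs in a new script that were rare (or never seen)
    # in the good runs; the 0.05 cut-off is an illustrative choice.
    return [(a, b) for a, b in zip(script, script[1:])
            if good_probs.get(a, {}).get(b, 0.0) < threshold]

good_runs = [["load", "clean", "cluster", "report"],
             ["load", "clean", "learn", "report"]]
print(red_flags(["load", "cluster", "clean"], transition_probs(good_runs)))
# -> [('load', 'cluster'), ('cluster', 'clean')]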

Page 8: Tim Menzies, directions in Data Science


Using “underlying shape” to implement privacy

• Goal:
– Share, but do not reveal, while maintaining signal
• CLIFF & MORPH [Menzies & Peters ’13]:
– Replace N rows with M prototypes (M << N); the N – M dropped individuals are now fully private
– Mutate the prototypes up to, but not over, the mid-point between prototypes; the remaining M individuals are now obfuscated

[Figure: data before vs. after obfuscation]
One of the few known privacy algorithms that does not harm data mining

Tim Menzies, Fayola Peters, Liang Gong, Hongyu Zhang: Balancing Privacy and Utility in Cross-Company Defect Prediction. IEEE Trans. Software Eng. 39(8): 1054-1068 (2013)
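A rough sketch of the MORPH-style mutation step described above: push each kept prototype a random fraction of the way toward its nearest neighbor of a different class, so it never reaches the mid-point. This is my paraphrase of the idea, not the authors’ released code; the 0.15-0.35 mutation range is an illustrative assumption.

import numpy as np

def morph(X, y, lo=0.15, hi=0.35, seed=1):
    # Obfuscate each kept prototype by moving it a random fraction
    # (between lo and hi, so always short of the mid-point) toward its
    # nearest neighbor with a different class label.
    # X: 2-D numpy array of numeric attributes; y: numpy array of labels
    # containing at least two classes.
    rng = np.random.default_rng(seed)
    out = X.astype(float).copy()
    for i, (row, label) in enumerate(zip(X, y)):
        unlike = X[y != label]
        nearest = unlike[np.argmin(np.linalg.norm(unlike - row, axis=1))]
        r = rng.uniform(lo, hi)
        out[i] = row + r * (nearest - row)
    return out

# Usage: apply to the M prototypes kept by the CLIFF-style pruning step,
# then share the mutated rows instead of the originals.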

Page 9: Tim Menzies, directions in Data Science


Using “underlying shape” for (1) trust, (2) anomaly detectors, (3) incremental repair
A: Clustering

Tim Menzies, Andrew Butcher, David R. Cok, Andrian Marcus, Lucas Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann: Local versus Global Lessons for Defect Prediction and Effort Estimation. IEEE Trans. Software Eng. 39(6): 822-834 (2013)
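One way to read “clustering for trust and anomaly detection”: cluster the training data, record how far typical training rows sit from their nearest centroid, and refuse to trust a model’s prediction for new rows that fall outside that radius. A minimal sketch; the choice of k-means, k=8, and the 95th-percentile cut-off are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def fit_trust_region(train_X, k=8, percentile=95):
    # Cluster the training data and record a "normal" radius: the 95th
    # percentile of each training row's distance to its nearest centroid.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_X)
    nearest = km.transform(train_X).min(axis=1)
    return km, np.percentile(nearest, percentile)

def out_of_scope(km, radius, new_X):
    # True for rows that fall outside every cluster's normal radius,
    # i.e. rows where a model trained on train_X should not be trusted
    # and which are worth reporting as anomalies.
    return km.transform(new_X).min(axis=1) > radius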

Page 10: Tim Menzies, directions in Data Science


Using “underlying shape” for next gen multi-objective optimizers?

A: (1) Cluster, then (2) envy, then (3) contrast

acap = n
| sced = n
| | stor = xh: _7
| | stor = n
| | | cplx = h
| | | | pcap = h: _10
| | | | pcap = n: _13
| | | cplx = n: _7
| | stor = vh: _11
| | stor = h: _11
| sced = l
| | $kloc <= 16.3: _9
| | $kloc > 16.3: _8

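One possible reading of the cluster/envy/contrast recipe behind trees like the one above: cluster the rows, let each cluster “envy” the nearby cluster with the best (lowest) median score, and learn a tiny tree contrasting the two, whose branches suggest what to change. Everything below (k-means with k=6, scoring by a single numeric column, depth-3 trees) is an illustrative assumption, not the method from the cited papers.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def envy_contrast(X, score, k=6, depth=3):
    # (1) Cluster the rows. (2) Each cluster "envies" the nearest other
    # cluster with a better (lower) median score. (3) Learn a tiny tree
    # contrasting the two clusters; its branches suggest what to change.
    # X: 2-D numpy array of features; score: numpy array, lower is better.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    medians = np.array([np.median(score[labels == c]) for c in range(k)])
    rules = {}
    for c in range(k):
        order = np.argsort(np.linalg.norm(centers - centers[c], axis=1))
        better = [o for o in order if o != c and medians[o] < medians[c]]
        if not better:
            continue  # this cluster envies no one
        envied = better[0]
        rows = np.isin(labels, [c, envied])
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X[rows], (labels[rows] == envied).astype(int))
        rules[c] = export_text(tree)  # branches ending in class 1 = "how to move"
    return rules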

Page 11: Tim Menzies, directions in Data Science


BTW: “Shape-based” optimization is much faster than standard MOEAs

Software process modeling:
• Solid = last slide; dashed = NSGA-II [Deb02]
• Also, we find tiny trees summarizing the trade space

Cockpit software design:
• Red line = using trees to find optimizations