24
Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Applied Machine Learning in Software Engineering © 2017 Carnegie Mellon University [Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution. Applied Machine Learning in Software Engineering Eliezer Kanal

Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

1Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

1

Software Solutions Symposium 2017

Software Engineering InstituteCarnegie Mellon UniversityPittsburgh, PA 15213

Applied Machine Learning in Software Engineering© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

Applied Machine Learning in Software EngineeringEliezer Kanal

Page 2: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

2Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

2

Software Solutions Symposium 2017

Copyright 2017 Carnegie Mellon University

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at [email protected].

Carnegie Mellon® and CERT® are registered marks of Carnegie Mellon University.

DM-0004563

Page 3: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

3Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

3

Software Solutions Symposium 2017

Tom Mitchell, former CMU Machine Learning department chair:

The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”

Machine Learning seeks to automate data analysis and inference.

What is Machine Learning?

Page 4: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

4Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

4

Software Solutions Symposium 2017Picture

(optional)What is Machine Learning?

If your problem can be stated as either of the following:

…you would likely benefit from machine learning.

I would like to use _____datato predict _____.I would like to use ____ data

to guess what ____ is.

Page 5: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

5Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

5

Software Solutions Symposium 2017

What is Machine Learning?

Sample Techniques:

• Regression

• K-Means Clustering

Clustering image: Weston.pace, https://commons.wikimedia.org/wiki/File:K_Means_Example_Step_4.svg

Page 6: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

6Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

6

Software Solutions Symposium 2017

What is Machine Learning?

Feature Engineering:Using existing data to create more informative data

Data Types

Image StaticVideo

Time series Financial dataEvent counts

Structured text Web formsStructured data

(JSON, XML)Source code

Free text NewsTweetsEmail

many more…

Page 7: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

7Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

7

Software Solutions Symposium 2017Picture

(optional)What is Machine Learning?

Examples:

o I would like to use incident ticket data to predictcustomer needs .

o I would like to use publicly available code to predict what code I will write .

o I would like to use bug report data to guess the location of undetected bugs in my code .

Page 8: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

8Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

8

Software Solutions Symposium 2017Picture

(optional)

Page 9: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

9Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

9

Software Solutions Symposium 2017Picture

(optional)What is Machine Learning?

Examples:

o I would like to use incident ticket data to predictcustomer needs .

o I would like to use publicly available code to predict what code I will write .

o I would like to use bug report data to guess the location of undetected bugs in my code .

Page 10: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

10Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

10

Software Solutions Symposium 2017

Title of the Presentation Goes Here© 2017 Carnegie Mellon University

[DISTRIBUTION STATEMENT Please copy and paste the appropriate distribution statement into this space.]

Applied ML:Vulnerability Detection

Page 11: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

11Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

11

Software Solutions Symposium 2017

Applied ML: Vulnerability Detection

Analyzer

Analyzer

Analyzer

Codebases

Alerts Today

3,147

11,772

48,690

0

10,000

20,000

30,000

40,000

50,000

60,000

TP FP Susp

Many alerts left unaudited!

Page 12: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

12Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

12

Software Solutions Symposium 2017

Applied ML: Vulnerability Detection

Analyzer

Analyzer

Analyzer

Codebases

Alerts Today

Our Goal

3,147

11,772

48,690

0

10,000

20,000

30,000

40,000

50,000

60,000

TP FP Susp

66 effort days

12,076

45,172

6,361

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

40,000

45,000

50,000

e-TP e-FP I

Automated Statistical Classifier

• Expected True Positive (e-TP)• Expected False Positive (e-FP) • Indeterminate (I)

Page 13: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

13Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

13

Software Solutions Symposium 2017

Applied ML: Vulnerability Detection

Classifiers

Lasso Logistic Regression

CART

Random Forest

Extreme Gradient Boosting (XGBoost)

Some of the features used

Analysis tools used Tokens in func/method

Significant LOC Alerts in func/method

Complexity Alerts in file

Coupling Methods in file

Cohesion SLOC in file

SEI coding rule Avg Tokens

Function/method length Avg SLOC

SLOC in func/method Depth in code repository

# parameters in func/meth.

Cyclomatic complexity (func/meth)

Page 14: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

14Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

14

Software Solutions Symposium 2017

Applied ML: Vulnerability Detection

Significant improvement!

• 91% Classifier accuracy overall

• Specific rule accuracy at right

• 10x developer time saved!

Rule ID

INT31-CEXP01-JOBJ03-JFIO04-J*EXP33-C*EXP34-C*DCL36-C*ERR08-J*IDS00-J*ERR01-J*ERR09-J*

* Small quantity of data

XGBoost97%74%83%80%83%72%

100%100%96%

100%88%

Page 15: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

15Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

15

Software Solutions Symposium 2017

Title of the Presentation Goes Here© 2017 Carnegie Mellon University

[DISTRIBUTION STATEMENT Please copy and paste the appropriate distribution statement into this space.]

Applied ML:Malware family classification

Page 16: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

16Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

16

Software Solutions Symposium 2017

Applied ML: Malware family classification

Reverse Engineering

Discovery

Refinement

Reflection

File

New Family

Artifact Catalog

Signature 1

Signature 2

Signature 3…

Files 1a 1b 1c 1d …Files 2a 2b 2c 2d …Files 3a 3b 3c 3d …

Which files hang

together?

Page 17: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

18Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

18

Software Solutions Symposium 2017

Applied ML: Malware family classification

Simplify visualization of extremely complex data through the use of dimensionality reduction and associated visualization techniques

Chernoff face experiment

t-SNE

Page 18: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

19Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

19

Software Solutions Symposium 2017

Applied ML: Malware family classification• Discovery found 23 files• Manual reverse

engineering was slow: only two files in 5 days

• Visualizing files with Chernoff facesimmediately suggests groups of related files.

• Analysis burden shifts from forming candidate groups to verifying groups

• Faster and cheaper = happy clients

Page 19: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

20Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

20

Software Solutions Symposium 2017

Title of the Presentation Goes Here© 2017 Carnegie Mellon University

[DISTRIBUTION STATEMENT Please copy and paste the appropriate distribution statement into this space.]

Applied ML:Software cost estimation

Robert Ferguson, Dennis Goldenson, James McCurley, Robert W. Stoddard, David Zubrow, Debra Anderson. “Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE)”. Dec 2011. http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=10039

Page 20: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

21Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

21

Software Solutions Symposium 2017

Applied ML: Software cost estimation

General Accounting Office. Defense Acquisitions: A Knowledge-Based Funding Approach Could Improve Major Weapon System Program Outcomes. Report to the Committee on Armed Services, U.S. Senate, July 2008, GAO-08-619.

Page 21: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

22Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

22

Software Solutions Symposium 2017

Applied ML: Software cost estimation

Page 22: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

23Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

23

Software Solutions Symposium 2017

Applied ML: Software cost estimation

Page 23: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

24Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

24

Software Solutions Symposium 2017

Applied ML: Software cost estimation

Page 24: Applied Machine Learning in Software Engineering...The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience,

25Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

25

Software Solutions Symposium 2017

Contact Information

Eliezer KanalTechnical ManagerTelephone: +1 412.268.5204Email: [email protected]