Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
1Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
1
Software Solutions Symposium 2017
Software Engineering InstituteCarnegie Mellon UniversityPittsburgh, PA 15213
Applied Machine Learning in Software Engineering© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
Applied Machine Learning in Software EngineeringEliezer Kanal
2Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
2
Software Solutions Symposium 2017
Copyright 2017 Carnegie Mellon University
This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.
NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at [email protected].
Carnegie Mellon® and CERT® are registered marks of Carnegie Mellon University.
DM-0004563
3Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
3
Software Solutions Symposium 2017
Tom Mitchell, former CMU Machine Learning department chair:
The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”
Machine Learning seeks to automate data analysis and inference.
What is Machine Learning?
4Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
4
Software Solutions Symposium 2017Picture
(optional)What is Machine Learning?
If your problem can be stated as either of the following:
…you would likely benefit from machine learning.
I would like to use _____datato predict _____.I would like to use ____ data
to guess what ____ is.
5Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
5
Software Solutions Symposium 2017
What is Machine Learning?
Sample Techniques:
• Regression
• K-Means Clustering
Clustering image: Weston.pace, https://commons.wikimedia.org/wiki/File:K_Means_Example_Step_4.svg
6Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
6
Software Solutions Symposium 2017
What is Machine Learning?
Feature Engineering:Using existing data to create more informative data
Data Types
Image StaticVideo
Time series Financial dataEvent counts
Structured text Web formsStructured data
(JSON, XML)Source code
Free text NewsTweetsEmail
many more…
7Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
7
Software Solutions Symposium 2017Picture
(optional)What is Machine Learning?
Examples:
o I would like to use incident ticket data to predictcustomer needs .
o I would like to use publicly available code to predict what code I will write .
o I would like to use bug report data to guess the location of undetected bugs in my code .
8Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
8
Software Solutions Symposium 2017Picture
(optional)
9Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
9
Software Solutions Symposium 2017Picture
(optional)What is Machine Learning?
Examples:
o I would like to use incident ticket data to predictcustomer needs .
o I would like to use publicly available code to predict what code I will write .
o I would like to use bug report data to guess the location of undetected bugs in my code .
10Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
10
Software Solutions Symposium 2017
Title of the Presentation Goes Here© 2017 Carnegie Mellon University
[DISTRIBUTION STATEMENT Please copy and paste the appropriate distribution statement into this space.]
Applied ML:Vulnerability Detection
11Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
11
Software Solutions Symposium 2017
Applied ML: Vulnerability Detection
Analyzer
Analyzer
Analyzer
Codebases
Alerts Today
3,147
11,772
48,690
0
10,000
20,000
30,000
40,000
50,000
60,000
TP FP Susp
Many alerts left unaudited!
12Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
12
Software Solutions Symposium 2017
Applied ML: Vulnerability Detection
Analyzer
Analyzer
Analyzer
Codebases
Alerts Today
Our Goal
3,147
11,772
48,690
0
10,000
20,000
30,000
40,000
50,000
60,000
TP FP Susp
66 effort days
12,076
45,172
6,361
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
40,000
45,000
50,000
e-TP e-FP I
Automated Statistical Classifier
• Expected True Positive (e-TP)• Expected False Positive (e-FP) • Indeterminate (I)
13Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
13
Software Solutions Symposium 2017
Applied ML: Vulnerability Detection
Classifiers
Lasso Logistic Regression
CART
Random Forest
Extreme Gradient Boosting (XGBoost)
Some of the features used
Analysis tools used Tokens in func/method
Significant LOC Alerts in func/method
Complexity Alerts in file
Coupling Methods in file
Cohesion SLOC in file
SEI coding rule Avg Tokens
Function/method length Avg SLOC
SLOC in func/method Depth in code repository
# parameters in func/meth.
Cyclomatic complexity (func/meth)
14Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
14
Software Solutions Symposium 2017
Applied ML: Vulnerability Detection
Significant improvement!
• 91% Classifier accuracy overall
• Specific rule accuracy at right
• 10x developer time saved!
Rule ID
INT31-CEXP01-JOBJ03-JFIO04-J*EXP33-C*EXP34-C*DCL36-C*ERR08-J*IDS00-J*ERR01-J*ERR09-J*
* Small quantity of data
XGBoost97%74%83%80%83%72%
100%100%96%
100%88%
15Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
15
Software Solutions Symposium 2017
Title of the Presentation Goes Here© 2017 Carnegie Mellon University
[DISTRIBUTION STATEMENT Please copy and paste the appropriate distribution statement into this space.]
Applied ML:Malware family classification
16Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
16
Software Solutions Symposium 2017
Applied ML: Malware family classification
Reverse Engineering
Discovery
Refinement
Reflection
File
New Family
Artifact Catalog
Signature 1
Signature 2
Signature 3…
Files 1a 1b 1c 1d …Files 2a 2b 2c 2d …Files 3a 3b 3c 3d …
…
Which files hang
together?
18Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
18
Software Solutions Symposium 2017
Applied ML: Malware family classification
Simplify visualization of extremely complex data through the use of dimensionality reduction and associated visualization techniques
Chernoff face experiment
t-SNE
19Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
19
Software Solutions Symposium 2017
Applied ML: Malware family classification• Discovery found 23 files• Manual reverse
engineering was slow: only two files in 5 days
• Visualizing files with Chernoff facesimmediately suggests groups of related files.
• Analysis burden shifts from forming candidate groups to verifying groups
• Faster and cheaper = happy clients
20Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
20
Software Solutions Symposium 2017
Title of the Presentation Goes Here© 2017 Carnegie Mellon University
[DISTRIBUTION STATEMENT Please copy and paste the appropriate distribution statement into this space.]
Applied ML:Software cost estimation
Robert Ferguson, Dennis Goldenson, James McCurley, Robert W. Stoddard, David Zubrow, Debra Anderson. “Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE)”. Dec 2011. http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=10039
21Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
21
Software Solutions Symposium 2017
Applied ML: Software cost estimation
General Accounting Office. Defense Acquisitions: A Knowledge-Based Funding Approach Could Improve Major Weapon System Program Outcomes. Report to the Committee on Armed Services, U.S. Senate, July 2008, GAO-08-619.
22Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
22
Software Solutions Symposium 2017
Applied ML: Software cost estimation
23Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
23
Software Solutions Symposium 2017
Applied ML: Software cost estimation
24Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
24
Software Solutions Symposium 2017
Applied ML: Software cost estimation
25Applied Machine Learning in Software EngineeringMarch 20–23, 2017© 2017 Carnegie Mellon University
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
25
Software Solutions Symposium 2017
Contact Information
Eliezer KanalTechnical ManagerTelephone: +1 412.268.5204Email: [email protected]