Source code abstracts classification using CNN Vadim Markovtsev, source{d}

"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer - Machine Learning Team at Source {d}



Source code abstracts classification using CNN

Vadim Markovtsev, source{d}


goo.gl/sd7wsm

(view this on your device)

Plan

1. Motivation

2. Source code feature engineering

3. The Network

4. Results

5. Other work


Motivation

“Everything is better with clusters.”


Motivation

Customers buy goods, and software developers write code.


Motivation

So to understand the latter, we need to understand what they do and how they do it. Feature origins:

• Social networks

• Version control statistics

• History

• Style

• Source code

• Algorithms

• Dependency graph

• Style


Motivation

Let's check how deep we can drill with source code style ML.

Toy task: binary classification between 2 projects, using only data that originates from code style.


Feature engineering

Requirements:

1. Ignore text files, Markdown, etc.

2. Ignore autogenerated files

3. Support many languages with minimal effort

4. Include as much information about the source code as possible


Feature engineering

(1) and (2) are solved by github/linguist and source{d}'s own tool

• Used by GitHub for language bars

• Supports 400+ languages


Feature engineering

(3) and (4) are solved by Pygments

• Highlights source code (tokenizer)

• Supports 400+ languages (though only 50% intersects with github/linguist)

• ≈90 token types (not all are used for every language)


Feature engineering

Pygments example:

    # prints "Hello, World!"
    if True:
        print("Hello, World!")


Feature engineering

    Token.Comment.Single '# prints "Hello, World!"'
    Token.Text '\n'
    Token.Keyword 'if'
    Token.Text ' '
    Token.Name.Builtin.Pseudo 'True'
    Token.Punctuation ':'
    Token.Text '\n'
    Token.Text ' '
    Token.Keyword 'print'
    Token.Punctuation '('
    Token.Literal.String.Double '"'
    Token.Literal.String.Double 'Hello, World!'
    Token.Literal.String.Double '"'
    Token.Punctuation ')'
    Token.Text '\n'


Feature engineering

• Split stream into lines, each line contains ≤40 tokens

• Merge indents

• "One against all" with value length

• Some tokens occupy more than 1 dimension, e.g. Token.Name reflects naming style

• About 200 dimensions overall

• 8000 features per line, most are zeros

• Mean-dispersion normalization
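
The first three steps above can be sketched in plain Python. This is only an illustration under assumptions, not the pipeline used in the talk: the helper names (`split_into_lines`, `merge_indent`, `encode_line`) are hypothetical, lines longer than 40 tokens are simply truncated, "merge indents" is read as joining leading whitespace tokens into one, and "one against all with value length" is read as a one-hot slot that stores the length of the token value. The vocabulary here has only the 6 token types of the example; with about 200 dimensions the same layout gives 40 × 200 = 8000 features per line.

```python
MAX_TOKENS_PER_LINE = 40  # limit stated on the slide

# Token stream from the Pygments example slide, as (token type, value) pairs.
TOKENS = [
    ("Token.Comment.Single", '# prints "Hello, World!"'),
    ("Token.Text", "\n"),
    ("Token.Keyword", "if"),
    ("Token.Text", " "),
    ("Token.Name.Builtin.Pseudo", "True"),
    ("Token.Punctuation", ":"),
    ("Token.Text", "\n"),
    ("Token.Text", " "),
    ("Token.Keyword", "print"),
    ("Token.Punctuation", "("),
    ("Token.Literal.String.Double", '"'),
    ("Token.Literal.String.Double", "Hello, World!"),
    ("Token.Literal.String.Double", '"'),
    ("Token.Punctuation", ")"),
    ("Token.Text", "\n"),
]


def split_into_lines(tokens):
    """Cut the flat stream at newline tokens; the newline itself is dropped."""
    lines, current = [], []
    for token_type, value in tokens:
        if value == "\n":
            lines.append(current)
            current = []
        else:
            current.append((token_type, value))
    if current:
        lines.append(current)
    return lines


def merge_indent(line):
    """Assumed meaning of 'merge indents': join leading whitespace tokens."""
    indent = ""
    rest = list(line)
    while rest and rest[0][1].strip() == "":
        indent += rest.pop(0)[1]
    return ([("Token.Text", indent)] if indent else []) + rest


def encode_line(line, vocabulary):
    """One slot per (position, token type); the hot slot stores len(value)."""
    width = len(vocabulary)
    features = [0] * (MAX_TOKENS_PER_LINE * width)
    # Assumption: tokens beyond the 40th are truncated.
    for position, (token_type, value) in enumerate(line[:MAX_TOKENS_PER_LINE]):
        features[position * width + vocabulary[token_type]] = len(value)
    return features


lines = [merge_indent(line) for line in split_into_lines(TOKENS)]
vocabulary = {t: i for i, t in enumerate(sorted({t for t, _ in TOKENS}))}
features = [encode_line(line, vocabulary) for line in lines]
```

Most entries of each feature vector stay zero, which matches the "most are zeros" remark above.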


Feature engineering

Though extracted, names as words may not be used in this scheme.

We have tried two approaches to using this extra information:

1. LSTM sequence modelling (link to presentation)

2. ARTM topic modelling (article in our blog)


The Network

layer          kernel  pooling  number
convolutional  4x1     2x1      250
convolutional  8x2     2x2      200
convolutional  5x6     2x2      150
convolutional  2x10    2x2      100
all2all        -       -        512
all2all        -       -        64
all2all        -       -        output
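
To get a feel for the sizes involved, the table can be traced layer by layer. The slides do not state the input layout, padding, or stride, so everything below is an assumption: the input is taken as 50 lines × 40 tokens × 200 channels, kernels and pooling are read as lines × tokens, convolutions use "same" padding with stride 1, and pooling rounds down. `CONV_LAYERS` and `trace` are hypothetical names.

```python
# (kernel, pooling, number of filters), copied from the table above.
CONV_LAYERS = [
    ((4, 1), (2, 1), 250),
    ((8, 2), (2, 2), 200),
    ((5, 6), (2, 2), 150),
    ((2, 10), (2, 2), 100),
]


def trace(input_shape):
    """Return (output shape, parameter count) for every convolutional layer."""
    height, width, channels = input_shape
    report = []
    for (kh, kw), (ph, pw), filters in CONV_LAYERS:
        # Weights plus one bias per filter.
        params = kh * kw * channels * filters + filters
        # Assumed "same" padding keeps height and width; pooling shrinks them.
        height, width, channels = height // ph, width // pw, filters
        report.append(((height, width, channels), params))
    return report


report = trace((50, 40, 200))  # assumed input layout, see above
final_height, final_width, final_channels = report[-1][0]
flattened = final_height * final_width * final_channels  # input of all2all 512
```

With a different padding or input layout the numbers change, so this is a way to reason about the table rather than a description of the actual model.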


The Network

Activation: ReLU
Optimizer: GD with momentum (0.5)
Learning rate: 0.002
Weight decay: 0.955
Regularization: L2, 0.0005
Weight initialization: σ = 0.1
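
For reference, a single update of gradient descent with momentum and L2 regularization using the values above might look as follows. This is the textbook rule, not code from the talk; `momentum_step` is a hypothetical name, the L2 term is assumed to be (λ/2)·w² so that its gradient is λ·w, and the "weight decay 0.955" setting is left out because the slide does not say how it is applied.

```python
def momentum_step(weights, velocity, gradients,
                  learning_rate=0.002, momentum=0.5, l2=0.0005):
    """One GD-with-momentum update over flat lists of floats."""
    new_weights, new_velocity = [], []
    for w, v, g in zip(weights, velocity, gradients):
        g = g + l2 * w                       # gradient of the assumed L2 penalty
        v = momentum * v - learning_rate * g  # accumulate velocity
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity


weights, velocity = momentum_step([1.0], [0.0], [0.5])
```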


The Network

• Merge all project files together, feed 50 LOT (lines of tokens) as a single sample.

• Does not converge without randomly shuffling the files (sample borders are, of course, fixed).

• Batch size is 50.

• Truncate projects to the LOT count of the smallest one.

• Fragile to small metaparameter deviations.
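
The sampling scheme in this list could be sketched as below. It is a plain-Python illustration under assumptions: `make_samples`, `balanced_dataset`, and `batches` are hypothetical names, a "file" is any list of per-line feature objects, the incomplete tail of each project is dropped, truncation is done on whole samples, and the final shuffle of samples is an extra step not mentioned on the slide.

```python
import random

LOT_PER_SAMPLE = 50  # lines of tokens per sample, from the slide
BATCH_SIZE = 50      # from the slide


def make_samples(files, rng):
    """Shuffle whole files, concatenate them, cut fixed 50-line windows."""
    files = list(files)
    rng.shuffle(files)
    merged = [line for file_lines in files for line in file_lines]
    count = len(merged) // LOT_PER_SAMPLE  # drop the incomplete tail
    return [merged[i * LOT_PER_SAMPLE:(i + 1) * LOT_PER_SAMPLE]
            for i in range(count)]


def balanced_dataset(project_a, project_b, seed=0):
    """Label samples 0/1 and truncate both projects to the smaller one."""
    rng = random.Random(seed)
    samples_a = make_samples(project_a, rng)
    samples_b = make_samples(project_b, rng)
    size = min(len(samples_a), len(samples_b))
    data = [(s, 0) for s in samples_a[:size]] + [(s, 1) for s in samples_b[:size]]
    rng.shuffle(data)  # assumption: samples are also shuffled before batching
    return data


def batches(data):
    """Yield consecutive batches of at most BATCH_SIZE samples."""
    for start in range(0, len(data), BATCH_SIZE):
        yield data[start:start + BATCH_SIZE]


# Toy projects: integers stand in for per-line feature vectors.
project_a = [list(range(120)), list(range(80)), list(range(60))]
project_b = [list(range(70)), list(range(90))]
data = balanced_dataset(project_a, project_b)
```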


The Network

• Python3 / TensorFlow / NVIDIA GPU

• Preprocessing is done on Dataproc (Spark)

• Database of features is stored in Cloud Storage

• Sparse matrices ⇒ normalization on the fly
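
The last point presumably follows from the fact that subtracting a non-zero mean turns every stored zero into a non-zero value, so a normalized matrix would no longer be sparse. A minimal sketch of one way to do this, with hypothetical helpers `feature_stats` and `normalize_row` and rows stored as `{feature index: value}` dictionaries (an assumed storage format, not the one used in the talk):

```python
import math


def feature_stats(rows, n_features):
    """Per-feature mean and standard deviation; absent entries count as 0."""
    count = len(rows)
    sums = [0.0] * n_features
    squares = [0.0] * n_features
    for row in rows:
        for index, value in row.items():
            sums[index] += value
            squares[index] += value * value
    means = [s / count for s in sums]
    stds = [math.sqrt(max(q / count - m * m, 0.0))
            for q, m in zip(squares, means)]
    return means, stds


def normalize_row(row, means, stds):
    """Densify one sparse row and normalize it at load time."""
    return [(row.get(i, 0.0) - m) / (s if s > 0 else 1.0)
            for i, (m, s) in enumerate(zip(means, stds))]


rows = [{0: 2.0}, {0: 4.0, 1: 1.0}, {}, {0: 6.0}]
means, stds = feature_stats(rows, n_features=3)
dense = [normalize_row(row, means, stds) for row in rows]
```

Only the two statistics vectors need to be kept in memory; the stored feature database stays sparse.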


Results

projects              description                                size              accuracy
Django vs Twisted     Web frameworks, Python                     800ktok each      84%
Matplotlib vs Bokeh   Plotting libraries, Python                 1Mtok vs 250ktok  60%
Matplotlib vs Django  Plotting library vs web framework, Python  1Mtok vs 800ktok  76%
Django vs Guava       Python vs Java                             800ktok           >99%
Hibernate vs Guava    Java libraries                             3Mtok vs 800ktok  96%


Results

Conclusion: the network is likely to extract internal similarity in each project and use it. Just like humans do.

If the languages are different, it is very easy to distinguish projects (at least because of unique token types).


Results

Problem: how to get this for a source code network?


Other work

GitHub has ≈6M active users (and 3M after reasonable filtering). If we are able to extract various features for each, we can cluster them. Vision:

1. Run K-means with K=45000 (using src-d/kmcuda)

2. Run t-SNE to visualize the landscape

BTW, kmcuda implements Yinyang k-means.
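
The clustering step itself can be illustrated with a toy K-means. Note that this sketch is plain Lloyd's algorithm in pure Python on a handful of 2D points, not Yinyang k-means and not the kmcuda API; it only shows what "run K-means" computes, and would be far too slow for 3M users and K=45000. The function name `kmeans` and the toy data are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed its cluster
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, assignments


# Two well-separated toy groups standing in for per-user feature vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, assignments = kmeans(points, k=2)
```

Yinyang k-means produces the same kind of result as Lloyd's algorithm while skipping many distance computations, which is what makes very large K practical.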


Other work

Article.

[Figure: spaces / tabs / mixed, by language (ASP through xBase). © source{d} CC-BY-SA 4.0]


Other work

Before:

After:


Thank you

We are hiring!