goo.gl/sd7wsm
(view this on your device)
Plan
1. Motivation
2. Source code feature engineering
3. The Network
4. Results
5. Other work
Motivation
“Everything is better with clusters.”
Motivation
Customers buy goods, and software developers write code.
Motivation
So to understand the latter, we need to understand what they do and how they
do it. Feature origins:
• Social networks
• Version control statistics
• History
• Style
• Source code
• Algorithms
• Dependency graph
• Style
Motivation
Motivation
Let's check how deep we can drill with ML on source code style.
Toy task: binary classification between two projects using only code-style
features.
Feature engineering
Requirements:
1. Ignore text files, Markdown, etc.
2. Ignore autogenerated files
3. Support many languages with minimal effort
4. Include as much information about the source code as possible
Feature engineering
(1) and (2) are solved by github/linguist and source{d}'s own tool
• Used by GitHub for the repository language bars
• Supports 400+ languages
Feature engineering
(3) and (4) are solved by Pygments
• Highlights source code (tokenizer)
• Supports 400+ languages (though only 50% overlap with github/linguist)
• ≈90 token types (not all are used for every language)
Feature engineering
Pygments example:

# prints "Hello, World!"
if True:
    print("Hello, World!")

Token.Comment.Single        '# prints "Hello, World!"'
Token.Text                  '\n'
Token.Keyword               'if'
Token.Text                  ' '
Token.Name.Builtin.Pseudo   'True'
Token.Punctuation           ':'
Token.Text                  '\n'
Token.Text                  '    '
Token.Keyword               'print'
Token.Punctuation           '('
Token.Literal.String.Double '"'
Token.Literal.String.Double 'Hello, World!'
Token.Literal.String.Double '"'
Token.Punctuation           ')'
Token.Text                  '\n'
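A token stream like the one above can be obtained with a few lines of Pygments (a sketch; exact token names can vary slightly between Pygments versions):

```python
# Tokenize a Python snippet with Pygments, as on the slide above.
from pygments import lex
from pygments.lexers import PythonLexer

code = '# prints "Hello, World!"\nif True:\n    print("Hello, World!")\n'
tokens = list(lex(code, PythonLexer()))
for ttype, value in tokens:
    print(ttype, repr(value))
```

Pygments guarantees that concatenating the token values reproduces the input, which is what makes a lossless per-token feature extraction possible.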
Feature engineering
Feature engineering
• Split the stream into lines, each with ≤40 tokens
• Merge indents
• "One against all" (one-hot) encoding, with the token value's length as the value
• Some tokens occupy more than 1 dimension, e.g. Token.Name reflects
naming style
• About 200 dimensions overall
• 8000 features per line, most are zeros
• Mean-dispersion normalization
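A minimal sketch of the encoding and normalization described above, with a toy five-type vocabulary standing in for the real ≈200 dimensions (all names and sizes here are hypothetical):

```python
import numpy as np

# Toy subset of token types; the real scheme uses ~200 dimensions.
TOKEN_TYPES = ["Keyword", "Name", "Punctuation", "Text", "Literal.String"]
DIM = len(TOKEN_TYPES)
MAX_TOKENS = 40  # each line holds at most 40 tokens

def encode_line(tokens):
    """tokens: list of (token_type, value) pairs for one line.
    One-hot per token slot, with the token value's length as the value."""
    vec = np.zeros(MAX_TOKENS * DIM)
    for i, (ttype, value) in enumerate(tokens[:MAX_TOKENS]):
        vec[i * DIM + TOKEN_TYPES.index(ttype)] = len(value)
    return vec

def normalize(matrix):
    """Mean-dispersion normalization over a batch of line vectors."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero on constant columns
    return (matrix - mean) / std
```

Most slots stay zero, which is why the real pipeline keeps the matrices sparse.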
Feature engineering
Though extracted, names as words are not used in this scheme.
We've explored two approaches to using this extra information:
1. LSTM sequence modelling (link to presentation)
2. ARTM topic modelling (article in our blog)
Feature engineering
The Network

layer          kernel  pooling  number
convolutional  4x1     2x1      250
convolutional  8x2     2x2      200
convolutional  5x6     2x2      150
convolutional  2x10    2x2      100
all2all                         512
all2all                         64
all2all                         output
The Network

Activation             ReLU
Optimizer              GD with momentum (0.5)
Learning rate          0.002
Weight decay           0.955
Regularization         L2, 0.0005
Weight initialization  σ = 0.1
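The feature-map shrinkage through the stack above can be sanity-checked with simple shape arithmetic. Two assumptions not stated in the deck: 'same'-padded convolutions (so only the pooling windows reduce the spatial size) and a hypothetical 50×160 input (50 lines of tokens × 160 feature columns):

```python
# Shape arithmetic sketch for the conv/pool stack in the table above.
def pool_out(size, window):
    """Output length after non-overlapping pooling (floor division)."""
    return size // window

poolings = [(2, 1), (2, 2), (2, 2), (2, 2)]  # (h, w) windows per layer

h, w = 50, 160  # hypothetical input size, not stated in the deck
for ph, pw in poolings:
    h, w = pool_out(h, ph), pool_out(w, pw)
print(h, w)  # -> 3 20: spatial size entering the all2all (dense) layers
```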
The Network
• Merge all project files together, feed 50 LOT (lines of tokens) as a single
sample.
• Does not converge without randomly shuffling the files (sample borders are,
of course, fixed).
• Batch size is 50.
• Truncate both projects to the LOT count of the smaller one.
• Fragile to small metaparameter deviations.
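The sample-assembly steps above can be sketched as follows (function and parameter names are hypothetical):

```python
import random

LINES_PER_SAMPLE = 50  # from the slide: 50 LOT per sample

def make_samples(files, seed=0):
    """Shuffle the *file* order (line order inside each file stays
    fixed), concatenate all token lines into one stream, and cut it
    into fixed 50-line samples; the tail remainder is dropped."""
    files = list(files)
    random.Random(seed).shuffle(files)
    stream = [line for f in files for line in f]
    n = len(stream) // LINES_PER_SAMPLE
    return [stream[i * LINES_PER_SAMPLE:(i + 1) * LINES_PER_SAMPLE]
            for i in range(n)]
```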
The Network
• Python 3 / TensorFlow / NVIDIA GPU
• Preprocessing is done on Dataproc (Spark)
• The feature database is stored in Cloud Storage
• Sparse matrices ⇒ normalization on the fly
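Because the line vectors stay sparse, the mean-dispersion statistics can be computed in a single streaming pass instead of materializing a dense matrix. A toy sketch using Welford's online algorithm (the dict-based vector format is an assumption for illustration, not the actual pipeline):

```python
def online_mean_std(vectors, dim):
    """One-pass (Welford) mean / standard deviation over a stream of
    sparse vectors, each a {index: value} dict; absent indices count
    as zeros. Looping over every dimension keeps the math simple for
    the sketch; a real implementation would exploit the sparsity."""
    n = 0
    mean = [0.0] * dim
    m2 = [0.0] * dim
    for vec in vectors:
        n += 1
        for j in range(dim):
            x = vec.get(j, 0.0)
            d = x - mean[j]
            mean[j] += d / n
            m2[j] += d * (x - mean[j])
    std = [(m2[j] / n) ** 0.5 if n else 0.0 for j in range(dim)]
    return mean, std
```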
Results

projects              description                 size                accuracy
Django vs Twisted     Web frameworks, Python      800 ktok each       84%
Matplotlib vs Bokeh   Plotting libraries, Python  1 Mtok vs 250 ktok  60%
Matplotlib vs Django  Plotting vs web, Python     1 Mtok vs 800 ktok  76%
Django vs Guava       Python vs Java              800 ktok            >99%
Hibernate vs Guava    Java libraries              3 Mtok vs 800 ktok  96%
Results
Conclusion: the network likely extracts the internal similarity within each
project and exploits it, just like humans do.
If the languages differ, distinguishing the projects is very easy (at least
because of unique token types).
Results
Results
Problem: how to get this for a source code network?
Other work
GitHub has ≈6M active users (≈3M after reasonable filtering). If we can
extract various features for each, we can cluster them. Vision:
1. Run K-means with K=45000 (using src-d/kmcuda)
2. Run t-SNE to visualize the landscape
BTW, kmcuda implements Yinyang k-means.
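kmcuda's Yinyang variant accelerates the classic Lloyd iteration by skipping most distance computations via group-level bounds; the underlying iteration it speeds up looks roughly like this (toy pure-Python sketch on 2-D points, not the kmcuda API):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of (x, y) points."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)  # random initial centroids
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                  + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to its cluster's mean.
        for j, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster emptied
                centroids[j] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids
```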
Other work
Article.
[Figure: the resulting landscape of 200+ GitHub languages, each labeled by
dominant indentation style: spaces, tabs, or mixed. © source{d} CC-BY-SA 4.0]
Other work
Article.
Other work
Before:
After:
Thank you
We are hiring!