goo.gl/sd7wsm
(view this on your device)
Plan
1. Motivation
2. Source code feature engineering
3. The Network
4. Results
5. Other work
Motivation
“Everything is better with clusters.”
Motivation
Customers buy goods, and software developers write code.
Motivation
So to understand the latter, we need to understand what they do and how they
do it. Feature origins:
• Social networks
• Version control statistics
• History
• Style
• Source code
• Algorithms
• Dependency graph
• Style
Motivation
Motivation
Let's check how deep we can drill with ML on source code style.
Toy task: binary classification between two projects using only code-style
features.
Feature engineering
Requirements:
1. Ignore text files, Markdown, etc.
2. Ignore autogenerated files
3. Support many languages with minimal effort
4. Include as much information about the source code as possible
Feature engineering
(1) and (2) are solved by github/linguist and source{d}'s own tool
• Used by GitHub for the repository language bars
• Supports 400+ languages
Feature engineering
(3) and (4) are solved by Pygments
• Highlights source code (tokenizer)
• Supports 400+ languages (though only 50% overlap with github/linguist)
• ≈90 token types (not all are used for every language)
Feature engineering
Pygments example:

# prints "Hello, World!"
if True:
    print("Hello, World!")

Token.Comment.Single        '# prints "Hello, World!"'
Token.Text                  '\n'
Token.Keyword               'if'
Token.Text                  ' '
Token.Name.Builtin.Pseudo   'True'
Token.Punctuation           ':'
Token.Text                  '\n'
Token.Text                  '    '
Token.Keyword               'print'
Token.Punctuation           '('
Token.Literal.String.Double '"'
Token.Literal.String.Double 'Hello, World!'
Token.Literal.String.Double '"'
Token.Punctuation           ')'
Token.Text                  '\n'
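A token stream like the one above can be obtained with a few lines of Pygments (a sketch; exact token names can vary slightly between Pygments versions):

```python
# Tokenize a Python snippet with Pygments, as on the slide above.
from pygments import lex
from pygments.lexers import PythonLexer

code = '# prints "Hello, World!"\nif True:\n    print("Hello, World!")\n'
tokens = list(lex(code, PythonLexer()))
for ttype, value in tokens:
    print(ttype, repr(value))
```

Pygments guarantees that concatenating the token values reproduces the input, which is what makes a lossless per-token feature extraction possible.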
Feature engineering
Feature engineering
• Split the stream into lines, each with ≤40 tokens
• Merge indents
• "One against all" (one-hot) encoding, with the token value's length as the value
• Some tokens occupy more than 1 dimension, e.g. Token.Name reflects
naming style
• About 200 dimensions overall
• 8000 features per line, most are zeros
• Mean-dispersion normalization
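A minimal sketch of the encoding and normalization described above, with a toy five-type vocabulary standing in for the real ≈200 dimensions (all names and sizes here are hypothetical):

```python
import numpy as np

# Toy subset of token types; the real scheme uses ~200 dimensions.
TOKEN_TYPES = ["Keyword", "Name", "Punctuation", "Text", "Literal.String"]
DIM = len(TOKEN_TYPES)
MAX_TOKENS = 40  # each line holds at most 40 tokens

def encode_line(tokens):
    """tokens: list of (token_type, value) pairs for one line.
    One-hot per token slot, with the token value's length as the value."""
    vec = np.zeros(MAX_TOKENS * DIM)
    for i, (ttype, value) in enumerate(tokens[:MAX_TOKENS]):
        vec[i * DIM + TOKEN_TYPES.index(ttype)] = len(value)
    return vec

def normalize(matrix):
    """Mean-dispersion normalization over a batch of line vectors."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero on constant columns
    return (matrix - mean) / std
```

Most slots stay zero, which is why the real pipeline keeps the matrices sparse.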
Feature engineering
Though extracted, names as words are not used in this scheme.
We've explored two approaches to using this extra information:
1. LSTM sequence modelling (link to presentation)
2. ARTM topic modelling (article in our blog)
Feature engineering
The Network

layer          kernel  pooling  number
convolutional  4x1     2x1      250
convolutional  8x2     2x2      200
convolutional  5x6     2x2      150
convolutional  2x10    2x2      100
all2all                         512
all2all                         64
all2all                         output
The Network

Activation             ReLU
Optimizer              GD with momentum (0.5)
Learning rate          0.002
Weight decay           0.955
Regularization         L2, 0.0005
Weight initialization  σ = 0.1
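The feature-map shrinkage through the stack above can be sanity-checked with simple shape arithmetic. Two assumptions not stated in the deck: 'same'-padded convolutions (so only the pooling windows reduce the spatial size) and a hypothetical 50×160 input (50 lines of tokens × 160 feature columns):

```python
# Shape arithmetic sketch for the conv/pool stack in the table above.
def pool_out(size, window):
    """Output length after non-overlapping pooling (floor division)."""
    return size // window

poolings = [(2, 1), (2, 2), (2, 2), (2, 2)]  # (h, w) windows per layer

h, w = 50, 160  # hypothetical input size, not stated in the deck
for ph, pw in poolings:
    h, w = pool_out(h, ph), pool_out(w, pw)
print(h, w)  # -> 3 20: spatial size entering the all2all (dense) layers
```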
The Network
• Merge all project files together, feed 50 LOT (lines of tokens) as a single
sample.
• Does not converge without randomly shuffling the files (sample borders are,
of course, fixed).
• Batch size is 50.
• Truncate both projects to the LOT count of the smaller one.
• Fragile to small metaparameter deviations.
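The sample-assembly steps above can be sketched as follows (function and parameter names are hypothetical):

```python
import random

LINES_PER_SAMPLE = 50  # from the slide: 50 LOT per sample

def make_samples(files, seed=0):
    """Shuffle the *file* order (line order inside each file stays
    fixed), concatenate all token lines into one stream, and cut it
    into fixed 50-line samples; the tail remainder is dropped."""
    files = list(files)
    random.Random(seed).shuffle(files)
    stream = [line for f in files for line in f]
    n = len(stream) // LINES_PER_SAMPLE
    return [stream[i * LINES_PER_SAMPLE:(i + 1) * LINES_PER_SAMPLE]
            for i in range(n)]
```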
The Network
• Python 3 / TensorFlow / NVIDIA GPU
• Preprocessing is done on Dataproc (Spark)
• The feature database is stored in Cloud Storage
• Sparse matrices ⇒ normalization on the fly
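Because the line vectors stay sparse, the mean-dispersion statistics can be computed in a single streaming pass instead of materializing a dense matrix. A toy sketch using Welford's online algorithm (the dict-based vector format is an assumption for illustration, not the actual pipeline):

```python
def online_mean_std(vectors, dim):
    """One-pass (Welford) mean / standard deviation over a stream of
    sparse vectors, each a {index: value} dict; absent indices count
    as zeros. Looping over every dimension keeps the math simple for
    the sketch; a real implementation would exploit the sparsity."""
    n = 0
    mean = [0.0] * dim
    m2 = [0.0] * dim
    for vec in vectors:
        n += 1
        for j in range(dim):
            x = vec.get(j, 0.0)
            d = x - mean[j]
            mean[j] += d / n
            m2[j] += d * (x - mean[j])
    std = [(m2[j] / n) ** 0.5 if n else 0.0 for j in range(dim)]
    return mean, std
```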
Results

projects              description                 size                accuracy
Django vs Twisted     Web frameworks, Python      800 ktok each       84%
Matplotlib vs Bokeh   Plotting libraries, Python  1 Mtok vs 250 ktok  60%
Matplotlib vs Django  Plotting vs web, Python     1 Mtok vs 800 ktok  76%
Django vs Guava       Python vs Java              800 ktok            >99%
Hibernate vs Guava    Java libraries              3 Mtok vs 800 ktok  96%
Results
Conclusion: the network likely extracts the internal similarity within each
project and exploits it, just like humans do.
If the languages differ, distinguishing the projects is very easy (at least
because of unique token types).
Results
Results
Problem: how to get this for a source code network?
Other work
GitHub has ≈6M active users (≈3M after reasonable filtering). If we can
extract various features for each, we can cluster them. Vision:
1. Run K-means with K=45000 (using src-d/kmcuda)
2. Run t-SNE to visualize the landscape
BTW, kmcuda implements Yinyang k-means.
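kmcuda's Yinyang variant accelerates the classic Lloyd iteration by skipping most distance computations via group-level bounds; the underlying iteration it speeds up looks roughly like this (toy pure-Python sketch on 2-D points, not the kmcuda API):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of (x, y) points."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)  # random initial centroids
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                  + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to its cluster's mean.
        for j, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster emptied
                centroids[j] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids
```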
Other work
Article.
[Figure: the resulting landscape of 200+ GitHub languages, each labeled by
dominant indentation style: spaces, tabs, or mixed. © source{d} CC-BY-SA 4.0]
Other work
Article.
Other work
Before:
After:
Thank you
We are hiring!