40
Machine Learning for Large Scale Code Analysis This is source{d}. Vadim Markovtsev

This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Machine Learning forLarge Scale Code Analysis

This is source{d}.Vadim Markovtsev

Page 2: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Who I am

2

Page 3: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Vadim Markovtsev● Machine Learning team leader● Joined source{d} in mid-2016● Worked in 5 software companies, e.g. Samsung Research, Mail.Ru● Spoke 30+ times on IT conferences, from meetups to AAA● Master in Applied Mathematics (Moscow University of Physics and

Technology, 2012)

3

Page 4: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

How we work

4

Page 5: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Who we are

● 35+ employees● 70% are engineers● 15% are management● Major cultures: Hispanic,

Slavic, French5

Page 6: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Timeline● Early 2015: appear as Tyba, seed investments● Late 2015: pivot, become source{d} (new CTO's side project)● Early 2016: run out of money, some employees leave● Mid 2016: raise €5mm, salvation● Late 2016: pivot, stop making money and dismiss 50%● 2017: let's clone all Git repositories, parse them, and do MLonCode● Late 2017: remote first● 2018: assisted code review with MLonCode● Early 2019: engineering observability… with MLonCode● Mid 2019: first revenue and raise €Xmm

6

Page 7: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Human resources

7

Number of skilled engineers

Time

High-risk, high-profit, innovation

Low-risk, no extra profit, optimization

source{d} right now

Page 8: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Offices● HQ: Madrid and San Francisco● 60% are remote● Flexible everything

8

Page 9: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Perks● 23 days of vacations● Open Source Day● Research Day● Conferences● Papers

9

Page 10: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Teams● Applications● Solutions● Machine Learning, applied● Machine Learning, research● Data Processing● Data Retrieval● Language Analysis● Developer Operations● Developer Relations● Quality Assurance

10

● Product management● Business Intelligence● Management

Page 11: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Structure

11

Management (CEO, CTO, etc.) VP of engineering

Team Lead

Coordinator Engineer

Product Manager

Page 12: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Hiring● Remote coding challenge● Remote Machine Learning challenge*● Personal with CTO or VP of Engineering● Personal with CEO (sometimes)● Design interview● Machine Learning interview*● Q&A interview● Logical Thinking interview

Open discussion with veto-ing and overriding by the team lead

12

Page 13: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

github.com/src-d/guide13

Page 14: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Culture● 4 full-time engineers left since mid-2016● 1 joined back

○ He became our VP of engineering

14

Page 15: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Technologies

15

Page 16: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Programming languages

1. Go2. Python 33. Scala4. Javascript, Typescript5. C/C++6. CUDA

16

Page 17: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Our engineers play with

● Rust● Elm● D● Haskell

17

Page 18: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Tools

● Git, GitHub● Linux, macOS● Visual Studio Code● PyCharm, GoLand, CLion● vim● Ghost● Gimp, Inkscape● shwr.me

18

● Slack● Google Docs● Zoom, appear.in● Octobox

Page 19: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

If we go deeper

● Indent with spaces● Switched to Go modules, no vendoring● go-git● Python scientific stack● Tensorflow and Pytorch● Apache Spark● React● Kubernetes, Docker

19

Page 20: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Clouds

● Amazon● Azure (Microsoft)● Google● etc.

20

➖ Expensive➖ Support sucks

➕ X × $10,000 for startups➕ Awesome quality: reliable, performant, etc.

Page 21: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Google21

Page 22: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

GitHub22

Page 23: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Computer science stuff we did use

● Graph traversal● Connected components, shortest path, etc.● Linear Programming: bipartite matching, network flow, and many others● Convex optimization● Grammar parsers● Complexity theory● String algorithms, e.g. LCS with suffix arrays● Compression theory● Disjoint sets

23

Force yourself to study the theory, this is your competitive advantage

● Merkle tree● Dynamic Programming● Max-SAT● ...

Page 24: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Machine Learning24

Page 25: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Natural Language Processing

● Working with word distributions● Pipelines● word2vec; Swivel● Transformers

25

Page 26: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Classics

● Linear Regression● Random Tree Forest● Production Rules● GBDT: xgboost, catboost● Hyperparameter optimization

26

Page 27: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Clustering

● K-Means● t-SNE● UMAP● k-NN, e.g. KD Tree, VP Tree● aNN, e.g. hnsw● MinHash, Weighted MinHash

27

Page 28: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Deep Learning

● Char-level CNN● Transformers, Inception● LSTM, GRU● Gated Graph Neural Networks

28

Page 29: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

What we do

29

Page 30: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Code as Data

● "Code Lake"● Dashboards● Advanced Insights

Engineering Observability

30

Page 31: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Machine Learning on Source Code

● Automatic Program Repair● Code Naturalness

Assisted Code Review

31

Page 32: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Challenges

32

Page 33: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Remote communication33

Page 34: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Talent34

Page 35: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Transparency35

Page 36: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Pioneering36

Page 37: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Academia37

Page 38: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Sales38

Page 39: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Scaling39

Page 40: This is source{d}. - vmarkovtsev.github.iovmarkovtsev.github.io/SUPCOM.pdf · Git, GitHub Linux, macOS Visual Studio Code PyCharm, GoLand, CLion vim Ghost Gimp, Inkscape shwr.me 18

Machine Learning for Large Scale Code Analysis

sourced.tech · github.com/src-d · @sourcedtech · blog.sourced.tech