Markus Weimer

Markus.Weimer@Microsoft.com

Making ML more useful to more people

Why Machine Learning? “Programming the UnProgrammable”

“It has exquisite buttons … with long sleeves … works for casual as well as business settings” → f(x)

Point of view: Data Science is Software Engineering with Data

Models are Software

• Built as software, just with different tools

• Deployed and updated as software

• Tested as software

• Debugged like software

Training data needs management

• Data is private and increasingly regulated

• Data is dynamic (CRUD, retention policies, …)

• Best managed as part of the data estate

• Training and deployment of models needs to respect data governance

https://dot.net/ml

ML.NET

Brought to you by (amongst others)

Zeeshan Ahmed (Microsoft) <zeahmed@microsoft.com>, Saeed Amizadeh (Microsoft) <saamizad@microsoft.com>, Mikhail Bilenko (Yandex) <mbilenko@yandex-team.ru>, Rogan Carr (Microsoft) <rocarr@microsoft.com>, Wei-Sheng Chin (Microsoft) <WeiSheng.Chin@microsoft.com>, Yael Dekel (Microsoft) <yaeld@microsoft.com>, Xavier Dupre (Microsoft) <xadupre@microsoft.com>, Vadim Eksarevskiy (Microsoft) <Vadim.Eksarevskiy@microsoft.com>, Senja Filipi (Microsoft) <sefilipi@microsoft.com>, Tom Finley (Microsoft) <tfinley@microsoft.com>, Abhishek Goswami (Microsoft) <agoswami@microsoft.com>, Monte Hoover (Microsoft) <Monte.Hoover@microsoft.com>, Scott Inglis (Microsoft) <singlis@microsoft.com>, Matteo Interlandi (Microsoft) <mainterl@microsoft.com>, Najeeb Kazmi (Microsoft) <nakazmi@microsoft.com>, Gleb Krivosheev (Microsoft) <gleb.krivosheev@skype.net>, Pete Luferenko (Microsoft) <Pete.Luferenko@microsoft.com>, Ivan Matantsev (Microsoft) <ivmatan@microsoft.com>, Sergiy Matusevych (Microsoft) <sergiym@microsoft.com>, Shahab Moradi (Microsoft) <shmoradi@microsoft.com>, Gani Nazirov (Microsoft) <ganaziro@microsoft.com>, Justin Ormont (Microsoft) <Justin.Ormont@microsoft.com>, Gal Oshri (Microsoft) <gaoshri@microsoft.com>, Artidoro Pagnoni (Microsoft) <Artidoro.Pagnoni@microsoft.com>, Jignesh Parmar (Microsoft) <jignparm@microsoft.com>, Prabhat Roy (Microsoft) <Prabhat.Roy@microsoft.com>, Zeeshan Siddiqui (Microsoft) <mzs@microsoft.com>, Markus Weimer (Microsoft) <mweimer@microsoft.com>, Shauheen Zahirazami (Microsoft) <shzahira@microsoft.com>, Yiwen Zhu (Microsoft) <zhu.yiwen@microsoft.com>, …

About .NET

• .NET has cool stuff ML people care about

• C#: Like Java, but from the future

• F#: Like Python, but with static types and multithreading

• Almost-free calls into native code

• .NET is OSS and cross platform

• Windows (surprise!), Linux, macOS

• Phones via Xamarin: Android, iOS

• Interesting HW: Xbox, IoT devices, …

• Lots of developers build important stuff in .NET

• 4M active; 450k added each month

• 15% growth MoM in https://github.com/dotnet

• Half the top-10k websites are built in .NET

ML.NET: An open source and cross-platform machine learning framework

Machine Learning made for .NET Developers

Covers many developer scenarios

Available in C#, F# and VB.NET

Open source and cross-platform

Windows, Linux, Mac

x64, x86 (some), ARM (some)

Proven and extensible

Development started ~10 years ago

Received contributions (and scrutiny) from all over Microsoft

This designed most of my slides used today ☺

ML.NET is used in many products

• Many MS products use ML.NET (known internally as TLC).

• You have likely used ML.NET today ☺

• Why is that?

• Many products are written in (ASP).NET

• Using ML.NET is just like using any other .NET API

var model = mlContext.Model.Load("mymodel.zip");
var predFunc = model.MakePredictionFunction<T_IN, T_OUT>(mlContext);
var result = predFunc.Predict(x);

Using a model is just like using code

Resource shipped with the app.

Standard software dependency

Training: Think sklearn, but with a statically typed language

ML.NET captures end-to-end Machine Learning Pipelines

Data Ingestion

Text

SQL

In Memory

Featurization and Transforms

Text & Image featurization

Pre-trained DNNs in ONNX, TensorFlow

Feature transforms (normalization, pruning, …)

Learning Algorithms

Supervised: Linear, Trees, Factorization Machines, …

Unsupervised: PCA, LDA, K-Means, …

Time Series

ML.NET is fast & good

• Core infrastructure: IDataView

• Carefully designed to avoid memory allocations

• Only required data is lazily materialized

• Carefully tuned defaults

• Many ML tasks are more alike than we’d like to admit ☺

GBDT Experiments done on Criteo, using default parameters

ML.NET’s journey to OSS

• Developed for almost a decade as an internal tool

• Open Sourced in May 2018 (at //build)

• MIT License, .NET Foundation

• Monthly releases ever since; 1.0rc1 this Tuesday

• Please check it out, and leave feedback

Other efforts not discussed today

• Pretzel

• Model compiler

• Especially good at the many models → one program problem

• http://www.markusweimer.com/publication/2018/10/23/pretzel/

• TorchSharp

• PyTorch – Python + .NET

• https://github.com/xamarin/TorchSharp

Distributed Machine Learning where the Data is

Resource Managers

• One cluster used by all workloads (interactive, batch, streaming, …)

• Resources are handed out as containers

• A container is a slice of a machine

• Fixed RAM, CPU, I/O, …

• Examples: Azure Batch, Apache Hadoop YARN, Apache Mesos, Google Borg

Challenges

• Fault tolerance

• Pre-emption

• Elasticity

Machine learning

• ML thrives with gang scheduling

• Iterative

• Fixed data sets

• Gangs are undesirable on shared clusters

• Utilization is paramount

• MPI: Wait …

• MapReduce: Do the work slowly on fewer machines

• Let’s do better than that

Approach I: Elastic ML

NeurIPS ’14

Elastic ML

• Our solution: Ramp up the workload with the allocations

• In each iteration, add machines and data

[Figure frames: first iteration, second iteration, end state]
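The ramp-up idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the NeurIPS ’14 method itself: the toy one-parameter least-squares model, the function name `elastic_gradient_descent`, the four data partitions, and the allocation schedule are all hypothetical. Each iteration simply trains on the partitions of the containers granted so far.

```python
def elastic_gradient_descent(partitions, allocation_schedule, learning_rate=0.01):
    """Fit y ~ w * x by gradient descent while the working set grows.

    partitions: list of lists of (x, y) pairs, one list per container.
    allocation_schedule: number of containers available in each iteration
    (hypothetical; a real resource manager would decide this).
    """
    w = 0.0
    history = []
    for n_containers in allocation_schedule:
        # Only the data held by containers allocated so far takes part.
        active = [point for part in partitions[:n_containers] for point in part]
        # Gradient of the mean squared error over the active data.
        grad = sum(2.0 * (w * x - y) * x for x, y in active) / len(active)
        w -= learning_rate * grad
        history.append((n_containers, len(active), w))
    return w, history


# Toy data: y = 3x, split into 4 partitions of 2 points each.
points = [(float(x), 3.0 * x) for x in range(1, 9)]
partitions = [points[i:i + 2] for i in range(0, 8, 2)]

# Hypothetical ramp-up: 1 container, then 2, then all 4.
schedule = [1, 2] + [4] * 18
w, history = elastic_gradient_descent(partitions, schedule)
```

In this sketch no iteration waits for the full gang: early iterations make progress on the data that is already placed, and later iterations see the full data set.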

Is it any good?

Approach II: Coded computing

Yaoqing Yang (CMU), Matteo Interlandi, Saeed Amizadeh

NeurIPS ’18, ongoing work

Or: Coded Computing

Container 1 (original data): X[1], Y[1], …
Container 2 (original data): X[2], Y[2], …
Container 3 (original data): X[3], Y[3], …
Container 4 (coded data): X[1]+2X[2]+3X[3], Y[1]+2Y[2]+3Y[3], …
Container 5 (coded data): X[1]+4X[2]+9X[3], Y[1]+4Y[2]+9Y[3], …
Container 6 (coded data): X[1]+8X[2]+27X[3], Y[1]+8Y[2]+27Y[3], …

• Encode 3 splits into 6 splits

• Any 3 row blocks out of 6 are sufficient
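The encoding on the slide can be checked with a small sketch. The coefficients (1, 2, 3), (1, 4, 9) and (1, 8, 27) are taken from the slide; the function names `encode`, `solve` and `decode` and the toy blocks are assumptions for illustration, not the authors' implementation. The sketch encodes 3 blocks into 6 and recovers the originals from any 3 surviving containers by solving a 3×3 linear system exactly.

```python
from fractions import Fraction
from itertools import combinations

# One row per container: 3 identity rows (original data) + 3 coded rows.
ENCODING = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 2, 3],
    [1, 4, 9],
    [1, 8, 27],
]


def encode(blocks):
    """Turn 3 original blocks (equal-length lists of numbers) into 6 blocks."""
    width = len(blocks[0])
    return [
        [sum(c * block[j] for c, block in zip(row, blocks)) for j in range(width)]
        for row in ENCODING
    ]


def solve(matrix, rhs):
    """Solve matrix * X = rhs exactly with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [
        [Fraction(v) for v in matrix[i]] + [Fraction(v) for v in rhs[i]]
        for i in range(n)
    ]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def decode(surviving):
    """Recover the 3 original blocks from 3 (container_index, block) pairs."""
    rows = [ENCODING[index] for index, _ in surviving]
    coded_blocks = [block for _, block in surviving]
    return solve(rows, coded_blocks)


original = [[1, 2], [3, 4], [5, 6]]   # toy stand-ins for X[1], X[2], X[3]
coded = encode(original)

# Every choice of 3 containers out of 6 recovers the original blocks.
all_recovered = all(
    decode([(i, coded[i]) for i in keep]) == original
    for keep in combinations(range(6), 3)
)
```

Because the code is linear, the same decoding step also applies to the results of linear computations on the blocks (for example, per-block matrix-vector products), which is what lets a job finish without waiting for slow or failed containers.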

Results

• Real dataset: 100,000 samples, 3352 Features.

• Distributed computing on 20 machines.

• Randomly pick 10 machines and let them randomly fail during the computation.

Point of view: Data Science is Software Engineering with Data

Models are Software

• Built as software, just with different tools

• Deployed and updated as software

• Tested as software

• Debugged like software

Training data needs management

• Data is private and increasingly regulated

• Data is dynamic (CRUD, retention policies, …)

• Best managed as part of the data estate

• Training and deployment of models needs to respect data governance

Many open questions

• For software, we have source control. For data and models, we have … ?

• For software, we have code reviews. For data, we have … ?

• For software, we have semantic versions. For data, we have … ?

• For software, we have debuggers. For models, we have … ?

• For software, we have signing. For models, we have … ?

• …

Thanks for your time! Let’s stay in touch!

ML.NET is ML for .NET

https://dot.net/ml

https://github.com/dotnet/machinelearning

You can reach me at: Markus.Weimer@Microsoft.com

@MarkusWeimer

http://markusweimer.com

Of course, we are hiring
