MLeap: Release Spark ML Pipelines Hollin Wilkins & Mikhail Semeniuk

Spark Summit East 2016 - MLeap Presentation


Page 1: Spark Summit East 2016 -   MLeap Presentation

MLeap: Release Spark ML Pipelines

Hollin Wilkins & Mikhail Semeniuk

Page 2: Spark Summit East 2016 -   MLeap Presentation

Introduction

Studied some Math and Economics

- Predictive modeling of at risk patients @ UHG

- R/SAS batch pipelines

- Joins TrueCar to do new car price modeling

- SQL-based batch pipelines + python for Linear Regressions in API layer

- Continues to train models in SAS/R


- Massive pre-computation in Hadoop/ElasticSearch

- Portions of models still translated, now to Java

- Spark enables data processing and training of models in same environment

MLeap

Web Dev @ Cornell

Studied some General Biology

Rails Consulting for TrueCar and other companies

Implement ML model for ClearBook in Ruby, Python and Java

Erlang for highly-concurrent game servers

C/C# Game Engines

Join TrueCar for Rails/mobile development

Begin work on machine learning platform based on Spark/Hadoop

How do we build real-time APIs based on Spark Models? SparkContext prohibitive

Try several ideas, fail, learn a ton, begin collaboration with Hortonworks

Page 3: Spark Summit East 2016 -   MLeap Presentation

Everyone wants to do better! The winning technology will be the one that enables Engineers and Data Scientists to collaborate and work across a single platform.

Action Reaction

Problem Statement: Deploying machine learning algorithms to a production environment is a lot more difficult than it has to be and is a common problem for data-driven organizations.

- Data scientists write data pipelines to construct research datasets

- Engineers re-write the data pipelines for a production-ready system

- Engineers write scalable libraries for computing features and algorithms

- Data scientists largely don’t use those libraries and maintain/re-write their own copy of the code

- Data scientists largely focus on linear/logistic regressions due to engineering constraints

- Talented engineers tire of coding up linear regressions and updating coefficients

Page 4: Spark Summit East 2016 -   MLeap Presentation

HDFS: Shared Data

Spark: Shared ML APIs

?: Shared API Environment

Existing Solutions: You won’t believe how many companies are still deploying algorithms in a SQL environment! And these are billion dollar operations.

Hard-Coded Models (SQL, Java, Ruby)

PMML

Emerging Solutions (yHat, DataRobot)

Enterprise Solutions (Microsoft, IBM, SAS)

MLeap

Quick to Implement

Open Sourced

Committed to Spark/Hadoop

API Server

Page 5: Spark Summit East 2016 -   MLeap Presentation

MLeap Solution

• Born out of the need to deploy models to a real-time API server at TrueCar

• Leverage the Hadoop/Spark ecosystem for training, get rid of the Spark dependency for execution

• Easily reuse models by serializing them and executing them without Spark
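The idea of training in one environment and executing without it can be sketched in a few lines. This is a minimal illustration, not MLeap's serialization format or API: the JSON layout and the function names `export_linear_model` and `load_linear_model` are hypothetical, chosen only to show that a linear model's parameters are all a serving process needs.

```python
import json


def export_linear_model(coefficients, intercept):
    """Serialize a trained linear model's parameters to a JSON string.

    Hypothetical format for illustration; not MLeap's actual format.
    """
    return json.dumps({
        "type": "linear_regression",
        "coefficients": list(coefficients),
        "intercept": intercept,
    })


def load_linear_model(payload):
    """Rebuild a prediction function from the serialized parameters."""
    spec = json.loads(payload)
    coefficients = spec["coefficients"]
    intercept = spec["intercept"]

    def predict(features):
        if len(features) != len(coefficients):
            raise ValueError("feature vector has the wrong length")
        return intercept + sum(c * x for c, x in zip(coefficients, features))

    return predict


# Training side (Spark in the talk): produce parameters and export them.
payload = export_linear_model([2.0, -1.0], 10.0)

# Serving side: only the payload is needed, no training framework.
predict = load_linear_model(payload)
prediction = predict([3.0, 4.0])  # 10 + 2*3 - 1*4
```

The serving side depends only on the standard library, which is the property the slide is after: the API server never needs a SparkContext.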

Page 6: Spark Summit East 2016 -   MLeap Presentation

MLeap Components

• core - provides linear algebra system, regression models, and feature builders

• runtime - provides DataFrame-like “LeapFrame” and transformers for it

• spark - provides easy conversion from Spark transformers to MLeap transformers

mleap-spark

mleap-runtime

mleap-core

Page 7: Spark Summit East 2016 -   MLeap Presentation

MLeap Core Components

Linear Algebra
- Dense/Sparse Vectors
- BLAS from Spark

Feature Builders
- VectorAssembler
- StringIndexer
- OneHotEncoder
- StandardScaler

Regressions
- LinearRegression
- RandomForest
- LogisticRegression
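To make the linear algebra building block concrete, here is a rough sketch of dense and sparse vectors with a dot product, the operation a linear or logistic regression relies on at prediction time. The class and function names are hypothetical and the code is plain Python; it does not reproduce mleap-core or Spark's BLAS routines.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DenseVector:
    """Stores every element explicitly."""
    values: Tuple[float, ...]


@dataclass(frozen=True)
class SparseVector:
    """Stores only the non-zero elements and their positions."""
    size: int
    indices: Tuple[int, ...]
    values: Tuple[float, ...]


def dot(weights: DenseVector, features) -> float:
    """Dot product of dense weights with a dense or sparse feature vector."""
    if isinstance(features, SparseVector):
        if features.size != len(weights.values):
            raise ValueError("vector sizes differ")
        # Only the stored entries contribute; zeros are skipped entirely.
        return sum(weights.values[i] * v
                   for i, v in zip(features.indices, features.values))
    if len(features.values) != len(weights.values):
        raise ValueError("vector sizes differ")
    return sum(w * x for w, x in zip(weights.values, features.values))


weights = DenseVector((1.0, 2.0, 3.0))
sparse_result = dot(weights, SparseVector(3, (0, 2), (4.0, 5.0)))
dense_result = dot(weights, DenseVector((4.0, 0.0, 5.0)))
```

Sparse storage matters here because one-hot encoded categorical features are mostly zeros, so skipping them keeps prediction cheap.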

Page 8: Spark Summit East 2016 -   MLeap Presentation

MLeap Runtime

• Provides LeapFrame, which stores data for transformations by MLeap transformers

• MLeap transformers use mleap-core building blocks to transform LeapFrame

• MLeap transformers correspond one-to-one with Spark transformers

• No dependencies on Spark
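The relationship between a frame and its transformers might look roughly like the sketch below. `Frame` is a toy stand-in for a LeapFrame and `VectorAssemblerSketch` is a hypothetical transformer; neither mirrors the real mleap-runtime classes or their signatures. The point is only that a transformer reads named input columns and returns a new frame with an extra output column.

```python
class Frame:
    """Toy stand-in for a LeapFrame: named columns plus rows of values."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def select(self, name):
        """Return all values of one column."""
        index = self.columns.index(name)
        return [row[index] for row in self.rows]

    def with_column(self, name, inputs, fn):
        """Return a new frame with a column computed from the input columns."""
        positions = [self.columns.index(c) for c in inputs]
        rows = [row + [fn(*[row[p] for p in positions])] for row in self.rows]
        return Frame(self.columns + [name], rows)


class VectorAssemblerSketch:
    """Hypothetical transformer: packs several numeric columns into one vector."""

    def __init__(self, input_cols, output_col):
        self.input_cols = list(input_cols)
        self.output_col = output_col

    def transform(self, frame):
        return frame.with_column(
            self.output_col,
            self.input_cols,
            lambda *values: [float(v) for v in values],
        )


frame = Frame(["mileage", "year"], [[42000, 2012], [15000, 2015]])
assembler = VectorAssemblerSketch(["mileage", "year"], "features")
assembled = assembler.transform(frame)
```

Returning a new frame instead of mutating the input keeps each transformer independent, which makes it straightforward to chain them into a pipeline.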

Page 9: Spark Summit East 2016 -   MLeap Presentation

Feature Pipeline Diagram

[Diagram: Feature Pipeline. Continuous features are combined by a VectorAssembler into a continuous feature vector, which a StandardScaler turns into the scaled continuous feature vector. Each categorical feature passes through a StringIndexer (categorical feature index) and a OneHotEncoder (categorical feature one-hot vector); a VectorAssembler combines these into the categorical feature vector. A final VectorAssembler joins both into the final feature vector.]

[Diagram: Regression Pipeline. LinearRegression maps the final feature vector to a prediction.]
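The flow in the diagram can be sketched end to end with plain functions. Everything here is an illustrative assumption: the column names (`mileage`, `age`, `make_one_hot`), the parameter values, and the function names are invented for the example, and the categorical feature is assumed to be one-hot encoded already.

```python
def standard_scale(vector, means, stds):
    """Subtract the mean and divide by the standard deviation, element-wise."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]


def assemble(*vectors):
    """Concatenate several vectors into one, in the order given."""
    combined = []
    for vector in vectors:
        combined.extend(vector)
    return combined


def linear_predict(vector, coefficients, intercept):
    """Apply a fitted linear regression to a feature vector."""
    return intercept + sum(c * x for c, x in zip(coefficients, vector))


def run_pipeline(row, params):
    """Continuous branch, then final assembly, then regression."""
    continuous = assemble([row["mileage"]], [row["age"]])
    scaled = standard_scale(continuous, params["means"], params["stds"])
    final_features = assemble(scaled, row["make_one_hot"])
    return linear_predict(final_features,
                          params["coefficients"],
                          params["intercept"])


# Hypothetical fitted parameters, as if produced by a training run.
params = {
    "means": [50000.0, 5.0],
    "stds": [25000.0, 2.0],
    "coefficients": [-2000.0, -500.0, 1000.0, 0.0],
    "intercept": 20000.0,
}
row = {"mileage": 75000.0, "age": 3.0, "make_one_hot": [1.0, 0.0]}
price = run_pipeline(row, params)
```

Each stage needs only its fitted parameters at prediction time, which is why the whole pipeline, not just the regression, can run outside Spark.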

Page 10: Spark Summit East 2016 -   MLeap Presentation

Categorical Pipeline

[Diagram: LeapFrame (Categorical Feature) -> StringIndexer -> LeapFrame (Categorical Feature Index) -> OneHotEncoder -> LeapFrame (Categorical Feature One Hot Vector)]
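The two categorical steps can be sketched as follows. This is a simplified illustration with hypothetical function names: labels are indexed by descending frequency, as Spark's StringIndexer does by default, with ties broken alphabetically here as an assumption. Unlike Spark's OneHotEncoder, which drops the last category by default, this sketch keeps one slot per category for readability.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


makes = ["honda", "ford", "honda", "bmw"]

# Step 1: categorical feature -> categorical feature index.
label_to_index = fit_string_indexer(makes)
indexed = [label_to_index[make] for make in makes]

# Step 2: categorical feature index -> one-hot vector.
encoded = [one_hot(i, len(label_to_index)) for i in indexed]
```

The fitted mapping `label_to_index` is the only state the indexer needs at prediction time, so it is the part that would be serialized with the model.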

Page 11: Spark Summit East 2016 -   MLeap Presentation

String Indexer Model

One Hot Encoder Model

Linear Regression Model

Page 12: Spark Summit East 2016 -   MLeap Presentation

MLeap Spark

• Train an ML pipeline with Spark then export it to MLeap

• Execute an MLeap pipeline against a Spark DataFrame

[Diagram: Spark Estimator -> Spark Model -> (MLeap Spark) -> MLeap Model]

[Diagram: Spark DataFrame -> (MLeap Spark) -> Spark LeapFrame -> (MLeap Transformer) -> Spark LeapFrame -> (MLeap Spark) -> Spark DataFrame]

Page 13: Spark Summit East 2016 -   MLeap Presentation

Typical MLeap Workflow

DataFrames

HDFS: Raw Data

Spark/MapReduce/Pig/Hive

Combine + Enrich Transactional Data

Compute the feature vector

Train, Test, Repeat (MLeap-Spark)

Export models to …?

Spark MLeap Runtime

Serialize to HDFS

Serialize to ElasticSearch

Model 1 Model 2

Deploy to API Servers

Scala/Java/Python/Zeppelin/Jupyter/DataBricks

Page 14: Spark Summit East 2016 -   MLeap Presentation

Demo Usage of MLeap

• Train a sample listing price model using linear regression and random forest

• Deploy both models to an API server

• Get real-time results

• IN UNDER 5 MINUTES!

Page 15: Spark Summit East 2016 -   MLeap Presentation

Future of MLeap

• Have MLeap Core be a dependency of Spark to ensure parity between transformers

• Expand supported transformers

• Python/R interface

Page 16: Spark Summit East 2016 -   MLeap Presentation

THANK YOU.

Github: http://github.com/truecar/mleap

Hollin Wilkins: [email protected]

Mikhail Semeniuk: [email protected]