
Spark & Machine Learning Workflows
Juliet Hougland @j_houg


Spark Execution Model
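The execution-model diagram isn't reproduced here. As a minimal sketch of the idea (lazy transformations build a plan on the driver; an action triggers tasks on the executors), assuming only a local PySpark install; the file path and column are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('execution-model-demo').getOrCreate()

# Transformations are lazy: these lines only build an execution plan on the driver.
df = spark.read.csv('churn.all', inferSchema=True)   # placeholder path
high_usage = df.filter(df['_c7'] > 200.0)            # placeholder column

# An action forces Spark to schedule tasks on the executors and return a result.
print(high_usage.count())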


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Model Training

(Diagram: Historic Data is split into Training Data and Test Data; the Model Pipeline (featurization, model fitting) is fit on the Training Data, checked by Model Evaluation on the Test Data, and saved as the Persisted Model.)


Pipelines


Real Example: Churn Prediction for a Telco


The Dataset

KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
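A minimal sketch of loading these rows into a Spark DataFrame (assuming an existing SparkSession named spark; the file name is a placeholder, and the renamed columns follow the usual telco-churn layout of state, intl_plan, and churned):

# The file has no header row, so Spark names the 21 columns _c0 .. _c20.
raw = spark.read.csv('churn.all', inferSchema=True)

churn_data = (raw
    .withColumnRenamed('_c0', 'state')        # first field in the rows above
    .withColumnRenamed('_c4', 'intl_plan')    # the yes/no international-plan flag
    .withColumnRenamed('_c20', 'churned'))    # the True/False label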


Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

X, Y = get_data()
gbr = GradientBoostingClassifier()
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
gbr.fit(X_train, Y_train)
Y_predicted = gbr.predict(X_test)


Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X, Y = get_data()
pipeline = Pipeline([
    ('ohe', OneHotEncoder(categorical_features=[0, 20])),
    ('gbr', GradientBoostingClassifier()),
])

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
pipeline.fit(X_train, Y_train)
Y_predicted = pipeline.predict(X_test)


Apache Spark MLLib Pipelines


MLLib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
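To close the loop on this pipeline, a minimal sketch of fitting, evaluating, and persisting it (churn_data comes from the loading sketch above; the split ratios and save path are assumptions):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

train, test = churn_data.randomSplit([0.7, 0.3])

# Fitting the pipeline runs every stage; Spark distributes the work automatically.
model = pipeline.fit(train)
predictions = model.transform(test)

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
print(evaluator.evaluate(predictions))

# Persist the fitted pipeline for later scoring.
model.write().overwrite().save('/models/churn')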


Deploy!

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Well, how did you save your model?

You have a few options (a quick sketch of the first two follows):
• Pickle
• Joblib
• PMML
• Custom
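A minimal sketch of the first two options, applied to the scikit-learn pipeline from earlier (the file names are placeholders):

import pickle
import joblib

# Option 1: pickle the fitted pipeline to disk.
with open('churn_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Option 2: joblib, which stores large numpy arrays more efficiently.
joblib.dump(pipeline, 'churn_pipeline.joblib')

# Either way, scoring later means loading the object back into a Python process.
restored = joblib.load('churn_pipeline.joblib')
Y_predicted = restored.predict(X_test)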


“Pickles are for delis”
http://pyvideo.org/pycon-us-2014/pickles-are-for-delis-not-software.html

Pickled models are insecure, not portable, big, and slow.


Storing Models as PMML

// Export a Spark MLLib model to a local file in PMML format (Scala)
pipeline.toPMML("/path/to_my_file.xml")

# Export a scikit-learn model to a file in PMML format (Python)
from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr=True)


Spark PMML Export Supported Models


Distributed Model Fitting


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Model Training

(Diagram: Historic Data is split into Training Data and Test Data; the Model Pipeline (featurization, model fitting) is fit on the Training Data, checked by Model Evaluation on the Test Data, and saved as the Persisted Model.)


MLLib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])


Distributed Grid Search


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Model Training

(Diagram: many copies of the Model Training workflow, one per hyperparameter setting; each copy splits the data, fits and evaluates the Model Pipeline, and persists its model.)


Fit multiple models… Serially


from sklearn import ensemble, model_selection
from sklearn.grid_search import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300],
    "max_depth": [4],
    "learning_rate": [0.01],
    "min_samples_split": [1],
    "loss": ['ls', 'lad'],
}

gbr = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_


Fit multiple models… in Parallel


from sklearn import ensemble, model_selection
from sklearn.grid_search import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [1, 3],
    "loss": ['ls', 'lad'],
}

gbr = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters,
                   scoring="median_absolute_error", n_jobs=10, pre_dispatch=2)
clf.fit(X_train, Y_train)
best = clf.best_estimator_


Fit multiple models… Distributed

(Image credit: https://bigdatapix.tumblr.com/)


from sklearn import ensemble, model_selection
from spark_sklearn import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [1, 3],
    "loss": ['ls', 'lad'],
}

gbr = ensemble.GradientBoostingRegressor()
# spark_sklearn's GridSearchCV takes an existing SparkContext (sc) so that the
# parameter combinations are fit as Spark tasks across the cluster.
clf = GridSearchCV(sc, gbr, cv=3, param_grid=tuned_parameters, scoring="median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_


Distributed Model Scoring


What do you mean by “Deploy?”

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Scoring with REST Server

(Diagram: the Persisted Model is loaded by a Model Scoring service, which receives HTTP requests and returns HTTP responses with predictions.)
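As a rough sketch of such a service (Flask is one possible choice, not necessarily the talk's; the route, port, and payload layout are assumptions, and the model file matches the joblib sketch earlier):

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load('churn_pipeline.joblib')   # the persisted model

@app.route('/score', methods=['POST'])
def score():
    # Expect a JSON body like {"features": [[...], [...]]}.
    features = request.get_json()['features']
    predictions = model.predict(features).tolist()
    return jsonify({'predictions': predictions})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)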


Distributed Batch Model Scoring: With REST Server


Distributed Batch Model Scoring

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Distributed Batch Model Scoring: With REST Server
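No code is shown for this variant in the deck; as a rough sketch under clear assumptions (the service URL and payload layout match the Flask sketch above, and new_data is a DataFrame whose 'features' column holds plain numeric lists), each executor can post its partition of records to the scoring service:

import requests

def score_partition(rows):
    # One HTTP call per partition keeps request overhead manageable.
    payload = {'features': [row['features'] for row in rows]}
    response = requests.post('http://scoring-service:5000/score', json=payload)
    return iter(response.json()['predictions'])

# Each partition of new_data is scored by the REST service in parallel.
scores = new_data.rdd.mapPartitions(score_partition)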


Distributed Batch Model Scoring: With Spark + JPMML

// Load the persisted PMML file and build an Evaluator for it.
File pmmlFile = ...;
Evaluator evaluator = EvaluatorUtil.createEvaluator(pmmlFile);

// Wrap the evaluator as a Spark ML Transformer that scores DataFrames.
TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withTargetCols()
    .withOutputCols()
    .exploded(false);

Transformer pmmlTransformer = pmmlTransformerBuilder.build();


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Juliet Hougland @j_houg

Thank You!
