
Spark & Machine Learning Workflows
Juliet Hougland @j_houg


Spark Execution Model
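The execution-model diagram isn't reproduced here. As a minimal sketch of the idea (lazy transformations build a plan on the driver; an action triggers tasks on the executors), assuming only a local PySpark install; the file path and column are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('execution-model-demo').getOrCreate()

# Transformations are lazy: these lines only build an execution plan on the driver.
df = spark.read.csv('churn.all', inferSchema=True)   # placeholder path
high_usage = df.filter(df['_c7'] > 200.0)            # placeholder column

# An action forces Spark to schedule tasks on the executors and return a result.
print(high_usage.count())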


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Model Training

(Diagram: Historic Data is split into Training Data and Test Data; the Model Pipeline (featurization, model fitting) is fit on the Training Data, checked by Model Evaluation on the Test Data, and saved as the Persisted Model.)


Pipelines


Real Example: Churn Prediction for a Telco


The Dataset

KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
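A minimal sketch of loading these rows into a Spark DataFrame (assuming an existing SparkSession named spark; the file name is a placeholder, and the renamed columns follow the usual telco-churn layout of state, intl_plan, and churned):

# The file has no header row, so Spark names the 21 columns _c0 .. _c20.
raw = spark.read.csv('churn.all', inferSchema=True)

churn_data = (raw
    .withColumnRenamed('_c0', 'state')        # first field in the rows above
    .withColumnRenamed('_c4', 'intl_plan')    # the yes/no international-plan flag
    .withColumnRenamed('_c20', 'churned'))    # the True/False label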


Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

X, Y = get_data()
gbr = GradientBoostingClassifier()
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
gbr.fit(X_train, Y_train)
Y_predicted = gbr.predict(X_test)


Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X, Y = get_data()
pipeline = Pipeline([
    ('ohe', OneHotEncoder(categorical_features=[0, 20])),
    ('gbr', GradientBoostingClassifier()),
])

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
pipeline.fit(X_train, Y_train)
Y_predicted = pipeline.predict(X_test)


Apache Spark MLLib Pipelines


MLLib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
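To close the loop on this pipeline, a minimal sketch of fitting, evaluating, and persisting it (churn_data comes from the loading sketch above; the split ratios and save path are assumptions):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

train, test = churn_data.randomSplit([0.7, 0.3])

# Fitting the pipeline runs every stage; Spark distributes the work automatically.
model = pipeline.fit(train)
predictions = model.transform(test)

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
print(evaluator.evaluate(predictions))

# Persist the fitted pipeline for later scoring.
model.write().overwrite().save('/models/churn')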


Deploy!

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Well, how did you save your model?

You have a few options (a quick sketch of the first two follows):
• Pickle
• Joblib
• PMML
• Custom
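A minimal sketch of the first two options, applied to the scikit-learn pipeline from earlier (the file names are placeholders):

import pickle
import joblib

# Option 1: pickle the fitted pipeline to disk.
with open('churn_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Option 2: joblib, which stores large numpy arrays more efficiently.
joblib.dump(pipeline, 'churn_pipeline.joblib')

# Either way, scoring later means loading the object back into a Python process.
restored = joblib.load('churn_pipeline.joblib')
Y_predicted = restored.predict(X_test)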


“Pickles are for delis”
http://pyvideo.org/pycon-us-2014/pickles-are-for-delis-not-software.html

Pickled models are insecure, not portable, big, and slow.


Storing Models as PMML

// Export a Spark MLLib model to a local file in PMML format (Scala)
pipeline.toPMML("/path/to_my_file.xml")

# Export a scikit-learn model to a file in PMML format (Python)
from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr=True)


Spark PMML Export Supported Models


Distributed Model Fitting


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Model Training

(Diagram: Historic Data is split into Training Data and Test Data; the Model Pipeline (featurization, model fitting) is fit on the Training Data, checked by Model Evaluation on the Test Data, and saved as the Persisted Model.)


MLLib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])


Distributed Grid Search


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Model Training

(Diagram: many copies of the Model Training workflow, one per hyperparameter setting; each copy splits the data, fits and evaluates the Model Pipeline, and persists its model.)


Fit multiple models… Serially


from sklearn import ensemble, model_selection
from sklearn.grid_search import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300],
    "max_depth": [4],
    "learning_rate": [0.01],
    "min_samples_split": [1],
    "loss": ['ls', 'lad'],
}

gbr = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_


Fit multiple models… in Parallel


from sklearn import ensemble, model_selection
from sklearn.grid_search import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [1, 3],
    "loss": ['ls', 'lad'],
}

gbr = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters,
                   scoring="median_absolute_error", n_jobs=10, pre_dispatch=2)
clf.fit(X_train, Y_train)
best = clf.best_estimator_


Fit multiple models… Distributed

(Image credit: https://bigdatapix.tumblr.com/)


from sklearn import ensemble, model_selection
from spark_sklearn import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [1, 3],
    "loss": ['ls', 'lad'],
}

gbr = ensemble.GradientBoostingRegressor()
# spark_sklearn's GridSearchCV takes an existing SparkContext (sc) so that the
# parameter combinations are fit as Spark tasks across the cluster.
clf = GridSearchCV(sc, gbr, cv=3, param_grid=tuned_parameters, scoring="median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_


Distributed Model Scoring


What do you mean by “Deploy?”

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Scoring with REST Server

(Diagram: the Persisted Model is loaded by a Model Scoring service, which receives HTTP requests and returns HTTP responses with predictions.)
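As a rough sketch of such a service (Flask is one possible choice, not necessarily the talk's; the route, port, and payload layout are assumptions, and the model file matches the joblib sketch earlier):

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load('churn_pipeline.joblib')   # the persisted model

@app.route('/score', methods=['POST'])
def score():
    # Expect a JSON body like {"features": [[...], [...]]}.
    features = request.get_json()['features']
    predictions = model.predict(features).tolist()
    return jsonify({'predictions': predictions})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)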


Distributed Batch Model Scoring: With REST Server


Distributed Batch Model Scoring

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Distributed Batch Model Scoring: With REST Server
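No code is shown for this variant in the deck; as a rough sketch under clear assumptions (the service URL and payload layout match the Flask sketch above, and new_data is a DataFrame whose 'features' column holds plain numeric lists), each executor can post its partition of records to the scoring service:

import requests

def score_partition(rows):
    # One HTTP call per partition keeps request overhead manageable.
    payload = {'features': [row['features'] for row in rows]}
    response = requests.post('http://scoring-service:5000/score', json=payload)
    return iter(response.json()['predictions'])

# Each partition of new_data is scored by the REST service in parallel.
scores = new_data.rdd.mapPartitions(score_partition)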


Distributed Batch Model Scoring: With Spark + JPMML

// Load the persisted PMML file and build an Evaluator for it.
File pmmlFile = ...;
Evaluator evaluator = EvaluatorUtil.createEvaluator(pmmlFile);

// Wrap the evaluator as a Spark ML Transformer that scores DataFrames.
TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withTargetCols()
    .withOutputCols()
    .exploded(false);

Transformer pmmlTransformer = pmmlTransformerBuilder.build();


Modeling Lifecycle

(Diagram: Historic Data feeds Model Training, which produces a Persisted Model; the Persisted Model plus New Data feed Model Scoring, which produces the Model Result.)


Juliet Hougland @j_houg

Thank You!
