Automation and optimisation of machine learning pipelines on top of Apache Spark
Peter Rudenko (@peter_rud)
DataRobot data pipeline
● Data upload
● Exploratory data analysis
● Training models, selecting best models & hyperparameters
● Models leaderboard
● Prediction API
Our journey to Apache Spark
PySpark vs Scala API?
Spark worker (JVM) ↔ Python process

Sending instructions (crosses the boundary via py4j):
df.agg({"age": "max"})
FAST!

Sending data (crosses the boundary via IPC/serde):
data.map(lambda x: …)
data.filter(lambda x: …)
SLOW!
Our journey to Apache Spark
RDD vs DataFrame
RDD: RDD[Row[(Double, String, Vector)]]

DataFrame:
● (DoubleType, nullable=true) + Attributes (in spark-1.4)
● (StringType, nullable=true) + Attributes (in spark-1.4)
● (VectorType, nullable=true) + Attributes (in spark-1.4)

Attributes: NumericAttribute, NominalAttribute (Ordinal), BinaryAttribute
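As a rough sketch of how such an attribute can be attached to a DataFrame column via metadata (the column name and value set below are illustrative assumptions, not from the talk):

import org.apache.spark.ml.attribute.NominalAttribute

// Hypothetical example: mark a string-indexed column as nominal with known levels.
val colorAttr = NominalAttribute.defaultAttr
  .withName("color")
  .withValues(Array("red", "green", "blue"))

// Attach the attribute to the column as metadata.
val withMeta = df.withColumn("color", df("color").as("color", colorAttr.toMetadata()))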
Our journey to Apache Spark
MLlib vs ML

MLlib:
● Low-level implementations of machine learning algorithms
● Based on RDDs

ML:
● High-level pipeline abstractions
● Based on DataFrames
● Uses MLlib under the hood
● Columnar format
● Compression
● Scan optimization
● Null-imputer improvement
- val na2mean = { value: Double =>
-   if (value.isNaN) meanValue else value
- }
- dataset.withColumn(map(outputCol), callUDF(na2mean, DoubleType, dataset(map(inputCol))))
+ dataset.na.fill(map(inputCols).zip(meanValues).toMap)
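For illustration, a minimal sketch of the DataFrame-native approach (the column names and mean values below are made up):

// Replace missing values in numeric columns with precomputed means.
val meansByColumn = Map("age" -> 33.4, "income" -> 52000.0)
val imputed = dataset.na.fill(meansByColumn)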
Typical machine learning pipeline
● Feature extraction
● Missing values imputation
● Variable encoding
● Dimensionality reduction
● Training the model (finding the optimal model parameters)
● Selecting hyperparameters
● Model evaluation on some metric (AUC, R², RMSE, etc.)

Train data (features + label) → Model state (parameters + hyperparameters)
Test data (features) + Model state → Prediction
Introducing Blueprint
Pipeline config:

pipeline: {
  "1": { input: ["NUM"], class: "org.apache.spark.ml.feature.MeanImputor" },
  "2": { input: ["CAT"], class: "org.apache.spark.ml.feature.OneHotEncoder" },
  "3": { input: ["1", "2"], class: "org.apache.spark.ml.feature.VectorAssembler" },
  "4": { input: "3", class: "org.apache.spark.ml.classification.LogisticRegression",
         params: { optimizer: "LBFGS", regParam: [0.5, 0.1, 0.01, 0.001] } }
}
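One plausible way to interpret such a config (a sketch of the general idea only, not the actual Blueprint internals; loadStage is an invented helper) is to instantiate each stage reflectively from its class name:

import org.apache.spark.ml.PipelineStage

// Hypothetical helper: turn a class name from the config into a pipeline stage.
def loadStage(className: String): PipelineStage =
  Class.forName(className).newInstance().asInstanceOf[PipelineStage]

val encoder = loadStage("org.apache.spark.ml.feature.OneHotEncoder")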
Introducing Blueprint: YARN cluster
[Diagram: Blueprint → Spark jobserver → YARN cluster]
Transformer (pure function)

abstract class Transformer extends PipelineStage with Params {
  /**
   * Transforms the dataset with provided parameter map as additional parameters.
   * @param dataset input dataset
   * @param paramMap additional parameters, overwrite embedded params
   * @return transformed dataset
   */
  def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
}
Example:
(new HashingTF)
  .setInputCol("categorical_column")
  .setOutputCol("Hashing_tf_1")
  .setNumFeatures(1 << 20)
  .transform(data)
Estimator

abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
  /**
   * Fits a single model to the input data with optional parameters.
   *
   * @param dataset input dataset
   * @param paramPairs optional list of param pairs;
   *                   these values override any specified in this Estimator's embedded ParamMap
   * @return fitted model
   */
  @varargs
  def fit(dataset: DataFrame, paramPairs: ParamPair[_]*): M = {
    val map = ParamMap(paramPairs: _*)
    fit(dataset, map)
  }
}
Example:
val oneHotEncoderModel = (new OneHotEncoder)
  .setInputCol("vector_col")
  .fit(trainingData)

oneHotEncoderModel.transform(trainingData)
oneHotEncoderModel.transform(testData)
Estimator => Transformer
Predictor: an Estimator that predicts a value

Predictor
├── Classifier
│   └── ProbabilisticClassifier
└── Regressor
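For instance, LogisticRegression is a ProbabilisticClassifier: its fitted model emits both a prediction column and a class-probability column (default column names assumed below):

val lrModel = new LogisticRegression().fit(trainingData)
lrModel.transform(testData).select("prediction", "probability")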
Evaluator

abstract class Evaluator extends Identifiable {
  /**
   * Evaluates the output.
   *
   * @param dataset a dataset that contains labels/observations and predictions
   * @param paramMap parameter map that specifies the input columns and output metrics
   * @return metric
   */
  def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
}
Example:
val areaUnderROC = (new BinaryClassificationEvaluator)
  .setScoreCol("prediction")
  .evaluate(data)
Pipeline
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
[Diagram: Input data → Tokenizer → HashingTF → LogisticRegression; calling fit produces a PipelineModel]
Estimator that encapsulates other transformers / estimators
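Putting the stages above together (trainingData and testData are illustrative names; the input is assumed to have "text" and "label" columns):

// pipeline.fit returns a PipelineModel, itself a Transformer.
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)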
CrossValidator

val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
val cvModel = crossval.fit(training.toDF)
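The fitted CrossValidatorModel keeps the best model found over the 3 × 2 = 6 grid points. A small usage sketch (testData is assumed):

// Best pipeline across all folds and grid points:
cvModel.bestModel.transform(testData)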
[Diagram: Input data → Tokenizer → HashingTF → LogisticRegression, fit via CrossValidator over numFeatures: {10, 100, 1000} and regParam: {0.1, 0.01} across 3 folds, producing a CrossValidatorModel]
Pluggable backend
● H2O
● Flink
● DeepLearning4J
● http://keystone-ml.org/
● etc.
Optimization
● Disable k-fold cross-validation
● Minimize redundant pre-processing
● Parallel grid search (see the sketch after this list)
● Parallel DAG pipeline
● Pluggable optimizer
● Non-grid-search hyperparameter optimization (Bayesian & hypergradient):
  http://arxiv.org/pdf/1502.03492v2.pdf
  http://arxiv.org/pdf/1206.2944.pdf
  http://arxiv.org/pdf/1502.05700v1.pdf
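A rough sketch of parallel grid search with Scala parallel collections, using the fit(dataset, paramMap) overload shown earlier (estimator, train, and paramGrid are assumed to be in scope; cluster-level scheduling concerns are ignored):

// Launch one independent fit per ParamMap; Spark schedules the jobs concurrently.
val models = paramGrid.par.map { params => estimator.fit(train, params) }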
Minimize redundant pre-processing
regParam: 0.1 and regParam: 0.01 share the same pre-processing stages, but Spark does not deduplicate that work automatically:

val rdd1 = rdd.map(function)
val rdd2 = rdd.map(function)
rdd1 != rdd2
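One way to avoid the recomputation is to cache the shared pre-processing once and sweep only the hyperparameter (a sketch; prep is a hypothetical pre-processing pipeline, and the regParam values are taken from the blueprint config above):

// Cache the shared pre-processed data once, then sweep regParam over it.
val prepared = prep.fit(training).transform(training).cache()
val models = Seq(0.5, 0.1, 0.01, 0.001).map { reg =>
  new LogisticRegression().setRegParam(reg).fit(prepared)
}
prepared.unpersist() // don't forget to clean up the cache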
Summary
● Good model != good result. Feature engineering is the key.
● Spark provides a good abstraction, but you need to tune some parts to achieve good performance.
● The ML pipeline API gives you pluggable and reusable building blocks.
● Don't forget to clean up after yourself (unpersist cache).
Thanks!
Demo & Q&A