Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Bay Area Spark Meetup, Nov 8, 2017
Sue Ann Hong, Databricks
This talk
• Deep Learning at scale: current state
• Deep Learning Pipelines: the philosophy
• End-to-end workflow with DL Pipelines
• Future
Deep Learning at Scale: current state
What is Deep Learning?
• A set of machine learning techniques that use layers that transform numerical inputs
  • Classification
  • Regression
  • Arbitrary mapping
• Popular in the 80’s as Neural Networks
• Recently came back thanks to advances in data collection, computation techniques, and hardware
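To make "layers that transform numerical inputs" concrete, here is a minimal pure-Python sketch of a two-layer forward pass. The weights, layer sizes, and helper names (`dense`, `relu`, `forward`) are made up for illustration and do not come from the talk.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linearity applied between layers: negative values become zero."""
    return [max(0.0, v) for v in values]


# Hand-picked illustrative weights (a real network learns these from data).
W1 = [[1.0, -1.0], [0.5, 0.5]]  # layer 1: 2 inputs -> 2 hidden units
B1 = [0.0, 0.0]
W2 = [[1.0, 2.0]]               # layer 2: 2 hidden units -> 1 output
B2 = [0.5]


def forward(x):
    """Pass the input through both layers in sequence."""
    hidden = relu(dense(x, W1, B1))
    return dense(hidden, W2, B2)


output = forward([2.0, 1.0])
```

Classification, regression, and arbitrary mappings differ mainly in what the final layer's output is taken to mean.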
Success of Deep Learning
Tremendous success for applications with complex data
• AlphaGo
• Image interpretation
• Automatic translation
• Speech recognition
Deep Learning is often challenging
It demands labeled data, compute resources & time, and engineer hours.
• Tedious or difficult to distribute computations
• No exact science around deep learning → lots of tweaking
• Low level APIs with steep learning curve
• Complex models → need a lot of data
Deep Learning in industry
• Currently limited adoption
• Huge potential beyond the industrial giants
• How do we accelerate the road to massive availability?
Deep Learning Pipelines
Deep Learning Pipelines: Deep Learning with Simplicity
• Open-source Databricks library
• Focuses on ease of use and integration, without sacrificing performance
Deep Learning is often challenging
• Tedious or difficult to distribute computations
• No exact science around deep learning → lots of tweaking
• Low level APIs with steep learning curve
• Complex models → need a lot of data
Instead, common workflows should
• Be easy to scale
• Require little tweaking
• Be easy to write
• Require little or no data
How
• Be easy to scale → use Apache Spark for scaling out common tasks
• Require little tweaking → leverage well-known model architectures
• Be easy to write → integrate with the MLlib Pipelines API to capture the ML workflow concisely
• Require little or no data → leverage pre-trained models for common tasks
Demo: Build a visual recommendation AI
• 10 minutes
• 7 lines of code
• 0 labels
• Elastic scale-out using Apache Spark
• MLlib Pipelines API
• Leverage pre-trained models
To the notebook!
End-to-End Workflow with Deep Learning Pipelines
A typical Deep Learning workflow
• Load data (images, text, time series, …)
• Interactive work
• Train
  • Select an architecture for a neural network
  • Optimize the weights of the NN
• Evaluate results, potentially re-train
• Apply:
  • Pass the data through the NN to produce new features or output
A typical Deep Learning workflow with Deep Learning Pipelines
• Load data: image loading in Spark
• Interactive work: pre-trained models
• Train: transfer learning, distributed tuning
• Evaluate
• Apply: distributed batch prediction, deploying models in SQL
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Adds support for images in Spark
• ImageSchema, reader, conversion functions to/from numpy arrays
• Most of the tools we’ll describe work on ImageSchema columns
from sparkdl import readImages
image_df = readImages(sample_img_dir)
Upcoming: built-in support in Spark
• SPARK-21866
• Contributing image format & reading to Spark
• Targeted for Spark 2.3
• Joint work with Microsoft
Applying popular models
• Popular pre-trained models accessible through MLlib Transformers
from sparkdl import DeepImagePredictor
predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3")
predictions_df = predictor.transform(image_df)
Deep Learning Pipelines
• Load data
• Interactive work
• Train
  • Hyperparameter tuning
  • Transfer learning
• Evaluate model
• Apply
Transfer learning
• Pre-trained models may not be directly applicable
  • New domain, e.g. shoes
• Training from scratch requires
  • Enormous amounts of data
  • A lot of compute resources & time
• Idea: intermediate representations learned for one task may be useful for other related tasks
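Transfer Learning
[Figure: a pre-trained network ending in a SoftMax layer that outputs class probabilities (GIANT PANDA 0.9, RACCOON 0.05, RED PANDA 0.01, …). In the later builds a Classifier appears in place of the SoftMax output, producing Rose: 0.7, Daisy: 0.3.]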
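The SoftMax step named in the figure turns a network's raw scores into probabilities that sum to 1. A minimal sketch follows; the scores are hypothetical values chosen for illustration, not the output of any real model.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["giant panda", "raccoon", "red panda"]
scores = [5.0, 2.0, 0.5]  # hypothetical raw scores from the last layer
probs = softmax(scores)

# Rank labels from most to least likely.
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
```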
Transfer Learning as a Pipeline
[Diagram: an MLlib Pipeline made up of Image Loading, Preprocessing, DeepImageFeaturizer, and Logistic Regression]
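from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)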
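The idea behind this pipeline can be sketched without Spark: a frozen featurizer produces features, and only a small classifier on top is trained. In the sketch below, `pretrained_featurizer` is a toy stand-in for a pre-trained network, and the hand-written logistic regression stands in for MLlib's. The data and all names are assumptions for illustration, not the sparkdl API.

```python
import math


def pretrained_featurizer(image):
    """Stand-in for a frozen pre-trained network: maps an 'image'
    (a list of pixel intensities) to a feature vector. It is never updated."""
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return [mean, spread]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Train only the classifier, by stochastic gradient descent on log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(model, image):
    """Featurize with the frozen network, then score with the trained classifier."""
    weights, bias = model
    x = pretrained_featurizer(image)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy labeled data for the new domain: bright images are 1, dark images are 0.
train_images = [[0.9, 0.8, 1.0], [0.8, 0.9, 0.7], [0.1, 0.2, 0.0], [0.2, 0.1, 0.3]]
train_labels = [1, 1, 0, 0]

train_features = [pretrained_featurizer(img) for img in train_images]
model = fit_logistic_regression(train_features, train_labels)
```

Because only the two weights and the bias are learned, a handful of labeled examples is enough here. That is the property transfer learning relies on.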
Transfer Learning
• Usually for classification tasks
  • Similar task, new domain
• But other forms of learning leveraging learned representations can be loosely considered transfer learning
Featurization for similarity-based ML
[Diagram: Image Loading, Preprocessing, and DeepImageFeaturizer, followed by similarity-based methods where the previous pipeline had Logistic Regression]
• Clustering: KMeans, GaussianMixture
• Nearest Neighbor: KNN, LSH
• Distance computation
Applications
• Duplicate detection
• Recommendation
• Anomaly detection
• Search result diversification
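Once images are turned into feature vectors, similarity-based tasks reduce to distance computations between those vectors. The sketch below ranks items by cosine similarity, as a recommendation step might. The item names and three-dimensional vectors are invented for illustration; real featurizer output would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, features, k=2):
    """Return the ids of the k items whose features are closest to the query item's."""
    query = features[query_id]
    scored = [(other_id, cosine_similarity(query, vec))
              for other_id, vec in features.items() if other_id != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other_id for other_id, _ in scored[:k]]


# Hypothetical feature vectors, as if produced by an image featurizer.
features = {
    "sneaker_a": [0.9, 0.1, 0.0],
    "sneaker_b": [0.8, 0.2, 0.1],
    "boot_a": [0.1, 0.9, 0.2],
    "sandal_a": [0.0, 0.2, 0.9],
}

recommendations = most_similar("sneaker_a", features, k=2)
```

This brute-force comparison is quadratic in the number of items. Approximate methods such as LSH, listed on the slide, exist to avoid that cost at scale.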
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
  • Spark SQL
  • Batch prediction
Batch prediction as an MLlib Transformer
• A model is a Transformer in MLlib
• DataFrame goes in, DataFrame comes out with output columns
predictor = XXTransformer(inputCol="image", outputCol="prediction", modelSpecification={…})
predictions = predictor.transform(test_df)
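The Transformer contract ("DataFrame goes in, DataFrame comes out with output columns") can be mimicked in plain Python with a list of dicts standing in for a DataFrame. `ColumnTransformer` and `brightness_model` are hypothetical names for this sketch, not classes from MLlib or sparkdl.

```python
class ColumnTransformer:
    """Hypothetical stand-in for an MLlib-style Transformer: reads one column,
    applies a model to each value, and writes the result to a new column."""

    def __init__(self, inputCol, outputCol, model_fn):
        self.inputCol = inputCol
        self.outputCol = outputCol
        self.model_fn = model_fn

    def transform(self, rows):
        # Build new rows instead of mutating the input, the way a DataFrame
        # transformation returns a new DataFrame.
        return [dict(row, **{self.outputCol: self.model_fn(row[self.inputCol])})
                for row in rows]


def brightness_model(image):
    """Toy 'model': labels an image by its mean pixel intensity."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


test_rows = [
    {"id": 1, "image": [0.9, 0.8]},
    {"id": 2, "image": [0.1, 0.2]},
]

predictor = ColumnTransformer(inputCol="image", outputCol="prediction",
                              model_fn=brightness_model)
predictions = predictor.transform(test_rows)
```

Since every row is processed independently, the same call can be spread across a cluster. That independence is what makes batch prediction a natural fit for Spark.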
Hierarchy of DL transformers for images
Input (Image) → Output (vector)
• TFImageTransformer: tf.Graph
• KerasImageTransformer: keras.Model
• NamedImageTransformer: model_name
• DeepImageFeaturizer, DeepImagePredictor: model_name
Hierarchy of DL transformers
The same layering for text, Input (Text) → Output (vector)
• TFTextTransformer: tf.Graph
• KerasTextTransformer: keras.Model
• NamedTextTransformer: model_name
• DeepTextFeaturizer, DeepTextPredictor: model_name
• TFTransformer
Common Tasks Easy, Advanced Ones Possible
• Pre-built Solutions
• 80%-built Solutions
• Self-built Solutions
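The layering in this hierarchy, where a transformer is configured by a model name, a model object, or a raw computation, can be illustrated with a small class hierarchy. This is a sketch under assumptions: the class names, the `MODEL_REGISTRY`, and the use of inheritance are invented here and do not describe how the library is implemented.

```python
class FunctionTransformer:
    """Lowest level: the caller supplies the computation itself
    (a plain function here, standing in for a tf.Graph)."""

    def __init__(self, inputCol, outputCol, fn):
        self.inputCol = inputCol
        self.outputCol = outputCol
        self.fn = fn

    def transform(self, rows):
        return [dict(row, **{self.outputCol: self.fn(row[self.inputCol])})
                for row in rows]


class ModelTransformer(FunctionTransformer):
    """Middle level: the caller supplies a model object with predict()
    (standing in for a keras.Model)."""

    def __init__(self, inputCol, outputCol, model):
        super().__init__(inputCol, outputCol, model.predict)


class MeanModel:
    """Toy 'pre-trained' model: the single feature is the mean pixel value."""

    def predict(self, image):
        return [sum(image) / len(image)]


# Hypothetical registry of named models; the name is invented for this sketch.
MODEL_REGISTRY = {"ToyMeanModel": MeanModel()}


class NamedTransformer(ModelTransformer):
    """Top level: the caller supplies only a model name."""

    def __init__(self, inputCol, outputCol, modelName):
        super().__init__(inputCol, outputCol, MODEL_REGISTRY[modelName])


rows = [{"image": [0.2, 0.4]}]
by_name = NamedTransformer("image", "features", "ToyMeanModel").transform(rows)
by_model = ModelTransformer("image", "features", MeanModel()).transform(rows)
by_fn = FunctionTransformer("image", "features",
                            lambda img: [sum(img) / len(img)]).transform(rows)
```

All three calls yield the same output. Each level down asks the caller to supply more and in return allows more customization.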
Deep Learning Pipelines: Future
In progress
• Text featurization (embeddings)
• TFTransformer for arbitrary vectors
Future directions
• Non-image data domains: video, text, speech, …
• Distributed training
• Support for more backends, e.g. MXNet, PyTorch, BigDL
Questions?
Thank you!
Resources
DL Pipelines GitHub Repo, Spark Summit Europe 2017 Deep Dive
Blog posts & webinars (http://databricks.com/blog)
• Deep Learning Pipelines
• GPU acceleration in Databricks
• BigDL on Databricks
• Deep Learning and Apache Spark
Docs for Deep Learning on Databricks (http://docs.databricks.com)
• Getting started
• Deep Learning Pipelines Example
• Spark integration