From Python to PySpark and Back Again - Unifying Single-host and Distributed Machine Learning with Maggy
Moritz Meister, @morimeister - Software Engineer, Logical Clocks
Jim Dowling, @jim_dowling - Associate Professor, KTH Royal Institute of Technology
ML Model Development
A simplified view
Feature Pipelines → Exploration → Experimentation → Model Training → Explainability and Validation → Serving
ML Model Development
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
It’s simple - only four steps
Artifacts and Non-DRY Code
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
What It’s Really Like… not linear but iterative
What It’s Really Really Like… not linear but iterative
Root Cause: Iterative Development of ML Models
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
EDA → HParam Tuning → Training (Dist) → Ablation Studies
Iterative Development Is a Pain, We Need DRY Code!
Each step requires different implementations of the training code.
OBLIVIOUS TRAINING FUNCTION
# RUNS ON THE WORKERS
def train():
    def input_fn():
        ...  # return dataset
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)
EDA → HParam Tuning → Training (Dist) → Ablation Studies
The Oblivious Training Function
Challenge: Obtrusive Framework Artifacts
▪ TF_CONFIG
▪ Distribution Strategy
▪ Dataset (Sharding, DFS)
▪ Integration in Python - hard from inside a notebook
▪ Keras vs. Estimator vs. Custom Training Loop
Example: TensorFlow
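To make these artifacts concrete, below is a hedged sketch of the boilerplate that multi-worker TensorFlow otherwise pushes into the training code; the host names, ports, and task index are hypothetical placeholders, and the exact strategy class may vary by TensorFlow version.

import json
import os

import tensorflow as tf

# Every worker must export its own TF_CONFIG before TensorFlow is initialized.
# Host names and the task index below are hypothetical placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host-1:2222", "host-2:2222"]},
    "task": {"type": "worker", "index": 0},
})

# The distribution strategy has to wrap model construction and compilation,
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd")

# ... and the input pipeline must also be sharded and its batch size scaled per
# worker, all of which ties the training code to one specific distribution context.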
Where is Deep Learning headed?
Productive High-Level APIs
Or why data scientists love Keras and PyTorch
[Diagram: the Idea → Experiment → Results loop, supported by Infrastructure, Framework, Tracking, and Visualization]
Francois Chollet, "Keras: The Next 5 Years"
[The same diagram, asking which platform fills the infrastructure layer: Hopsworks (Open Source), Databricks, Apache Spark, Cloud Providers]
How do we keep our high-level APIs transparent and productive?
What Is Transparent Code?
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import SGD

def dataset(batch_size):
    # MNIST training split, normalized and batched
    (x_train, y_train), _ = load_data()
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),  # add channel dimension for Conv2D
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model
The distributed setting reuses the exact same dataset() and build_and_compile_cnn_model() functions: NO CHANGES!
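As a point of reference, a minimal single-host run of these two functions might look like the following; batch size, learning rate, epochs, and steps_per_epoch are illustrative values, not taken from the slides.

# Single-host run of the functions defined above (illustrative values only).
train_ds = dataset(batch_size=64)
model = build_and_compile_cnn_model(lr=0.01)
model.fit(train_ds, epochs=3, steps_per_epoch=70)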
Building Blocks for Distribution Transparency
Distribution Context
Single-host vs. parallel multi-host vs. distributed multi-host
[Diagram: a single host; a parallel multi-host setup with a Driver acting as Experiment Controller over Workers 1..N; and a distributed multi-host setup with a Driver coordinating Workers 1-8 via TF_CONFIG]
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
Model Development Best Practices
▪ Modularize
▪ Parametrize
▪ Higher-order training functions
▪ Usage of callbacks at runtime
Building blocks: Dataset Generation, Model Generation, Training Logic (sketched below)
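A minimal sketch of such a factoring, reusing the MNIST functions from earlier; the function name, its parameters, and the returned metric are illustrative, not a prescribed Maggy signature.

# Hypothetical higher-order, parametrized training function: dataset
# generation, model generation and training logic are factored out, so the
# same function can be launched in any distribution context.
def training_function(lr=0.01, batch_size=64, epochs=3):
    train_ds = dataset(batch_size)               # dataset generation
    model = build_and_compile_cnn_model(lr)      # model generation
    history = model.fit(                         # training logic
        train_ds,
        epochs=epochs,
        steps_per_epoch=70,
        callbacks=[tf.keras.callbacks.EarlyStopping(monitor="loss", patience=2)])
    return history.history["loss"][-1]           # metric a controller can compare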
Oblivious Training Function as an Abstraction
Let the system handle the complexities
System takes care of:
▪ fixing parameters and launching the function
▪ launching trials (parametrized instantiations of the function), generating new trials, collecting and logging results
▪ setting up TF_CONFIG, wrapping the function in a Distribution Strategy, launching it on the workers, collecting results (illustrated in the sketch below)
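As an illustration of what "the system takes care of" might mean for the distributed case, here is a simplified launcher sketch; it is not Maggy's or Hopsworks' actual implementation, and the cluster configuration shown is a hypothetical placeholder.

import json
import os

import tensorflow as tf

# Simplified, illustrative launcher: the system injects the framework
# artifacts so the user's training function stays unchanged.
def launch_distributed(training_function, cluster_config, task_index):
    # ... setting up TF_CONFIG for this worker
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": cluster_config,
        "task": {"type": "worker", "index": task_index},
    })
    # ... wrapping the user code in a Distribution Strategy
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # ... launching the unchanged function and collecting its result
        return training_function()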
Maggy
Spark+AI Summit 2019
Today
With Hopsworks and Maggy, we provide a unified development and execution environment for distribution-transparent ML model development.
Make the Oblivious Training Function a core abstraction on Hopsworks
Hopsworks - Award-Winning Platform
Recap: Maggy - Asynchronous Trials on Spark
Spark is bulk-synchronous.
[Diagram: trials run as Spark tasks (Task 1-1…1-N, 2-1…2-N, 3-1…3-N) over HopsFS; the Driver only collects Metrics 1-3 at each stage barrier, so early-stopped trials sit idle until the barrier, leaving wasted compute]
Recap: The Solution - Add Communication and Long-Running Tasks
[Diagram: a single stage of long-running tasks (Task 1-1…1-N) that exchange messages with the Driver during execution, sending metrics and receiving new trials, with one final barrier]
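To make the pattern concrete, here is a purely illustrative sketch of long-running tasks reporting metrics to a driver-side controller that can stop trials early; it is not Maggy's actual RPC implementation, and all names are hypothetical.

# Illustrative only: the communication pattern behind long-running tasks.
class ExperimentController:
    """Driver side: collects metrics, logs them, and issues decisions."""

    def __init__(self, patience=3):
        self.best = float("inf")
        self.bad_steps = 0
        self.patience = patience

    def report(self, trial_id, step, metric):
        # Track the best metric seen; stop the trial after `patience` bad steps.
        if metric < self.best:
            self.best, self.bad_steps = metric, 0
        else:
            self.bad_steps += 1
        return "STOP" if self.bad_steps >= self.patience else "CONTINUE"

def long_running_task(controller, trial_id, train_step_fn, max_steps=1000):
    """Worker side: train, report after each step, and react to the driver."""
    for step in range(max_steps):
        metric = train_step_fn(step)
        if controller.report(trial_id, step, metric) == "STOP":
            break  # stop early instead of idling until a stage barrier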
What's New? Worker discovery and distribution context set-up
[Diagram: the Driver discovers the workers (Task 1-1…1-N) and then launches the oblivious training function on them in the chosen distribution context, ending at a barrier]
What’s New: Distribution Context
# Hyperparameter optimization context
sp = maggy.optimization.Searchspace(...)
maggy.set_context('optimization')
maggy.lagom(training_function, sp)

# Distributed training context
dist_strat = tf.distribute.MirroredStrategy(...)
maggy.set_context('distributed_training')
maggy.lagom(training_function, dist_strat)

# Ablation study context
ab = maggy.ablation.AblationStudy(...)
maggy.set_context('ablation')
maggy.lagom(training_function, ab)
DEMO
What’s Next
Extend the platform to provide a unified development and execution environment for distribution-transparent Jupyter Notebooks.
Summary
▪ Moving between distribution contexts requires code rewriting
▪ Factor out obtrusive framework artifacts
▪ Let the system handle the distribution context
▪ Keep productive high-level APIs
Thank You!
Get Started
hopsworks.ai
github.com/logicalclocks/maggy
Twitter
@morimeister
@jim_dowling
@logicalclocks
@hopsworks
Web
www.logicalclocks.com
Contributions from colleagues
▪ Sina Sheikholeslami
▪ Robin Andersson
▪ Alex Ormenisan
▪ Kai Jeggle
Thanks to the Logical Clocks Team!
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.