From Python to PySpark and Back Again - Unifying Single-host and Distributed Machine Learning with Maggy
Moritz Meister, @morimeister - Software Engineer, Logical Clocks
Jim Dowling, @jim_dowling - Associate Professor, KTH Royal Institute of Technology
ML Model Development
A simplified view
Feature Pipelines → Exploration → Experimentation → Model Training → Explainability and Validation → Serving
ML Model Development
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
It’s simple - only four steps
Artifacts and Non-DRY Code
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
What It’s Really Like… not linear but iterative
What It’s Really Really Like… not linear but iterative
Root Cause: Iterative Development of ML Models
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
EDA → HParam Tuning → Training (Dist) → Ablation Studies
Iterative Development Is a Pain, We Need DRY Code!
Each step requires different implementations of the training code.
OBLIVIOUS TRAINING FUNCTION
# RUNS ON THE WORKERS
def train():
    def input_fn():
        ...  # return dataset
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)
EDA → HParam Tuning → Training (Dist) → Ablation Studies
The Oblivious Training Function
Challenge: Obtrusive Framework Artifacts
▪ TF_CONFIG
▪ Distribution Strategy
▪ Dataset (Sharding, DFS)
▪ Integration in Python - hard from inside a notebook
▪ Keras vs. Estimator vs. Custom Training Loop
Example: TensorFlow
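To make these artifacts concrete, below is a hedged sketch of the boilerplate that multi-worker TensorFlow otherwise pushes into the training code; the host names, ports, and task index are hypothetical placeholders, and the exact strategy class may vary by TensorFlow version.

import json
import os

import tensorflow as tf

# Every worker must export its own TF_CONFIG before TensorFlow is initialized.
# Host names and the task index below are hypothetical placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host-1:2222", "host-2:2222"]},
    "task": {"type": "worker", "index": 0},
})

# The distribution strategy has to wrap model construction and compilation,
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd")

# ... and the input pipeline must also be sharded and its batch size scaled per
# worker, all of which ties the training code to one specific distribution context.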
Where is Deep Learning headed?
Productive High-Level APIs
Or why data scientists love Keras and PyTorch
[Diagram: the Idea → Experiment → Results loop, supported by Infrastructure, Framework, Tracking, and Visualization]
Francois Chollet, "Keras: The Next 5 Years"
[The same diagram, asking which platform fills the infrastructure layer: Hopsworks (Open Source), Databricks, Apache Spark, Cloud Providers]
How do we keep our high-level APIs transparent and productive?
What Is Transparent Code?
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import SGD

def dataset(batch_size):
    # MNIST training split, normalized and batched
    (x_train, y_train), _ = load_data()
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),  # add channel dimension for Conv2D
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model
The distributed setting reuses the exact same dataset() and build_and_compile_cnn_model() functions: NO CHANGES!
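As a point of reference, a minimal single-host run of these two functions might look like the following; batch size, learning rate, epochs, and steps_per_epoch are illustrative values, not taken from the slides.

# Single-host run of the functions defined above (illustrative values only).
train_ds = dataset(batch_size=64)
model = build_and_compile_cnn_model(lr=0.01)
model.fit(train_ds, epochs=3, steps_per_epoch=70)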
Building Blocks for Distribution Transparency
Distribution Context
Single-host vs. parallel multi-host vs. distributed multi-host
[Diagram: a single host; a parallel multi-host setup with a Driver acting as Experiment Controller over Workers 1..N; and a distributed multi-host setup with a Driver coordinating Workers 1-8 via TF_CONFIG]
Explore and Design
Experimentation: Tune and Search
Model Training (Distributed)
Explainability and Ablation Studies
Model Development Best Practices
▪ Modularize
▪ Parametrize
▪ Higher-order training functions
▪ Usage of callbacks at runtime
Building blocks: Dataset Generation, Model Generation, Training Logic (sketched below)
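A minimal sketch of such a factoring, reusing the MNIST functions from earlier; the function name, its parameters, and the returned metric are illustrative, not a prescribed Maggy signature.

# Hypothetical higher-order, parametrized training function: dataset
# generation, model generation and training logic are factored out, so the
# same function can be launched in any distribution context.
def training_function(lr=0.01, batch_size=64, epochs=3):
    train_ds = dataset(batch_size)               # dataset generation
    model = build_and_compile_cnn_model(lr)      # model generation
    history = model.fit(                         # training logic
        train_ds,
        epochs=epochs,
        steps_per_epoch=70,
        callbacks=[tf.keras.callbacks.EarlyStopping(monitor="loss", patience=2)])
    return history.history["loss"][-1]           # metric a controller can compare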
Oblivious Training Function as an Abstraction
Let the system handle the complexities
System takes care of:
▪ fixing parameters and launching the function
▪ launching trials (parametrized instantiations of the function), generating new trials, collecting and logging results
▪ setting up TF_CONFIG, wrapping the function in a Distribution Strategy, launching it on the workers, collecting results (illustrated in the sketch below)
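As an illustration of what "the system takes care of" might mean for the distributed case, here is a simplified launcher sketch; it is not Maggy's or Hopsworks' actual implementation, and the cluster configuration shown is a hypothetical placeholder.

import json
import os

import tensorflow as tf

# Simplified, illustrative launcher: the system injects the framework
# artifacts so the user's training function stays unchanged.
def launch_distributed(training_function, cluster_config, task_index):
    # ... setting up TF_CONFIG for this worker
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": cluster_config,
        "task": {"type": "worker", "index": task_index},
    })
    # ... wrapping the user code in a Distribution Strategy
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # ... launching the unchanged function and collecting its result
        return training_function()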
Maggy
Spark+AI Summit 2019
Today
With Hopsworks and Maggy, we provide a unified development and execution environment for distribution-transparent ML model development.
Make the Oblivious Training Function a core abstraction on Hopsworks
Hopsworks - Award-Winning Platform
Recap: Maggy - Asynchronous Trials on Spark
Spark is bulk-synchronous.
[Diagram: trials run as Spark tasks (Task 1-1…1-N, 2-1…2-N, 3-1…3-N) over HopsFS; the Driver only collects Metrics 1-3 at each stage barrier, so early-stopped trials sit idle until the barrier, leaving wasted compute]
Recap: The Solution - Add Communication and Long-Running Tasks
[Diagram: a single stage of long-running tasks (Task 1-1…1-N) that exchange messages with the Driver during execution, sending metrics and receiving new trials, with one final barrier]
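To make the pattern concrete, here is a purely illustrative sketch of long-running tasks reporting metrics to a driver-side controller that can stop trials early; it is not Maggy's actual RPC implementation, and all names are hypothetical.

# Illustrative only: the communication pattern behind long-running tasks.
class ExperimentController:
    """Driver side: collects metrics, logs them, and issues decisions."""

    def __init__(self, patience=3):
        self.best = float("inf")
        self.bad_steps = 0
        self.patience = patience

    def report(self, trial_id, step, metric):
        # Track the best metric seen; stop the trial after `patience` bad steps.
        if metric < self.best:
            self.best, self.bad_steps = metric, 0
        else:
            self.bad_steps += 1
        return "STOP" if self.bad_steps >= self.patience else "CONTINUE"

def long_running_task(controller, trial_id, train_step_fn, max_steps=1000):
    """Worker side: train, report after each step, and react to the driver."""
    for step in range(max_steps):
        metric = train_step_fn(step)
        if controller.report(trial_id, step, metric) == "STOP":
            break  # stop early instead of idling until a stage barrier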
What's New? Worker discovery and distribution context set-up
[Diagram: the Driver discovers the workers (Task 1-1…1-N) and then launches the oblivious training function on them in the chosen distribution context, ending at a barrier]
What’s New: Distribution Context
# Hyperparameter optimization context
sp = maggy.optimization.Searchspace(...)
maggy.set_context('optimization')
maggy.lagom(training_function, sp)

# Distributed training context
dist_strat = tf.distribute.MirroredStrategy(...)
maggy.set_context('distributed_training')
maggy.lagom(training_function, dist_strat)

# Ablation study context
ab = maggy.ablation.AblationStudy(...)
maggy.set_context('ablation')
maggy.lagom(training_function, ab)
DEMO
What’s Next
Extend the platform to provide a unified development and execution environment for distribution-transparent Jupyter Notebooks.
Summary
▪ Moving between distribution contexts requires code rewriting
▪ Factor out obtrusive framework artifacts
▪ Let the system handle the distribution context
▪ Keep productive high-level APIs
Thank You!
Get Started
hopsworks.ai
github.com/logicalclocks/maggy
Twitter
@morimeister
@jim_dowling
@logicalclocks
@hopsworks
Web
www.logicalclocks.com
Contributions from colleagues
▪ Sina Sheikholeslami
▪ Robin Andersson
▪ Alex Ormenisan
▪ Kai Jeggle
Thanks to the Logical Clocks Team!
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.