Distributed TensorFlow: Scaling Google's Deep Learning Library on Spark


@pentagoniac @arimoinc #SparkSummit #DDF arimo.com

Christopher Nguyen, PhD. | Chris Smith | Ushnish De | Vu Pham | Nanda Kishore


★ Nov 2015 (Strata NYC): Distributed DL on Spark + Tachyon

★ Feb 2016 (Spark Summit East): Distributed TensorFlow on Spark

★ June 2016 (Hadoop Summit): TensorFlow Ensembles on Spark


Outline

1. Motivation

2. Architecture

3. Experimental Parameters

4. Results

5. Lessons Learned

6. Conclusions


Motivation


1. Google TensorFlow

2. Spark Deployments


TensorFlow Review


Computational Graph
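For reference, a minimal sketch of a TensorFlow computational graph, using the graph-and-session API the library shipped with in early 2016: operations are declared symbolically as nodes, and nothing executes until a session runs them.

```python
# Minimal computational-graph sketch (early TensorFlow graph/session API).
import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")   # symbolic inputs
b = tf.placeholder(tf.float32, name="b")
c = tf.constant(3.0, name="c")

d = a * b + c   # builds mul and add nodes in the graph; nothing runs yet

with tf.Session() as sess:
    # Running node d evaluates only the ops it depends on.
    print(sess.run(d, feed_dict={a: 2.0, b: 4.0}))   # 11.0
```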


Graph of a Neural Network
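The same idea scales up to a neural network: each layer is a few matmul, add, and activation nodes in the graph, and the optimizer adds the gradient and update nodes. A hedged sketch of a one-hidden-layer classifier (the sizes are illustrative, not the exact models benchmarked later):

```python
import tensorflow as tf

# One-hidden-layer network expressed as a graph (illustrative sizes).
x = tf.placeholder(tf.float32, [None, 784], name="x")
y = tf.placeholder(tf.float32, [None, 10], name="labels")

W1 = tf.Variable(tf.random_normal([784, 1024], stddev=0.01))
b1 = tf.Variable(tf.zeros([1024]))
W2 = tf.Variable(tf.random_normal([1024, 10], stddev=0.01))
b2 = tf.Variable(tf.zeros([10]))

hidden = tf.nn.relu(tf.matmul(x, W1) + b1)     # hidden-layer nodes
logits = tf.matmul(hidden, W2) + b2            # output nodes
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```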


Why TensorFlow?

Google uses TensorFlow for the following:

★ Search ★ Advertising ★ Speech Recognition ★ Photos

★ Maps ★ StreetView (blurring out house numbers) ★ Translate ★ YouTube


DistBelief

Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012

1. Compute Gradients

2. Update Params (Descent)
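That is the DistBelief split this talk builds on: workers compute gradients on their own data shards, and a shared parameter server applies the descent step as gradients arrive. A minimal sketch of that division of labor in plain NumPy (class and function names are illustrative, not the paper's code):

```python
import numpy as np

class ParameterServer(object):
    """Holds the global parameters and applies gradients as they arrive."""
    def __init__(self, params, learning_rate=0.01):
        self.params = params
        self.lr = learning_rate

    def get_params(self):
        return self.params.copy()

    def apply_gradients(self, grads):
        # Step 2: update params (descent), once per worker push.
        self.params -= self.lr * grads

def worker_step(server, shard, compute_gradients):
    # Step 1: compute gradients on this worker's shard using a fresh
    # copy of the global parameters, then push them back.
    params = server.get_params()
    server.apply_gradients(compute_gradients(params, shard))
```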


Arimo's 3 Pillars

Big Apps (App Development):

★ Predictive Analytics

★ Natural Interfaces

★ Collaboration

Built on Big Data + Big Compute


Architecture


TensorSpark Architecture
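At a high level, the flow is: the driver hosts a parameter server, and each Spark partition runs a local TensorFlow replica that pulls the current parameters, computes gradients on its rows, and pushes the gradients back. A hedged sketch of that loop; the /params and /gradients HTTP endpoints, the toy least-squares gradient, and the requests dependency are assumptions for illustration, not the actual TensorSpark code:

```python
import numpy as np
import requests                      # assumed available on the executors
from pyspark import SparkContext

PARAM_SERVER = "http://driver-host:8000"   # illustrative address

def train_partition(rows):
    rows = [np.array([float(v) for v in r.split(",")]) for r in rows]
    for start in range(0, len(rows), 128):            # mini-batches of 128
        batch = np.array(rows[start:start + 128])
        x, y = batch[:, :-1], batch[:, -1]
        # 1. Pull the latest parameters from the driver-side server.
        params = np.array(requests.get(PARAM_SERVER + "/params").json())
        # 2. Compute gradients locally (TensorSpark runs a TensorFlow
        #    model here; a least-squares gradient stands in for it).
        grads = 2.0 * x.T.dot(x.dot(params) - y) / len(batch)
        # 3. Push the gradients; the server applies the descent step.
        requests.post(PARAM_SERVER + "/gradients", json=grads.tolist())
    return []

sc = SparkContext(appName="tensorspark-sketch")
sc.textFile("hdfs:///path/to/data.csv") \
  .mapPartitions(train_partition) \
  .count()                                  # forces the training pass
```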


Spark & Tachyon Architectural Options

★ Spark-Only: model as broadcast variable

★ Tachyon-Storage: model stored as Tachyon file

★ Param-Server: model hosted in HTTP server

★ Tachyon-CoProc: model stored and updated by Tachyon
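For contrast, a hedged sketch of the Spark-only option from the list above: the model is re-broadcast to the executors each iteration and the gradients are reduced back on the driver, so updates are synchronous (again with a toy least-squares gradient standing in for a TensorFlow model):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-model-sketch")
data = sc.textFile("hdfs:///path/to/data.csv") \
         .map(lambda line: np.array([float(v) for v in line.split(",")])) \
         .cache()

n = data.count()
n_features = len(data.first()) - 1        # last column is the label
weights = np.zeros(n_features)
lr = 0.01

for step in range(100):
    # Ship the current model to every executor as a broadcast variable.
    w = sc.broadcast(weights)
    grad_sum = data.map(
        lambda row: 2.0 * (row[:-1].dot(w.value) - row[-1]) * row[:-1]
    ).reduce(lambda g1, g2: g1 + g2)
    # Synchronous descent step on the driver; re-broadcast next iteration.
    weights = weights - lr * grad_sum / n
    w.unpersist()
```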


Experimental Parameters


Datasets

Data Set  | Size (rows x features)    | Parameters | Hidden Layers | Hidden Units
MNIST     | 50K x 784 = 39.2M values  | 1.8M       | 2             | 1024
Higgs     | 10M x 28 = 280M values    | 8.5M       | 3             | 2048
Molecular | 150K x 2871 = 430M values | 14.2M      | 3             | 2048


Experimental Dimensions

1. Dataset Size

2. Model Size

3. Compute Size

4. CPU vs. GPU


Results


[Results charts (not transcribed): training performance on MNIST, Higgs, and Molecular]


Lessons Learned


GPU vs CPU

★ On a single machine, GPU is 8-10x faster than CPU

★ Distributed on Spark, GPU is 2-3x faster, with larger gains on larger datasets

★ GPU memory is limited; hitting the limit hurts performance

★ AWS offers older GPUs, so TensorFlow must be built from source


Distributed Gradients

★ An initial warm-up phase on the parameter server is needed

★ Convergence can still "wobble" because of asynchronous mini-batch updates; mitigations:

✦ Drop "obsolete" (stale) gradients, as in the sketch after this list

✦ Reduce the learning rate and the scale of the initial parameters

✦ Reduce the mini-batch size

★ Work to maximize network throughput, e.g., model compression
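One way to drop "obsolete" gradients, as noted above: version the parameters, have each worker report the version it pulled, and let the parameter server discard pushes that lag too far behind. A hedged sketch; the staleness threshold and names are illustrative:

```python
import numpy as np

class StaleAwareParameterServer(object):
    """Discards gradients computed against parameters that are too old."""
    def __init__(self, params, learning_rate=0.01, max_staleness=4):
        self.params = params
        self.lr = learning_rate
        self.max_staleness = max_staleness
        self.version = 0

    def get_params(self):
        # Each worker remembers the version it pulled.
        return self.version, self.params.copy()

    def apply_gradients(self, worker_version, grads):
        if self.version - worker_version > self.max_staleness:
            return False                  # "obsolete" gradient: drop it
        self.params -= self.lr * np.asarray(grads)
        self.version += 1
        return True
```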


Deployment Configuration

★ Run only one TensorFlow instance per machine

★ Compress gradients/parameters (e.g., send them as binary rather than text) to reduce network overhead; see the sketch after this list

★ Batch-size tradeoff: larger mini-batches mean fewer parameter exchanges (less communication overhead) but can hurt convergence
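On the compression point above: sending gradients and parameters as raw binary float buffers instead of text already shrinks the payload substantially, before any real compression. A quick sketch of the difference with NumPy (array size chosen for illustration):

```python
import io
import json
import numpy as np

grads = np.random.randn(1000000).astype(np.float32)   # ~1M values for illustration

# Text encoding: every float becomes a decimal string.
text_payload = json.dumps(grads.tolist())

# Binary encoding: 4 bytes per float32 plus a small header.
buf = io.BytesIO()
np.save(buf, grads)
binary_payload = buf.getvalue()

print(len(text_payload), len(binary_payload))
# Decode on the other side with np.load(io.BytesIO(binary_payload));
# gzip on top shrinks the payload further at some CPU cost.
```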


Conclusions


★ This implementation works best for large datasets and small models

✦ For large models, we're looking into a) model parallelism and b) model compression

★ If the data fits on a single machine, a single-machine GPU is best (if you have one)

★ For large datasets, leverage both speed and scale using a Spark cluster with GPUs

★ Future work: a) Tachyon, b) Ensembles (Hadoop Summit, June 2016)


Top 10 World's Most Innovative Companies in Data Science

We're open sourcing our work at https://github.com/adatao/tensorspark

Visit us at Booth K8
