Distributed TensorFlow: Scaling Google's Deep Learning Library on Spark


@pentagoniac @arimoinc #SparkSummit #DDF arimo.com

Christopher Nguyen, PhD. | Chris Smith | Ushnish De | Vu Pham | Nanda Kishore


★ Nov 2015 (Strata NYC): Distributed DL on Spark + Tachyon

★ Feb 2016 (Spark Summit East): Distributed TensorFlow on Spark

★ June 2016 (Hadoop Summit): TensorFlow Ensembles on Spark


Outline

1. Motivation

2. Architecture

3. Experimental Parameters

4. Results

5. Lessons Learned

6. Conclusions


Motivation


1. Google TensorFlow

2. Spark Deployments


TensorFlow Review


Computational Graph
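For reference, a minimal sketch of a TensorFlow computational graph, using the graph-and-session API the library shipped with in early 2016: operations are declared symbolically as nodes, and nothing executes until a session runs them.

```python
# Minimal computational-graph sketch (early TensorFlow graph/session API).
import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")   # symbolic inputs
b = tf.placeholder(tf.float32, name="b")
c = tf.constant(3.0, name="c")

d = a * b + c   # builds mul and add nodes in the graph; nothing runs yet

with tf.Session() as sess:
    # Running node d evaluates only the ops it depends on.
    print(sess.run(d, feed_dict={a: 2.0, b: 4.0}))   # 11.0
```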


Graph of a Neural Network
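The same idea scales up to a neural network: each layer is a few matmul, add, and activation nodes in the graph, and the optimizer adds the gradient and update nodes. A hedged sketch of a one-hidden-layer classifier (the sizes are illustrative, not the exact models benchmarked later):

```python
import tensorflow as tf

# One-hidden-layer network expressed as a graph (illustrative sizes).
x = tf.placeholder(tf.float32, [None, 784], name="x")
y = tf.placeholder(tf.float32, [None, 10], name="labels")

W1 = tf.Variable(tf.random_normal([784, 1024], stddev=0.01))
b1 = tf.Variable(tf.zeros([1024]))
W2 = tf.Variable(tf.random_normal([1024, 10], stddev=0.01))
b2 = tf.Variable(tf.zeros([10]))

hidden = tf.nn.relu(tf.matmul(x, W1) + b1)     # hidden-layer nodes
logits = tf.matmul(hidden, W2) + b2            # output nodes
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```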


Why TensorFlow?

Google uses TensorFlow for the following:

★ Search ★ Advertising ★ Speech Recognition ★ Photos

★ Maps ★ StreetView (blurring out house numbers) ★ Translate ★ YouTube


DistBelief

Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012

1. Compute Gradients

2. Update Params (Descent)
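That is the DistBelief split this talk builds on: workers compute gradients on their own data shards, and a shared parameter server applies the descent step as gradients arrive. A minimal sketch of that division of labor in plain NumPy (class and function names are illustrative, not the paper's code):

```python
import numpy as np

class ParameterServer(object):
    """Holds the global parameters and applies gradients as they arrive."""
    def __init__(self, params, learning_rate=0.01):
        self.params = params
        self.lr = learning_rate

    def get_params(self):
        return self.params.copy()

    def apply_gradients(self, grads):
        # Step 2: update params (descent), once per worker push.
        self.params -= self.lr * grads

def worker_step(server, shard, compute_gradients):
    # Step 1: compute gradients on this worker's shard using a fresh
    # copy of the global parameters, then push them back.
    params = server.get_params()
    server.apply_gradients(compute_gradients(params, shard))
```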


Arimo's 3 Pillars

Big Apps (App Development):

★ Predictive Analytics

★ Natural Interfaces

★ Collaboration

Built on Big Data + Big Compute


Architecture


TensorSpark Architecture
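At a high level, the flow is: the driver hosts a parameter server, and each Spark partition runs a local TensorFlow replica that pulls the current parameters, computes gradients on its rows, and pushes the gradients back. A hedged sketch of that loop; the /params and /gradients HTTP endpoints, the toy least-squares gradient, and the requests dependency are assumptions for illustration, not the actual TensorSpark code:

```python
import numpy as np
import requests                      # assumed available on the executors
from pyspark import SparkContext

PARAM_SERVER = "http://driver-host:8000"   # illustrative address

def train_partition(rows):
    rows = [np.array([float(v) for v in r.split(",")]) for r in rows]
    for start in range(0, len(rows), 128):            # mini-batches of 128
        batch = np.array(rows[start:start + 128])
        x, y = batch[:, :-1], batch[:, -1]
        # 1. Pull the latest parameters from the driver-side server.
        params = np.array(requests.get(PARAM_SERVER + "/params").json())
        # 2. Compute gradients locally (TensorSpark runs a TensorFlow
        #    model here; a least-squares gradient stands in for it).
        grads = 2.0 * x.T.dot(x.dot(params) - y) / len(batch)
        # 3. Push the gradients; the server applies the descent step.
        requests.post(PARAM_SERVER + "/gradients", json=grads.tolist())
    return []

sc = SparkContext(appName="tensorspark-sketch")
sc.textFile("hdfs:///path/to/data.csv") \
  .mapPartitions(train_partition) \
  .count()                                  # forces the training pass
```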


Spark & Tachyon Architectural Options

★ Spark-Only: model as broadcast variable

★ Tachyon-Storage: model stored as Tachyon file

★ Param-Server: model hosted in HTTP server

★ Tachyon-CoProc: model stored and updated by Tachyon
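For contrast, a hedged sketch of the Spark-only option from the list above: the model is re-broadcast to the executors each iteration and the gradients are reduced back on the driver, so updates are synchronous (again with a toy least-squares gradient standing in for a TensorFlow model):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-model-sketch")
data = sc.textFile("hdfs:///path/to/data.csv") \
         .map(lambda line: np.array([float(v) for v in line.split(",")])) \
         .cache()

n = data.count()
n_features = len(data.first()) - 1        # last column is the label
weights = np.zeros(n_features)
lr = 0.01

for step in range(100):
    # Ship the current model to every executor as a broadcast variable.
    w = sc.broadcast(weights)
    grad_sum = data.map(
        lambda row: 2.0 * (row[:-1].dot(w.value) - row[-1]) * row[:-1]
    ).reduce(lambda g1, g2: g1 + g2)
    # Synchronous descent step on the driver; re-broadcast next iteration.
    weights = weights - lr * grad_sum / n
    w.unpersist()
```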


Experimental Parameters


Datasets

Data Set  | Size (rows x features)    | Parameters | Hidden Layers | Hidden Units
MNIST     | 50K x 784 = 39.2M values  | 1.8M       | 2             | 1024
Higgs     | 10M x 28 = 280M values    | 8.5M       | 3             | 2048
Molecular | 150K x 2871 = 430M values | 14.2M      | 3             | 2048


Experimental Dimensions

1. Dataset Size

2. Model Size

3. Compute Size

4. CPU vs. GPU


Results


[Results charts (not transcribed): training performance on MNIST, Higgs, and Molecular]


Lessons Learned


GPU vs CPU

★ On a single machine, GPU is 8-10x faster than CPU

★ Distributed on Spark, GPU is 2-3x faster, with larger gains on larger datasets

★ GPU memory is limited; hitting the limit hurts performance

★ AWS offers older GPUs, so TensorFlow must be built from source


Distributed Gradients

★ An initial warm-up phase on the parameter server is needed

★ Convergence can still "wobble" because of asynchronous mini-batch updates; mitigations:

✦ Drop "obsolete" (stale) gradients, as in the sketch after this list

✦ Reduce the learning rate and the scale of the initial parameters

✦ Reduce the mini-batch size

★ Work to maximize network throughput, e.g., model compression
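One way to drop "obsolete" gradients, as noted above: version the parameters, have each worker report the version it pulled, and let the parameter server discard pushes that lag too far behind. A hedged sketch; the staleness threshold and names are illustrative:

```python
import numpy as np

class StaleAwareParameterServer(object):
    """Discards gradients computed against parameters that are too old."""
    def __init__(self, params, learning_rate=0.01, max_staleness=4):
        self.params = params
        self.lr = learning_rate
        self.max_staleness = max_staleness
        self.version = 0

    def get_params(self):
        # Each worker remembers the version it pulled.
        return self.version, self.params.copy()

    def apply_gradients(self, worker_version, grads):
        if self.version - worker_version > self.max_staleness:
            return False                  # "obsolete" gradient: drop it
        self.params -= self.lr * np.asarray(grads)
        self.version += 1
        return True
```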


Deployment Configuration

★ Run only one TensorFlow instance per machine

★ Compress gradients/parameters (e.g., send them as binary rather than text) to reduce network overhead; see the sketch after this list

★ Batch-size tradeoff: larger mini-batches mean fewer parameter exchanges (less communication overhead) but can hurt convergence
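On the compression point above: sending gradients and parameters as raw binary float buffers instead of text already shrinks the payload substantially, before any real compression. A quick sketch of the difference with NumPy (array size chosen for illustration):

```python
import io
import json
import numpy as np

grads = np.random.randn(1000000).astype(np.float32)   # ~1M values for illustration

# Text encoding: every float becomes a decimal string.
text_payload = json.dumps(grads.tolist())

# Binary encoding: 4 bytes per float32 plus a small header.
buf = io.BytesIO()
np.save(buf, grads)
binary_payload = buf.getvalue()

print(len(text_payload), len(binary_payload))
# Decode on the other side with np.load(io.BytesIO(binary_payload));
# gzip on top shrinks the payload further at some CPU cost.
```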


Conclusions


★ This implementation works best for large datasets and small models

✦ For large models, we're looking into a) model parallelism and b) model compression

★ If the data fits on a single machine, a single-machine GPU is best (if you have one)

★ For large datasets, leverage both speed and scale using a Spark cluster with GPUs

★ Future work: a) Tachyon, b) Ensembles (Hadoop Summit, June 2016)


Top 10 World's Most Innovative Companies in Data Science

We're open sourcing our work at https://github.com/adatao/tensorspark

Visit us at Booth K8
