Page 1: Inferno Scalable Deep Learning on Spark

Inferno: Scalable Deep Learning on Spark

Matthias Langer
Dr. Zhen He
Prof. Wenny Rahayu

Department of Computer Science & Computer Engineering

Page 2: Inferno Scalable Deep Learning on Spark

Topics

• Deep Learning – Introduction

• Spark & Deep Learning

• Our solution: La Trobe University’s Deep Learning System

• Conclusion, Timeline, Q&A

Page 3: Inferno Scalable Deep Learning on Spark

Deep Learning: Introduction

Page 4: Inferno Scalable Deep Learning on Spark

Source: CerCo (Brain and Cognition Research Centre), Toulouse

Page 5: Inferno Scalable Deep Learning on Spark

Object/Action Recognition

• Automatic Captioning

• Navigating Artificial Agents

• Deep Learning performs 100% better than the best non-deep learning algorithms in many Computer Vision tasks!

Source: Research @ Facebook (left), google.com/selfdrivingcar (right)

Page 6: Inferno Scalable Deep Learning on Spark

Voice Recognition

• Deep Learning performs 30% better than the best non-deep learning algorithms!

Page 7: Inferno Scalable Deep Learning on Spark

Natural Language Processing

• Translation

• Thought Vector Q&A

• …

• Deep Learning tends to perform “better” than traditional machine learning algorithms!

Source: Google Inc. / Google Translate

Page 8: Inferno Scalable Deep Learning on Spark

Source: GoogleBrain; Google, Inc.

Page 9: Inferno Scalable Deep Learning on Spark

Spark & DL: How they could be an ideal tandem, but there are challenges…

Page 10: Inferno Scalable Deep Learning on Spark

Why do you want to use a cluster to train Deep Neural Networks?

This model took about 22 days to train.

I trained 50x from scratch until I found hyper-parameters that work well.

Deep Learning is SLOW

Page 11: Inferno Scalable Deep Learning on Spark

Two approaches to speed up DL

Scaling Up

• Superior scaling until fundamental limits of the hardware are reached:
  Max. number of PCIe lanes
  Max. read speed of HDD
  Costs scale up non-linearly (DGX-1 = $129,000)

Scaling Out

• Highly scalable

• No relevant hardware limits

• Extensible

[Diagram: a Master node with Worker 1, Worker 2 and Worker 3]

Source: https://developer.nvidia.com/devbox

Page 12: Inferno Scalable Deep Learning on Spark

You already have all your valuable data in Spark/Hadoop

DL (often) requires a lot of data to train

Need a lot of memory

Pre-processing has massive I/O requirements (disk & network)

More reasons why you would want to use Hadoop/Spark for DL?

Page 13: Inferno Scalable Deep Learning on Spark

How could you implement DL on Spark?

• Draw mini-batch

• Map: Compute updated model in each worker

• Reduce: Assemble into “better” model via Master node

• Broadcast “better” model and repeat

[Diagram: a Spark RDD supplies mini-batches of data to Worker 1, Worker 2 and Worker 3; each worker and the Master hold a copy of the model 𝑏2𝑥2+𝑏3𝑥3+…]
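The draw / map / reduce / broadcast loop on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: plain Scala collections stand in for the Spark RDD and the workers, the model is a single weight of a toy least-squares problem, and `localUpdate`, `average` and `train` are hypothetical names, not Inferno or Spark APIs.

```scala
// Sketch only: plain Scala collections stand in for a Spark RDD and its workers.
object SyncAveragingSketch {
  type Model = Vector[Double]

  // Hypothetical local update: one gradient step for the 1-D least-squares
  // model y = w * x, computed on a single worker's mini-batch.
  def localUpdate(model: Model, batch: Seq[(Double, Double)], lr: Double): Model = {
    val w = model(0)
    val grad = batch.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / batch.size
    Vector(w - lr * grad)
  }

  // Reduce step: element-wise average of the workers' models.
  def average(models: Seq[Model]): Model =
    models.transpose.map(ws => ws.sum / ws.size).toVector

  // One entry of `partitions` plays the role of one worker's mini-batch.
  def train(partitions: Seq[Seq[(Double, Double)]], rounds: Int, lr: Double): Model = {
    var model: Model = Vector(0.0) // initial model held by the master
    for (_ <- 1 to rounds) {
      val updated = partitions.map(p => localUpdate(model, p, lr)) // map
      model = average(updated) // reduce; the next round "broadcasts" this model
    }
    model
  }
}
```

With three mini-batches drawn from y = 3x, for example `Seq(Seq((1.0, 3.0), (2.0, 6.0)), Seq((3.0, 9.0), (1.0, 3.0)), Seq((2.0, 6.0), (2.0, 6.0)))`, calling `train(..., rounds = 50, lr = 0.05)` moves the single weight close to 3. Every round waits for all workers before averaging, which is the synchronous behaviour the following slides identify as a problem.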

Page 14: Inferno Scalable Deep Learning on Spark

Problem 1: Big Parameters = High shuffle cost!

• Compute updated models (typically 50 – 500 ms)

• Reduce models (at best 5 s over 1 GbE)

• Broadcast combined model (at best 5 s over 1 GbE)

Time split per iteration: Compute 5%, Communication 95%

[Diagram: Worker 1, Worker 2, Worker 3 and the Master each hold a 500 MB model 𝑏2𝑥2+𝑏3𝑥3+…]
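The 5% / 95% split can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 MB = 8 Mbit, 1 GbE = 1000 Mbit/s) and a compute time at the upper end of the quoted range; the helper names are made up for this illustration.

```scala
// Sketch only: rough arithmetic for one synchronous iteration.
object ShuffleCostSketch {
  // Ideal transfer time in seconds for `megabytes` over a link of `gigabitsPerSecond`,
  // ignoring protocol overhead and serialization.
  def transferSeconds(megabytes: Double, gigabitsPerSecond: Double): Double =
    megabytes * 8.0 / (gigabitsPerSecond * 1000.0)

  // Fraction of one iteration that is spent computing rather than communicating.
  def computeShare(computeSeconds: Double, reduceSeconds: Double, broadcastSeconds: Double): Double =
    computeSeconds / (computeSeconds + reduceSeconds + broadcastSeconds)
}
```

`transferSeconds(500, 1)` gives 4 s at raw line rate, so the slide's "at best 5 s" presumably leaves room for overhead (that reading is an assumption). With 0.5 s of compute and 5 s each for reduce and broadcast, `computeShare(0.5, 5, 5)` is roughly 0.048, which matches the 5% on the slide.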

Page 15: Inferno Scalable Deep Learning on Spark

Problem 2: Node communication is synchronous

[Diagram: Worker 1, Worker 2 and Worker 3 each exchange the model 𝑏2𝑥2+𝑏3𝑥3+… with the Master; the diagram is annotated “Bottleneck!”]

Page 16: Inferno Scalable Deep Learning on Spark

La Trobe University DL-System

Single Machine

• Blaze: Scala based standalone deep learning system

• CUBlaze: GPU acceleration for Blaze

Cluster

• Inferno: Coordinates distributed computation of Blaze models in synchronous Spark environment

Page 17: Inferno Scalable Deep Learning on Spark

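A (probably biased) comparison

Systems compared: Inferno, SparkNet (Caffe), CaffeOnSpark, deeplearning4j, H2O

Rows with text values:

• Communication protocol during training: Inferno = Spark MR; SparkNet (Caffe) = Spark MR; CaffeOnSpark = MPI/RDMA; deeplearning4j = Spark MR among others; H2O = Grpc/MPI/RDMA

• Model description language: Inferno = JVM Code; SparkNet (Caffe) = Config File; CaffeOnSpark = Config File; deeplearning4j = JVM Code; H2O = multiple

Rows shown as graphical marks on the slide (marks not recoverable; surviving text entries in parentheses):

• ConvNets, AutoEncoders, etc. (one entry: “planned”)

• Build complex models (e.g. ResNet) (one entry: “some”)

• Dynamic branching support (path altering / dropping)

• Pluggable preprocessing pipeline (one entry: “partial”)

• Pluggable update policies for hyper parameters

• Pluggable & visualizable online cross validation

• Entire execution path determined in single runtime environment

• GPU acceleration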

Page 18: Inferno Scalable Deep Learning on Spark

Blaze

CUBlaze

Inferno

Blaze: High-Performance Deep Learning Engine

Page 19: Inferno Scalable Deep Learning on Spark

Module Library

• Standard Modules
Add-Bias (C/U/S/B), Immediate-Filter (C/U/S/B), Convolution, Convolution-Decoder, Linear, Linear-Decoder, Locally-Connected, Locally-Connected-Decoder, L2-Pooling, Max-Pooling, Mean-Pooling, Batch-Normalization, Dropout, LCN, LRN, Normalization (C/U/S/B), Reshape, Weight-Decay (L1/L2)

• Nonlinearities
Abs, Add-Noise, ELU, Exp, Hard-Tanh, LeakyReLU, Ln, Pow, PReLU, ReLU, ReQU, (Log-)Sigmoid, SmoothAbs, (Log-)Softmax, SoftPlus, Sq, Sqrt, SReLU, Tanh

• Optimizers
AdaDelta, AdaGrad, Adam, ConjugateGradientDescent, Rprop, RMSProp, SGD (traditional, local learning rates, momentum)

• Constraints (can inject everywhere!)
BCE, ClassLL, ClassNLL, KLDivergence, MSE

• Containers
Sequence, Auto-Encoder, Branch (Parallel)

• Branching
Alternate-Path, Drop-Path, Random-Path

• Tensor Table Operations
Select, Concatenate (C/U/S/B), Merge (add/mean/lerp)

• Visualization & Benchmarking
Benchmark-Wrapper, Visualize-Histogram, Visualize-MeanAndStdDev (C/U/S/B)

C/U/S/B = These operations can be applied either on [C]hannel, [U]nit, [S]ample or [B]atch level.

Page 20: Inferno Scalable Deep Learning on Spark

Performance – AlexNet OWT

System                      forward (ms)   backward (ms)
Torch (CuDNN)               27             53
TensorFlow                  26             55
CUBlaze (1 GB WS Limit)     37             56
Torch (fbfft)               31             72
cudaconvnet2                42             135
Caffe (native)              121            203
Torch-7 (native)            132            210

All benchmarks done using NVIDIA TitanX GPUs on comparable setups; Source: https://github.com/soumith/convnet-benchmarks

Page 21: Inferno Scalable Deep Learning on Spark

Performance – VGG A

System                      forward (ms)   backward (ms)
Torch (CuDNN)               162            331
TensorFlow                  158            382
CUBlaze (1 GB WS Limit)     167            378
Torch (fbfft)               355            737
cudaconvnet2                408            821
Caffe (native)              323            745
Torch-7 (native)            350            755

All benchmarks done using NVIDIA TitanX GPUs on comparable setups; Source: https://github.com/soumith/convnet-benchmarks

Page 22: Inferno Scalable Deep Learning on Spark

How Blaze works (example)

[Diagram components: Data Source (HDD, SparkRDD, HDFS); Prefetcher; Augmenter; Model (fprop only) with Weights (fixed); Sample Merger; Cached Samples; Optimizer; Model with Weights (tunable); Hyper Param.; Objectives; ScopeDelimiter; outputs to Terminal, File, Showoff, etc.]

Page 23: Inferno Scalable Deep Learning on Spark

Easy Setup: Model

• Blaze automatically infers most layer parameters based on the actual input

• Usually no need to specify input and output dimensions or whether to use CPU or GPU

val noClasses = 100

// Kernels
val kernelConv1 = Kernel2D(dims = (11, 11), stride = (4, 4), padding = (2, 2))
val kernelConv2 = Kernel2D.centered((3, 3))
val kernelPool = Kernel2D((3, 3), (2, 2))

// Layers
val bias = AddBiasBuilder()
val relu = ReLUBuilder()
val lrn = LateralResponseNormalizationBuilder(n = 5, k = 2, alpha = 1e-4f, beta = 0.75f)
val pool = MaxPoolingBuilder(kernelPool)

// Lego!
val mb = SequenceBuilder(
  ConvolutionFilterBuilder(kernelConv1, 48), bias, relu, pool, lrn,
  ConvolutionFilterBuilder(kernelConv2, 192), bias, relu,
  ConvolutionFilterBuilder(kernelConv2, 128), bias, relu, pool,
  ReshapeBuilder.collapseDimensions(),
  LinearBuilder(noClasses), bias,
  SoftmaxBuilder(),
  ClassLLConstraintBuilder()
)

Page 24: Inferno Scalable Deep Learning on Spark

Easy Setup: CPU and GPU

• Blaze maintains a variant table for each module type.

• When you “build” an instance of a module, all variants are scored and the “best” variant for the current situation is selected automatically. You can configure what “best” means.

// Input data
val data = Array[Batch](...)

// Inspect batches
val hints = BuildHints.derive(data)

// Build compatible model
val m = mb.build(hints)

19:25:20 INFO Scoring ConvolutionFilter[Kernel2[(3, 3), (1, 1)] x 2, 0/1 = filter]:
19:25:20 DEBUG 0000800a => CUDA_CUDNN, preferred, input type matches
19:25:20 DEBUG 0000400a => JVM_BLAS_IMPLICITMM, preferred
19:25:20 DEBUG 00000004 => JVM_BLAS_MM
19:25:20 DEBUG 0000000a => JVM_BREEZE_MM, preferred
19:25:20 DEBUG 00000002 => JVM_BREEZE_SPARSEMM
19:25:20 INFO CUDA_CUDNN selected!
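The selection shown in the log can be mimicked with a tiny sketch. It assumes that the eight-digit values are hexadecimal scores and that the highest score wins, which is consistent with the log but not confirmed by the slides; `Variant` and `selectBest` are hypothetical names, not Blaze's actual types.

```scala
// Sketch only: pick the highest-scoring variant from a variant table.
object VariantSelectionSketch {
  // Hypothetical stand-in for one entry of a module's variant table.
  final case class Variant(name: String, score: Int)

  // Names and scores copied from the log above; scores parsed as hexadecimal.
  val variants: Seq[Variant] = Seq(
    "CUDA_CUDNN"          -> "0000800a",
    "JVM_BLAS_IMPLICITMM" -> "0000400a",
    "JVM_BLAS_MM"         -> "00000004",
    "JVM_BREEZE_MM"       -> "0000000a",
    "JVM_BREEZE_SPARSEMM" -> "00000002"
  ).map { case (name, hex) => Variant(name, Integer.parseInt(hex, 16)) }

  // Assumed rule: the variant with the largest score is "best".
  def selectBest(candidates: Seq[Variant]): Variant = candidates.maxBy(_.score)
}
```

Under these assumptions `selectBest(variants)` returns the `CUDA_CUDNN` entry, matching the last log line. Removing that entry, as on a machine without a GPU, would make `JVM_BLAS_IMPLICITMM` the winner.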

Page 25: Inferno Scalable Deep Learning on Spark

Working with large models!

val mb = SequenceBuilder(...)
val hints = ...
val g = mb.toGraph(hints)
SvgRenderer.render(g)

Page 26: Inferno Scalable Deep Learning on Spark

Visualizing pre-processing pipelines

val apb = AsynchronousPrefetcherBuilder(...)
val g = apb.toGraph()
SvgRenderer.render(g)

Page 27: Inferno Scalable Deep Learning on Spark

Easy Setup: Optimizer

val ob = MomentumBuilder()

// Configure Hyper-Parameters
ob.learningRate = DiscreteStepsBuilder(
  0 -> 1e-2f,
  40000 -> 1e-3f,
  80000 -> 1e-4f
)

// Setup Objectives
ob.objectives +=
  IterationCountLimitBuilder(1000) +=
  CrossValidationBuilder(dataSource, ... preprocessing pipeline ...) +=
  PrintStatusBuilder() >> FileSinkBuilder(HadoopFileHandle.userHome ++ "results/optimization.log") +=
  objectives.Presets.visualizePerformance() >> ShowoffSinkBuilder("Cross Validation Performance")

// Add more advanced stuff like Regularizers...

// Go!
val o = ob.build(m, dataSource)
o.run()
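The learning-rate configuration above reads as a step schedule keyed by iteration number. A minimal sketch of that idea follows, assuming each step stays in effect until the next threshold is reached; `rateAt` is a hypothetical helper for illustration and is not how `DiscreteStepsBuilder` is implemented.

```scala
// Sketch only: a piecewise-constant learning-rate schedule.
object StepScheduleSketch {
  // Steps from the slide: (first iteration at which the rate applies, rate).
  val schedule: Seq[(Long, Float)] = Seq(0L -> 1e-2f, 40000L -> 1e-3f, 80000L -> 1e-4f)

  // Returns the rate of the last step whose start iteration has been reached.
  // Iterations before the first step fall back to the first step's rate.
  def rateAt(steps: Seq[(Long, Float)], iteration: Long): Float = {
    require(steps.nonEmpty, "schedule needs at least one step")
    val sorted = steps.sortBy(_._1)
    sorted.takeWhile(_._1 <= iteration).lastOption.getOrElse(sorted.head)._2
  }
}
```

For example, `rateAt(schedule, 39999L)` still yields `1e-2f`, while `rateAt(schedule, 40000L)` yields `1e-3f`.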

Page 28: Inferno Scalable Deep Learning on Spark

Other Features

• Tensor Memory Management
  Automatically monitors the dependencies between all tensors
  Reallocates space occupied by unneeded tensors on the fly
  Will automatically toggle “inPlace” processing when it is safe

• Intermediate results are stored separately from the model
  Forward passes yield backpropagation contexts that can be consumed or discarded at any time. Very interesting property for:
    Live Query/Training
    Fancy Optimizers
    Hyper Parameter Search

Saves up to 40% GPU memory during training!

Page 29: Inferno Scalable Deep Learning on Spark

Blaze

CUBlaze

Inferno

Inferno: Training Deep Learning Models faster with Apache Spark

Page 30: Inferno Scalable Deep Learning on Spark

Starting an Inferno cluster

[Diagram components: SparkConf; Spark Context; ClusterCoordinator; steps Claim, Assess, Tailor; Load Plugins (e.g. CUBlaze); FileRDD (Load hdfs://…); SampleDataRDD (Create Samples); Cluster with Master, Worker 1, Worker 2 and Worker 3]

[Chart: Loading meta-data of HDFS files, Spark BinaryRDD vs. Inferno FileRDD; bar labels: 689 s, 6 s, 9999 s, 35 s]

Page 31: Inferno Scalable Deep Learning on Spark

Distributed Optimizer

[Diagram components: InfernoOptimizer with build(), run() and cache() calls; ClusterCoordinator; SampleDataRDD; Pre-processing Pipeline; Blaze Model; Blaze Optimizer; Weights; Cluster with Master, Worker 1, Worker 2 and Worker 3]

The diagram shows two sets of Hyper Param., Objectives and ScopeDelimiter, captioned:

• Applied with cluster wide focus.

• Applied independently in each worker.

Page 32: Inferno Scalable Deep Learning on Spark

Performance: ResNet 34 on ImageNet

Blaze (2 x 8 core Xeon CPU + 1 x NVIDIA TitanX): 2 hours, 42 minutes

Inferno over 1 GbE (8 x 8 core Xeon CPU + 4 x NVIDIA TitanX): 57 minutes

Reached 20% Top1 accuracy 2.84 times faster!
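The 2.84× figure follows from the two wall-clock times, assuming the 57-minute run is the four-GPU Inferno cluster and the 2 h 42 min run is single-GPU Blaze. The sketch below redoes that arithmetic; the per-GPU efficiency is a derived metric added here for illustration and does not appear on the slide.

```scala
// Sketch only: speedup and per-GPU efficiency from wall-clock times.
object SpeedupSketch {
  def minutes(hours: Int, mins: Int): Double = hours * 60.0 + mins

  // How many times faster the accelerated run reached the same accuracy.
  def speedup(baselineMinutes: Double, acceleratedMinutes: Double): Double =
    baselineMinutes / acceleratedMinutes

  // Speedup divided by the number of GPUs used (1.0 would be perfect scaling).
  def efficiency(speedupFactor: Double, gpus: Int): Double = speedupFactor / gpus
}
```

`speedup(minutes(2, 42), minutes(0, 57))` is 162 / 57, roughly 2.84, and `efficiency` for four GPUs comes out at about 0.71.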

Page 33: Inferno Scalable Deep Learning on Spark

Performance: PreAct ResNet 152 on ImageNet

[Chart: accuracy (0% to 80%) over training time (0 h to 50 h). Series: 1x TitanX - Top 1 Accuracy; 1x TitanX - Top 5 Accuracy; Inferno Cluster (5x TitanX, 1 GbE) - Top 1 Accuracy; Inferno Cluster (5x TitanX, 1 GbE) - Top 5 Accuracy]

Reached 30% Top1 accuracy 4.81 times faster using 5 GPUs!** vs.

Page 34: Inferno Scalable Deep Learning on Spark

Conclusion

• Blaze & CUBlaze
  Fast
  Huge extensible module library
  Easy to use

• Inferno
  Allows you to accelerate Blaze DL tasks on Spark
  Uses Spark MR methods for all data transmissions:
    Can run rather nicely along with other Spark jobs.
    Can be used without high-speed / low latency equipment (usually required to make RDMA solutions perform well)
    Plain old (and even slow) Ethernet is enough!

* Note that using “Showoff” to visualize progress may open separate HTTP connections to the Showoff-Server.

Page 35: Inferno Scalable Deep Learning on Spark

Where can I get it?

• Blaze & CUBlaze & Example Code
  Stable; we have been training models with it for months. A snapshot of the current stable release is available at:
  https://github.com/bashimao/ltudl (Apache License 2.0)

• Showoff
  Multi-purpose live visualization system developed by Aiden Nibali (La Trobe University):
  https://github.com/anibali/showoff

• Inferno
  I am writing a paper about Inferno’s optimization system right now. Once it has been accepted for publication, we will release the full source code on GitHub. The best way to prepare for Inferno is to download Blaze now and get familiar with it.

Page 37: Inferno Scalable Deep Learning on Spark

Deep Learning & Spark @ LaTrobe

Students

• Master of Data Science degree
  http://tinyurl.com/hf4wmn2
  Advanced data science lab established in 2016 with newest hardware.
  CSE5BDC: Big Data Management on the Cloud (I tutor this!)
  CSE5DEV: Data Exploration and Visualization (~50% lectures on deep learning)
  CSE5WDC: Web Development on the Cloud

• Research
  GPU research cluster capable of running distributed deep learning tasks.
  In-house development of a distributed deep learning system.
  Dedicated research group working with various Deep Learning systems.
  CSE4DLJ: Weekly Deep Learning Journal Club

Industry

• If you have a data analytics problem:
  … we have a dedicated deep learning research team!
  … and probably also a deep learning solution for it!

• Spark & Deep Learning workshops for Torch available on demand.

• Past & current machine learning research collaborations
  Alfred Hospital
  Zendesk
  AIS (Australian Institute of Sport)

• Contact: [email protected]