
IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Analysis and Comparison of Distributed Training Techniques for Deep Neural Networks in a Dynamic Environment

ERMIAS GEBREMESKEL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Analysis and Comparison of Distributed Training Techniques for Deep Neural Networks in a Dynamic Environment

ERMIAS GEBREMESKEL

Master in Computer Science
Date: June 26, 2018
Supervisors: Håkan Lane, Jim Dowling, and Robin Andersson
Examiner: Örjan Ekeberg
Swedish title: Analys och jämförelse av distribuerade träningstekniker för djupa neurala nätverk i en dynamisk miljö
School of Computer Science and Communication


Abstract

Deep learning models' prediction accuracy tends to improve with the size of the model. The implication is that the amount of computational power needed to train models is continuously increasing. Distributed deep learning training tries to address this issue by spreading the computational load onto several devices. In theory, distributing computation onto N devices should give an N-fold performance improvement, yet in practice the speedup is rarely N-fold, due to communication and other overheads. This thesis studies the communication overhead incurred when distributing deep learning training.

Hopsworks is a platform designed for data science. The purpose of this work is to explore a feasible way of deploying distributed deep learning training on a shared cluster and to analyze the performance of different distributed deep learning algorithms to be used on this platform.

The findings of this study show that bandwidth-optimal communication algorithms like ring all-reduce scale better than many-to-one communication algorithms like parameter server, but are less fault tolerant. Furthermore, the collected system usage statistics revealed a network bottleneck when training is distributed over multiple machines. This work also shows that it is possible to run MPI on a Hadoop cluster, by building a prototype that orchestrates resource allocation, deployment, and monitoring of MPI-based training jobs. Even though the experiments did not cover different cluster configurations, the results are still relevant in showing what considerations need to be made when distributing deep learning training.

Keywords: deep learning, large scale distributed deep learning,data parallelism.


Sammanfattning

The prediction accuracy of deep learning models tends to improve with the size of the model. The implication is that the amount of computational power required to train models increases continuously. Distributed deep learning tries to solve this problem by distributing the computational load across several devices. In theory, distributing the computation onto N devices would give linear scalability (N-fold). In reality this rarely holds, due to overhead from network communication or I/O.

Hopsworks is a data analytics and machine learning platform. The purpose of this work is to explore a feasible way of performing distributed deep learning training on a shared compute cluster, and to analyze the performance of different distributed deep learning algorithms for use on the platform.

The results of this study show that bandwidth-optimal algorithms such as ring all-reduce scale better for distributed deep learning than many-to-one communication algorithms such as parameter server, but are less fault tolerant. Data collected from the experiments showed a bottleneck in the network when training on multiple machines. This work also shows that it is possible to run MPI programs on a Hadoop cluster by building a prototype that orchestrates resource allocation, deployment, and monitoring of the execution. Although the experiments do not cover different cluster configurations, the results show which factors should be taken into account when training deep learning models in a distributed manner.


Contents

1 Introduction
  1.1 Research Question
  1.2 Scope
  1.3 Sustainability and Relevance

2 Background
  2.1 Training Neural Networks
    2.1.1 Stochastic Gradient Descent (SGD)
  2.2 Distributed Training in Deep Learning
    2.2.1 Distributing SGD using MapReduce
  2.3 Algorithms for Collective Communication
    2.3.1 Message Passing Interface (MPI)
  2.4 Data Parallelism
    2.4.1 Synchronous SGD
    2.4.2 Asynchronous SGD
    2.4.3 Parameter Server
    2.4.4 Ring All-Reduce
  2.5 Model Parallelism
    2.5.1 Partitioning Neural Network Model Graphs
    2.5.2 Device Placement Optimization
  2.6 Hybrid Data and Model Parallelism
  2.7 TensorFlow
    2.7.1 Distributed TensorFlow
  2.8 Resource Management in Hops-YARN
  2.9 Spark on YARN
    2.9.1 TensorFlow On Spark
  2.10 Evaluation
  2.11 Related Work

3 Method
  3.1 Datasets and Models
  3.2 Cluster Setups
    3.2.1 Hardware Specification
    3.2.2 Distributed Deep Learning and Big Data Frameworks
  3.3 Experiment Design
    3.3.1 Batch Size
    3.3.2 Number of Workers
    3.3.3 System Monitoring
  3.4 Model Deployment
    3.4.1 Parameter Server with TensorFlowOnSpark
    3.4.2 Ring All-Reduce with Horovod
    3.4.3 Evaluation of Model Deployment
  3.5 Data Collection

4 Results
  4.1 Parameter Server
    4.1.1 Synchronous Update
    4.1.2 Asynchronous Update
    4.1.3 Multi-Node Synchronous Update
  4.2 Ring All-Reduce
  4.3 Scalability
  4.4 Resource Utilization
  4.5 Model Deployment

5 Discussion and Conclusion
  5.1 Scalability
    5.1.1 Number of Parameter Servers
    5.1.2 Asynchronous Update
    5.1.3 Ring All-Reduce
  5.2 Fault Tolerance
  5.3 Resource Utilization
  5.4 Model Deployment
  5.5 Possible Sources of Error
  5.6 Conclusion
  5.7 Future Work

Bibliography

A Complete Results


List of Figures

2.1 Parameter server model
2.2 Model parallelism

3.1 Cluster setup
3.2 Starting and monitoring MPI

4.1 Processed images/sec using parameter server sync update mode
4.2 Processed images/sec using parameter server async update mode
4.3 System usage parameter server multi-node
4.4 Processed images/sec using ring all-reduce
4.5 Speedup with parameter server
4.6 Speedup with ring all-reduce
4.7 Ring all-reduce and parameter server system usage
4.8 One and two parameter servers system usage

5.1 InfiniBand usage in ring all-reduce


List of Tables

3.1 Models used in experiments
3.2 Experiment setup

A.1 Parameter server sync update results
A.2 Two parameter servers sync update results
A.3 Parameter servers async update results
A.4 Ring all-reduce results


List of Listings

1 Sample cluster specification
2 Sample mpirun hostfile
3 Sample mpirun script
4 System usage commands


Chapter 1

Introduction

Statistical machine learning initially operated on handcrafted features extracted from datasets by human experts. The handcrafted features were then used to train a model using methods like maximum likelihood, Support Vector Machines, k-Nearest Neighbors, k-means, decision trees, and regression algorithms. This approach works for small tasks where the informative features of a dataset are easy to identify. Unfortunately, identifying informative features in real-world problems like computer vision, speech recognition, and natural language processing is extremely difficult.

Deep neural networks attempt to solve this problem by including feature extraction in the learning process. This learning of hierarchical features from raw data with no task-specific prior knowledge is much harder than training prediction models, and therefore requires significantly more training data and computational resources. The recent success of deep learning (DL) is thus largely attributed to advances in computing capability and the availability of large amounts of labeled data [18, 26, 27]. Even when done on GPUs, training deep learning models on large datasets can take an excessively long time on a single machine. Furthermore, training on a single machine with limited resources restricts the size of the model that can be trained.

Distributed deep learning tries to address these limitations by decomposing a machine learning problem onto multiple machines and devices. The decomposition can be done across two dimensions: (1) the data dimension and (2) the model dimension. However, scaling up is not merely a matter of adding computational resources. The main consideration when distributing computation onto multiple machines is the communication cost, in particular the ratio of computation to data-management housekeeping.

Distributed deep learning training is only efficient if the parameters being communicated are also computationally expensive. Amdahl [4] speculates that if the data-management overhead in a parallel program averages around 10% of the operations, it will force at least 25% of the computation to be sequential.

In this paper, the performance and scalability of state-of-the-art distributed training algorithms are empirically analyzed and compared in a shared cluster environment with container-based resource management.

The rest of this paper is organized as follows. Chapter 2 gives the necessary background to follow this paper and some related work, Chapter 3 presents the research design and performed experiments, and Chapter 4 reports the findings of these experiments. Finally, Chapter 5 discusses the results and presents conclusions based on the analysis.

1.1 Research Question

The accuracy of deep learning models improves with the amount of data used for training and the size of the model [17, 11, 10, 28], requiring more computational resources.

Researchers and professionals that have access to GPU clusters typically use them for: (1) running parallel experiments (for example, to establish good hyper-parameters, learning rate, number of layers, choice of model architecture, etc.) [6] and (2) expensive training jobs, distributed over many GPUs on many servers [23].

Available deep learning frameworks like TensorFlow [1], Caffe [24], Keras [9], Torch [12], and others use different approaches to scale out and speed up training through distribution. Some, like TensorFlow, Caffe2, and Torch, have native support for distributed training, while others, like Caffe (1.0), rely on Apache Spark [47], a cluster computing framework that uses a programming model similar to MapReduce [14], to handle distribution. However, choosing the right distributed training system architecture for a given model is far from trivial. When training deep neural networks in a distributed manner, one needs to know:

1. Whether the performance is affected by the number of parameter servers.


2. Whether updating parameters in an asynchronous manner improves performance.

3. How the performance scales with each added worker when using ring all-reduce compared to a parameter server.

4. How easy it is to deploy a model, with the chosen distributed training system architecture, on a cluster with shared resources.

In this work, these questions will be analyzed for two distributed training system architectures, to identify bottlenecks and limitations in the methods. This paper will also explore mechanisms for deploying distributed deep learning training on a cluster with shared resources.

1.2 Scope

Running parallel experiments to establish good hyper-parameters and/or model architectures is a less interesting problem to analyze because: (1) there is no communication between the processes running in parallel and (2) the amount of training data needed is significantly less than for actual model training. Thus, the analysis in this paper is only concerned with expensive training jobs that can benefit from distributed training, focusing mainly on the communication overhead incurred when training is distributed. When distributing deep learning training, some loss of accuracy might occur, or more training may be needed to achieve the same level of accuracy as in non-distributed training. In this work, the loss of accuracy that might occur when distributing training is not investigated. Moreover, the performance of distributed deep learning training can be affected by other factors, like disk I/O, which are ignored in this work.

Although deep learning can refer to any neural network with a large number of layers and parameters, here we will only be concerned with models trained using some variant of Stochastic Gradient Descent (SGD), to limit the scope, but the analysis should generalize to most deep learning models.

1.3 Sustainability and Relevance

Distributing training requires a lot of computational resources. Thus, the economic and environmental costs of using these resources need to be considered.


This work will help in identifying the cost of distribution and whether this cost is justified by the gain. This is done by showing how much speedup is achieved with each added computational resource. This study will also help in identifying bottlenecks in system setups that can hamper the performance gain obtainable through distribution.

The company Logical Clocks AB is a startup with the product Hadoop Open Platform-as-a-Service (Hops), a distribution of Apache Hadoop [46] that provides a web-based application to manage datasets and interactively analyze them. The Hops web-based application (Hopsworks) also provides CPU, GPU, and storage as managed resources on a cluster. This work is interesting to Logical Clocks because distributed deep learning is one of the services on Hopsworks, and any limitations or bottlenecks in the cluster currently hosting the platform need to be identified.

This work is also relevant for anyone who wants to distribute deep learning training, by showing the limitations and the considerations that need to be made when choosing a distributed training architecture.


Chapter 2

Background

Artificial Neural Networks (ANN) are computational models (algorithms or actual hardware) modeled after a highly simplified mammalian cerebral cortex neuronal structure. The basic computational unit of a neural network is the perceptron, a simplified neuron model capable of finding a decision surface for linearly separable patterns. By layering two or more of these perceptrons in a feed-forward manner, neural networks can approximate arbitrary functions [22]. These multilayered perceptrons are usually trained using backpropagation, first introduced by Rumelhart et al. [36]. Backpropagation works faster than previously known approaches to learning, making deeper networks feasible. Deeper networks increase the ANN's memorization capacity, making it possible to learn data representations, which is the basis for deep learning.

Deep learning, also modeled after a highly simplified neocortical neuronal structure, attempts to mimic neural coding. The success of deep learning models such as deep neural networks (DNN), deep belief networks (DBN), and recurrent neural networks (RNN) depends on their ability to learn data representations. Training these models is mainly done through backpropagation using gradient descent and requires significantly more training data and computational resources than shallow networks.

2.1 Training Neural Networks

Training neural networks in a supervised manner is done through the backpropagation algorithm, which comprises two phases: the forward pass, which maps input to output, and the backward pass, which takes the output, calculates the error by comparing it with the desired output, and propagates the error back all the way to the input nodes.


The weights associated with individual links in the network are updated in proportion to the portion of the error propagated back to them.

The weight updates can be done either in batches (where all the training samples pass through the algorithm before the weights are updated) or after each training sample, presented in random order. Stochastic Gradient Descent (or some flavor of it, like AdaGrad, RMSProp, or Adam) is the most commonly used update rule (optimizer) for training deep neural networks [10, 7]. This process is repeated for all training samples multiple times (epochs) until convergence, i.e. when the error drops below a given value or when the validation error starts to go up.

Most of the operations in the backpropagation algorithm can be performed as matrix-vector operations, which are highly parallel and numerically intensive. They can thus be offloaded to the many cores of a GPU, which can perform floating-point arithmetic at a much higher rate than a CPU.

2.1.1 Stochastic Gradient Descent (SGD)

The batch gradient descent algorithm starts with a randomly initialized parameter vector θ and repeatedly updates it with the gradient of the objective function J(θ):

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (\text{for } j = 0, \ldots, n) \tag{2.1}$$

where α is the learning rate. For the least mean squares (LMS) cost function:

$$J(\theta) = \frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2 \tag{2.2}$$

$$\frac{\partial}{\partial \theta_j} J(\theta) = \bigl(h_\theta(x) - y\bigr)\, x_j \tag{2.3}$$

For m training examples the update rule can be written as:

$$\theta_j := \theta_j + \alpha \frac{1}{m} \sum_{i=0}^{m} \bigl(y_i - h_\theta(x_i)\bigr)\, x_{ij} \qquad (\text{for all } j) \tag{2.4}$$


Therefore, the computational cost of gradient descent scales linearly with the training dataset size (m in 2.4). Stochastic gradient descent approximates gradient descent by sampling a single training example uniformly at random and performing the update. For i ∈ (0, ..., m) the update rule is given by:

$$\theta_j := \theta_j + \alpha \bigl(y_i - h_\theta(x_i)\bigr)\, x_{ij} \qquad (\text{for all } j) \tag{2.5}$$

Mini-batch gradient descent is another update rule, which uses b random examples to perform the parameter update. For mini-batch gradient descent, equation 2.5 can be rewritten as:

$$\theta_j := \theta_j + \alpha \frac{1}{b} \sum_{k=i}^{i+b} \bigl(y_k - h_\theta(x_k)\bigr)\, x_{kj} \qquad (\text{for all } j) \tag{2.6}$$

for i := 0, ..., (m − b) and 1 < b < m. When b = 1 this algorithm is identical to SGD (equation 2.5) and when b = m it is batch gradient descent (equation 2.4).
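As a concrete illustration, the following is a minimal sketch (not code from the thesis) of mini-batch gradient descent for a linear hypothesis h_θ(x) = θᵀx, following equation 2.6; setting b = 1 recovers SGD (equation 2.5) and b = m recovers batch gradient descent (equation 2.4). All names are illustrative.

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=32, epochs=10, seed=0):
    """Mini-batch gradient descent for a linear hypothesis h_theta(x) = X @ theta
    (equation 2.6). b=1 gives SGD (eq. 2.5); b=len(y) gives batch GD (eq. 2.4)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = rng.normal(size=n)              # randomly initialized parameters
    for _ in range(epochs):
        order = rng.permutation(m)          # present samples in random order
        for start in range(0, m, b):
            idx = order[start:start + b]
            residual = y[idx] - X[idx] @ theta                 # (y_k - h_theta(x_k))
            theta += alpha / len(idx) * (X[idx].T @ residual)  # update every theta_j
    return theta
```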

2.2 Distributed Training in Deep Learning

Deep learning models need to be trained on big datasets. Even when training is done on GPUs, it can take days or even weeks on a single machine. There are two dimensions over which deep learning model training can be parallelized: data parallelism (across the data dimension), where each machine contains a complete model replica but processes only part of the data samples, and model parallelism (across the model dimension), where different parts of a model run on different machines in parallel. In both schemes some kind of synchronization between workers is required: in model parallelism neuron activities need to be communicated, while in data parallelism model parameters (weights and biases) are communicated to ensure all model replicas are trained evenly [25].

The performance of these parallelization schemes is highly dependent on the model architecture. Notably, parallelism is only efficient if the parameters (units) being communicated are also computationally expensive. Therefore, data parallelism is more efficient when the per-weight computation is high, while model parallelism is more efficient when the neuron activity is computationally expensive [25].


To make full use of distributed training, both dimensions of parallelism should be exploited to a degree that takes into account the communication architecture of the model. Other factors like weight update schemes, batch size, and communication algorithms can also affect the performance of these parallelization schemes.

2.2.1 Distributing SGD using MapReduce

The MapReduce [14] approach is applicable to problems that can be expressed as computing sums of functions over a training dataset. MapReduce refers to two separate tasks: (1) the map job and (2) the reduce job. The map job takes the data and computes a result in parallel on multiple workers that can be spread across multiple devices or machines. The reduce job then takes the output from all mappers and combines it to give the cumulative result. The mini-batch gradient descent algorithm given in equation 2.6 can be distributed using the map-reduce paradigm onto n machines as follows:

$$\text{Machine}_{lj} := \sum_{k=i}^{i+b/n} \bigl(y_k - h_\theta(x_k)\bigr)\, x_{kj} \qquad (\text{for all } j) \tag{2.7}$$

where l := 1, ..., n. Then a single reduce job will calculate:

$$\theta_j := \theta_j + \alpha \frac{1}{b} \sum_{l=1}^{n} \text{Machine}_{lj} \qquad (\text{for all } j) \tag{2.8}$$

The parameter θ is then broadcast to all machines to calculate the next iteration of the loss function. This would potentially give an n-fold speedup if there were no network latency.
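A minimal sketch of this map-reduce decomposition (equations 2.7 and 2.8) is shown below. The machines are simulated within a single process purely for illustration, and all names are hypothetical rather than taken from an actual MapReduce framework.

```python
import numpy as np

def map_partial_gradient(X_l, y_l, theta):
    """Map job (eq. 2.7): one machine's partial gradient sum over its slice of the batch."""
    residual = y_l - X_l @ theta
    return X_l.T @ residual                     # one entry per parameter j

def reduce_update(theta, partials, alpha, b):
    """Reduce job (eq. 2.8): sum the partial results and apply a single update."""
    return theta + alpha / b * np.sum(partials, axis=0)

def distributed_step(X_batch, y_batch, theta, alpha, n=4):
    """Split one mini-batch across n simulated machines, map, then reduce."""
    X_parts = np.array_split(X_batch, n)
    y_parts = np.array_split(y_batch, n)
    partials = [map_partial_gradient(Xp, yp, theta)
                for Xp, yp in zip(X_parts, y_parts)]
    return reduce_update(theta, partials, alpha, b=len(y_batch))
```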

2.3 Algorithms for Collective Communication

Collective communication algorithms are a set of algorithms designed for communication involving multiple processes. The three collective communication algorithms of interest for this study are: one-to-all broadcast (when the parameter server broadcasts the global model), all-to-one reduction (when all workers push their gradients to be reduced by the parameter server), and all-reduce. All collective communication algorithms try to minimize the overhead of point-to-point message passing.


The communication overhead of point-to-point message passing is given by:

$$\text{overhead} = t_s + t_w m \tag{2.9}$$

where t_s is the latency, t_w is the per-word transfer time (inverse bandwidth), and m is the message size in number of words. If the link is shared by more than one message, the bandwidth is divided among the messages, i.e. the effective per-word transfer time for each message grows to t_w M, where M is the number of messages that share the link.

This shows that distributed training is highly dependent on the amount of data that needs to be sent, the throughput of the communication channel, and the number of workers using the communication channel.

Unlike all-to-one reduction, all-reduce algorithms are independent of the number of processes that need to communicate. In all-reduce, each worker communicates with only two neighbors: it receives a message from the neighbor on one side, combines it with its local message, and passes the result to the neighbor on the other side. Hence the number of messages sharing the same link does not grow with the number of processes, and the effective per-word transfer time stays t_w regardless of the number of workers.
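The difference can be illustrated with a small back-of-the-envelope calculation based on equation 2.9. The latency, per-word time, and message size below are invented for illustration (they do not describe the thesis's cluster), and the shared-link penalty is modeled simply as a factor on the per-word time.

```python
def transfer_time(ts, tw, m, sharing=1):
    """Point-to-point overhead from eq. 2.9: latency plus per-word transfer time.
    `sharing` models the number of messages competing for the same link, which
    multiplies the effective per-word time."""
    return ts + (tw * sharing) * m

# all-to-one reduction: N workers push their full gradients over the same link;
# ring all-reduce: each worker only ever talks to its two neighbours
N  = 8            # workers (illustrative)
ts = 5e-6         # latency in seconds (illustrative)
tw = 1e-9         # per-word transfer time in seconds (illustrative)
m  = 100_000_000  # message size in words, roughly a large model's gradient

print("all-to-one:", transfer_time(ts, tw, m, sharing=N), "s")
print("ring step: ", transfer_time(ts, tw, m, sharing=1), "s")
```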

2.3.1 Message Passing Interface (MPI)

The Message Passing Interface (MPI) standard is a cross-platform API that allows distributed-memory parallel programs to exchange data by abstracting the underlying network procedures. MPI is available on single compute nodes for inter-process communication and on clusters for communication between nodes.

MPI can take advantage of high-performance interconnects with high throughput and low latency, such as InfiniBand, Intel Omni-Path, and Cray interconnects, which are not available in Spark, MapReduce, or gRPC (Google RPC, used by distributed TensorFlow). MPI has corresponding functions for all the collective communication algorithms mentioned above: MPI_Bcast (one-to-all broadcast), MPI_Reduce (all-to-one reduction), MPI_Allreduce (all-reduce), and others. Open MPI [15] is an open-source implementation of MPI developed and maintained by the high-performance computing community.


Open MPI is used in implementations of parallel and distributed deep learning training frameworks like Baidu's contribution to TensorFlow (tensorflow-allreduce), Theano-MPI [31], and Horovod [38].
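As a minimal illustration of the all-reduce primitive (assuming mpi4py is installed; this is not code from the thesis), each rank contributes a local "gradient" and MPI_Allreduce leaves the element-wise sum on every rank, mirroring the gradient aggregation step of data-parallel training without any parameter server:

```python
# Run with e.g.: mpirun -np 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_grad = np.full(4, float(rank))            # stand-in for this worker's gradient
summed = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed, op=MPI.SUM)  # every rank ends up with the sum
print(f"rank {rank}: {summed}")
```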

2.4 Data Parallelism

Scalability challenges that arise from the massive data volumes used to train deep learning models can be alleviated by leveraging data parallelism. In data parallelism, replicas of the same model run on multiple workers on different subsets of the training data, in parallel. Thus, data parallelism requires keeping a global model and some way of updating its parameters by gathering results from the workers.

Stochastic gradient descent (SGD) is an inherently sequential algorithm [49]. To perform SGD in parallel, a global model is maintained and workers send their gradients, which are aggregated, to the global model. This gradient update can be performed in two different ways:

• Synchronous SGD, where gradient updates are made when all workers are done calculating their respective gradients, and

• Asynchronous SGD, where gradient updates are made incrementally as local workers finish calculating their respective gradients.

Finally, the updated global model is broadcast to all workers. Thus, techniques common in collective communication and high-performance computing (HPC) are highly relevant for model gradient update and propagation.

2.4.1 Synchronous SGD

In synchronous SGD, a global model is kept that is updated with the aggregate of all worker gradients. This updated global model is then sent to all workers, making synchronous SGD a true mini-batch stochastic gradient descent, where the mini-batch size is the sum of the mini-batches of all workers.

Synchronous SGD has two obvious drawbacks: (1) the update time is dependent on the slowest worker and (2) the scalability of the approach is dependent on the batch size of each worker (i.e. bigger mini-batches on workers will limit the number of workers that can be added, because very big batch sizes can affect the convergence rate of SGD), although training with large mini-batch sizes of up to 32k with no loss in accuracy has been demonstrated [16, 3].


2.4.2 Asynchronous SGD

Downpour SGD, first introduced in [13], used asynchronous stochastic gradient descent, where each worker pushes its gradients ∆w and gets back an updated global model W, independently of other workers.

Asynchronous SGD has two advantages: (1) the performance of the algorithm is not affected by a slow worker and (2) it is fault tolerant. Fault tolerance is a challenge faced when implementing any distributed framework. Long-running applications, like deep learning, are susceptible to faults and require robust and fault-tolerant frameworks. The task failure rate of 10k machine-hour jobs of batch machine learning tasks was reported to be 24.7% in [30].

While asynchronous SGD solves the bottleneck introduced by the slowest worker in synchronous SGD, it suffers from the problem of delayed gradients. This is encountered when a slow worker pushes its gradients to the global model after the model has already been updated by other workers. Different techniques have been proposed to solve the problem of delayed gradients [48, 8]. While delayed gradients can create some noise in the global model and delay convergence, deep neural networks can recover and learn successfully [8].

2.4.3 Parameter Server

The parameter server framework introduced in [42] is widely adopted [2, 21, 30] as an efficient solution for scaling machine learning algorithms. Parameter server frameworks distribute data and model parameters across multiple nodes to spread the workload.

In the parameter server architecture (shown in Figure 2.1) a distributed key-value store is used for synchronizing parameters between workers. Parameter servers store all parameters, while workers are stateless but can cache parameters across iterations. In a straightforward configuration with a single parameter server and multiple workers, each worker computes a gradient on its subset of the mini-batch and sends it to the parameter server, which then takes the average of all the gradients and broadcasts it back to all workers. If we imagine a neural network with 100 million trainable parameters (not uncommon in deep learning), where each parameter is four bytes, roughly 400 megabytes of data need to be communicated per worker.


Figure 2.1: Parameter server model. Training data is divided across machines, each containing a model replica. The parameter server collects model gradients ∆W from each machine and returns an aggregated model parameter W that is used to update every model replica. (image source [13])

In the configuration above, where all workers share the same bandwidth, the communication cost grows linearly with the number of workers. To make further parallelization practical, the synchronization constraint can be removed (asynchronous SGD) or a communication algorithm whose cost is independent of the number of workers (all-reduce) can be used.

The synchronization constraint is the main bottleneck in this framework, so some implementations loosen this constraint to reduce communication overhead [20, 30], while others, like Baidu's ring all-reduce, take advantage of bandwidth-optimal communication algorithms without loosening the synchronization constraint.
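A rough estimate of the traffic implied by the two approaches can be sketched as follows, using the 100-million-parameter example from the text and the standard 2(N-1)/N bound for the ring algorithm; the numbers are illustrative, not measurements from the thesis.

```python
def ps_bytes_at_server(params, workers, bytes_per_param=4):
    """All-to-one: the parameter server receives every worker's full gradient and
    sends the updated model back, so its traffic grows linearly with the workers."""
    return 2 * params * bytes_per_param * workers

def ring_bytes_per_worker(params, workers, bytes_per_param=4):
    """Ring all-reduce: each worker sends about 2(N-1)/N of the gradient size per
    update, which is essentially independent of N for large N."""
    return 2 * (workers - 1) / workers * params * bytes_per_param

params = 100_000_000   # the 100M-parameter example from the text (~400 MB gradient)
for w in (2, 4, 8, 16):
    print(w, "workers:",
          f"PS link {ps_bytes_at_server(params, w) / 1e9:.1f} GB,",
          f"ring per worker {ring_bytes_per_worker(params, w) / 1e9:.2f} GB")
```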

2.4.4 Ring All-Reduce

The bottleneck created by sending data to a single parameter server in all-to-one reduction can be alleviated with ring all-reduce, an algorithm common in the field of high-performance computing. In ring all-reduce each worker is assigned two neighboring workers: one to send data to and another to receive from. The algorithm as presented in [50] consists of two stages:


1. Scatter-reduce: in this stage, each worker sends a chunk to and receives a chunk from its direct neighbors. After each data transfer iteration, a worker aggregates the chunk it received with its local copy, and the transfers are repeated until each worker holds some part of the aggregated final value that includes contributions from all workers.

2. All-gather: in this stage each worker receives a final value from its respective (sender) neighbor. After each data transfer step, the worker replaces its value with the newly received one, and the transfers continue until every worker has received the contributions from all other workers.

The speed of the ring all-reduce algorithm is independent of the number of workers; instead, it is limited by the slowest communication link between neighboring workers. The ring all-reduce algorithm can be applied to deep learning in the same way as the parameter server framework, but instead of sending gradients to a single parameter server they are sent to immediate neighbors. Furthermore, communication can be overlapped with the gradient computation, by sending the gradients of the output layer while those of the other layers are still being computed [50].
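The two stages can be made concrete with a small single-process simulation (illustrative only; a real implementation such as Horovod or NCCL exchanges the chunks over the network, with all workers sending in parallel at each step):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce over a list of equal-length gradient vectors,
    one per worker; every worker ends up with the element-wise sum."""
    n = len(grads)
    # each worker splits its gradient into n chunks
    chunks = [np.array_split(np.array(g, dtype=float), n) for g in grads]

    # stage 1: scatter-reduce -- after n-1 steps worker i owns the fully
    # reduced chunk (i + 1) % n
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n
            chunks[(w + 1) % n][c] += chunks[w][c]

    # stage 2: all-gather -- the reduced chunks travel around the ring until
    # every worker holds every reduced chunk
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n
            chunks[(w + 1) % n][c] = chunks[w][c]

    return [np.concatenate(c) for c in chunks]

worker_grads = [np.arange(8) * (i + 1) for i in range(4)]
result = ring_allreduce(worker_grads)
assert all(np.allclose(r, np.sum(worker_grads, axis=0)) for r in result)
```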

2.5 Model Parallelism

Deep learning models can have billions of parameters [13]. At four bytes per parameter, these models can reach gigabytes in size and cannot fit into the memory of a single GPU. Model parallelism can be used to address this problem.

In model parallelism, synchronization between workers is done when one worker needs the neuron activities output by another worker as input. To minimize the communication overhead, a neural network model graph can be partitioned in such a way that the edges running between separated components of the model (shown as thicker lines in Figure 2.2) are few, or the amount of data that flows through those edges is low. However, model parallelism with the goal of minimizing execution time cannot be achieved solely by partitioning. Operations also need to run in a particular order, to make sure outputs from a task are available when they are needed as input by other tasks (scheduling).


Figure 2.2: Model parallelism. Gray boxes represent devices (GPU or CPU) or machines. The thicker edges show communication across machines or devices. (image source [13])

Both finding an optimal partitioning of a dataflow graph and scheduling are NP-complete problems [32]. However, there are suboptimal heuristics-based algorithms both for scheduling and for partitioning computational graphs.

2.5.1 Partitioning Neural Network Model Graphs

Graph partitioning can be used for load balancing tasks onto multiple devices while minimizing communication. Graph partitioning can be defined as follows:

Given a directed acyclic graph G = (N, E, W_N, W_E), where N are the nodes, E the edges, W_N the node weights, and W_E the edge weights, find a partition that distributes the load W_N evenly while minimizing the sum of all edge weights connecting the different partitions.

In a neural network computational graph, N represents the computational operations, W_N is the cost of operation N, and an edge E_(i,j) with weight W_E is the amount of data that flows between i and j.


2.5.2 Device Placement Optimization

As shown in [33], graph partitioning algorithms do not produce satisfactory placements, because modeling cost estimates for the graphs is expensive. Instead, [33] proposes a reinforcement learning model for device placement optimization that outperforms both graph partitioning algorithms and expert human placements.

Reinforcement Learning for Device Placement Optimization

Reinforcement learning is a part of machine learning that is inspired by the reward system (dopamine pathway) of the brain. Unlike supervised learning, where training data is labeled with the true class, reinforcement learning agents learn from experience.

By using the execution time of a proposed placement, executed on actual hardware, as a reward signal, a reinforcement learning agent can learn to optimize device placement [33]. An agent trained this way was reported to be 3.5 times faster than SCOTCH's [35] graph-partitioning-based placement and up to 20% faster than human experts' placements.

2.6 Hybrid Data and Model Parallelism

No single dimension of parallelism is better than the other. Which scheme to use should be informed by the communication architecture of a model, the amount of training data, and the size of the model. Many distributed deep learning frameworks use both schemes [8, 34]. One useful observation made in [25] is that convolutional neural networks have two types of layers: convolutional layers contain 90-95% of the computation but only about 5% of the parameters, while fully-connected layers contain only about 5-10% of the computation but 95% of the parameters. This shows that data parallelism is more appropriate for convolutional layers, and model parallelism for fully-connected layers.

2.7 TensorFlow

TensorFlow is an open-source framework for building and deploying machine learning models. It is a second-generation framework derived from DistBelief [13], a Google Brain project.


TensorFlow has support for both CPUs and GPUs (using NVIDIA CUDA), on a single node or on a cluster with multiple nodes. Currently it is the only machine learning framework supported on Hops-Hadoop's data management and analysis platform.

The TensorFlow programming framework contains three basic concepts:

1. Tensors are the main computational units in TensorFlow. A tensor is a multidimensional array with a rank representing its dimension.

2. Graphs represent the dataflow between computations in a TensorFlow program, where graph edges and vertices represent dataflow and operations, respectively.

3. Sessions hold information about the TensorFlow graph, run the computations described by the graph, and provide access to hardware resources on local or remote devices.

2.7.1 Distributed TensorFlow

TensorFlow supports distributed training both for data and model parallelism, and it allows both synchronous and asynchronous training. TensorFlow also supports ring all-reduce on a single node using NCCL (NVIDIA's library for collective communication). TensorFlow distributes training by creating a cluster of tasks that execute a TensorFlow graph. Distributed execution is achieved using two objects:

1. Server is used to create a session for each task.

2. Worker executes operations in a graph.

A specification dictionary is used to map job names to network addresses. These job names can then be used in TensorFlow code to specify what part of the execution will run on which worker, while the network addresses are used for communication.
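A minimal sketch of such a specification dictionary and the resulting server, using the TensorFlow 1.x API available at the time of the thesis (hostnames, ports, and the toy graph are placeholders; in a real deployment each task runs its own copy of this code with its own job_name and task_index, and a parameter server task would simply call server.join()):

```python
import tensorflow as tf  # TensorFlow 1.x

# specification dictionary mapping job names to network addresses
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],
})

# one server per task; this copy plays worker 0
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# variables are pinned to the parameter server, compute ops to the worker
with tf.device("/job:ps/task:0"):
    w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
with tf.device("/job:worker/task:0"):
    update = tf.assign_add(w, tf.ones([10]))

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))
```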

2.8 Resource Management in Hops-YARN

Hops-YARN is a distribution of Apache Hadoop YARN [45] that has moved the StateStore to a transactional in-memory database (MySQL NDB Cluster).


An application that is submitted to a Hadoop cluster is assigned the required resources and monitored through YARN (Yet Another Resource Negotiator). YARN cluster management includes resource management, monitoring, and scheduling. These tasks are performed by three separate components:

1. The ResourceManager (RM) is the central authority responsible for mediating resources for applications running in the cluster, with tasks including scheduling and resource management.

2. The NodeManager (NM) is tasked with reporting resource availability and faults, and with container lifecycle management (e.g., starting and killing containers) on each node in a cluster.

3. The ApplicationMaster (AM) manages all lifecycle aspects of a job. The AM can run arbitrary user code and communicates with the RM to issue resource requests that can contain locality preferences.

The Hadoop version (2.8) used in Hops has support for managing memory and CPU as resources, but GPUs are not managed natively by this version of Hadoop YARN. Hops-Hadoop YARN added GPUs as managed resources in 2017 [5].

2.9 Spark on YARN

Apache Spark [47] is a cluster computing framework that uses a programming model similar to MapReduce [14]. Launching an application with Spark involves five components:

1. The Driver launches the job and provides the code that will run on the workers.

2. The Cluster manager is used by Spark to ask for resources. The cluster manager can be standalone, Mesos, or YARN.

3. A Worker is a container that provides resources to the application.

4. An Executor is a JVM process that is created on a worker node by Spark.

5. A Task is a thread in an executor.


A Spark application can be deployed on YARN in two modes: cluster mode or client mode. In cluster mode, the driver process runs in the YARN AM, while in client mode the driver runs in the client that started the Spark job.

2.9.1 TensorFlow On Spark

TensorFlowOnSpark (TFoS) [29] is an open-source framework that enables distributed training and inference on Spark by using TensorFlow's own distributed deep learning capabilities. This allows TFoS to support all the distributed deep learning methods in TensorFlow. TFoS supports communication over both Ethernet and RDMA over Converged Ethernet (RoCE). The GPU support in TFoS is not managed, i.e. the program will take any available GPU on a cluster. This makes it unsuitable for a shared cluster where GPUs are managed resources. The Hops team has released a distribution of TFoS that complies with the YARN resource management restrictions, for use with the Hops-Hadoop platform.

2.10 Evaluation

The performance of a distributed training algorithm is measured by the amount of time it takes to train a model with no significant loss of accuracy. Scalability is another metric that can be used to evaluate distributed training algorithms. A distributed training algorithm is scalable if the running time of the algorithm is reduced with every added computational resource and the additional cost is justified by the performance gain. Under ideal scaling, the running time of the algorithm improves in proportion to the added computational resources (linear scalability).

Benchmarks for distributed training are usually done with both synthetic and real data. The synthetic data helps to remove disk I/O from the equation. Real data can then be used to verify the results and measure the impact of disk I/O.


2.11 Related Work

Distributed implementations of deep learning training are important for efficiency, for achieving better accuracy, and for quicker trial and error in experiments. Given this fact, much work has been put into building distributed frameworks for training deep networks and analyzing their performance in single-GPU, multi-GPU, and multi-node environments [39, 40]. Frameworks like Project Adam [8] and SINGA [34] attempt to solve the performance and scalability issues of deep learning training by building highly customized solutions. Others, like TensorFlowOnSpark, integrate deep learning frameworks with a general-purpose batch computational framework like Spark, and Horovod [38] brings HPC techniques to deep learning. As shown in [16, 3], HPC techniques like ring all-reduce can achieve nearly linear scalability, but there is no native support in MPI for YARN-based resource management. Generally available machine learning frameworks like TensorFlow, Caffe2, and Torch, which support distributed training natively, have no built-in support for deploying a model in a shared cluster environment.

There is a clear mismatch between shared clusters managed by YARN and machine learning frameworks, which is addressed by this project. This work explores the feasibility of running machine learning frameworks and HPC techniques in a shared cluster managed by YARN, and analyzes the performance of different distributed deep learning algorithms in a shared cluster environment.

The frameworks used in the analyses are TensorFlowOnSpark and Horovod, customized for container-based resource management.

The tests were performed on a 2-node cluster in which each node was configured with 10 GPUs. The GPUs in the two nodes are GeForce GTX 1080 Ti (for a total of 20 GeForce GTX 1080 Ti). The nodes are connected via a 40 Gb/s InfiniBand connection and a 1 Gb/s Ethernet cable.


Chapter 3

Method

Distributing the training of deep learning models is important for achieving better accuracy, allowing quicker trial and error in experiments, and general efficiency. But to be efficient, a distributed training algorithm must take into consideration the communication cost incurred when training is done in parallel. There are many distributed training architectures that try to address this cost. Here, two widely used distributed training algorithms, parameter server and ring all-reduce, are analyzed using image recognition models on a Hops-Hadoop cluster with 20 GPUs. The analysis is done on the run-time performance of the models with different numbers of parameter servers, update modes, and communication algorithms. Given that Hops-Hadoop clusters are designed for multi-tenancy, a mechanism for deploying distributed training on a shared cluster is also presented.

3.1 Datasets and Models

Models that have won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [37] are widely used for performance benchmarking in distributed deep learning [3, 16, 39, 40]. InceptionV4 [43], ResNet-50 [19], AlexNet [26], ResNet152_v2 [19], and VGG16(19) [41] are used in the experiments in this study, with a synthetic ImageNet dataset, to analyze the performance of different distributed training schemes and measure the impact of the communication cost. Synthetic datasets are randomly generated pixel values that match the dimensions of the desired dataset (256 × 256 for ImageNet). Using synthetic datasets allows us to test the computational performance and communication overhead while removing disk I/O from the equation.


Table 3.1: Models used in the experiments, with number of parameters, operations, and accuracy

Model        # Parameters   Flops       Top-5 acc. (%)   Top-1 acc. (%)
AlexNet      ~60M           ~2.27 Bn    80.3             57.0
VGG16        ~138M          ~30.94 Bn   90.0             70.5
VGG19        ~143M          ~39 Bn      92.7             71.3
ResNet50     ~25M           ~10 Bn      92.9             75.8
ResNet152    ~60M           ~29.4 Bn    93.8             77.6
GoogLeNet    ~10M           ~3 Bn       92.1             <70
InceptionV4  ~65M           ~20 Bn      95.0             80.0

For the purposes of this paper, the disk I/O impact on scalability is not considered; therefore no tests with real data were carried out.

The models used for the experiments are implementations of the original models in TensorFlow benchmarks [44]. The models were chosen to represent different architectures and numbers of parameters. The models and their respective number of parameters, floating-point operations, and accuracy are shown in Table 3.1.

3.2 Cluster Setups

A 2-node GPU cluster was used for the experiments. The next two sections specify the hardware and software setup of both nodes.

3.2.1 Hardware Specification

The two machines in the cluster used for the experiments have identical hardware specifications, as shown below:

• CPU: 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

• Cores per socket: 8

• Threads per core: 2

• CPU MHz: 1273.699, min 1200, max 3000

• Memory: 256 GB


• GPU: 10x GeForce GTX 1080 Ti, PCIe Gen3, x16 (16 lanes)

• InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA, PCIe Gen2, x8 (8 lanes)

This gives a total of 64 cores, 512 GB of memory, and 20 GPUs. The 10 GPUs in one machine are divided into two groups of 5, placed on different PCIe buses, as shown in Figure 3.1. A connection between GPUs in different groups traverses a PCIe switch as well as a PCIe host bridge (typically the CPU), while within a group each GPU is connected to the others through a single PCIe switch. The two nodes are connected with a 40 Gb/s InfiniBand connection and a 1 Gb/s Ethernet cable.

InfiniBand performance might be affected by the bandwidth limitation introduced by PCIe. The maximum possible bandwidth of the PCIe link is estimated by multiplying the PCIe width and speed, subtracting roughly 20% for PCIe header overhead and about 1 Gb/s for error correction protocols. The speed and width of the PCIe expansion bus that the InfiniBand card supports are 5 GT/s (billion transfers per second) and x8, respectively, so the maximum possible bandwidth is 5G · 8 · (1 − 1/5) − 1G = 40G · 0.8 − 1G ≈ 31 Gb/s.
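The same estimate, written out as a small helper (the 20% header overhead and the ~1 Gb/s protocol cost are taken from the text; the function name is illustrative):

```python
def pcie_effective_bandwidth_gbps(speed_gt_per_s, lanes,
                                  header_overhead=0.20, protocol_cost_gbps=1.0):
    """Rough effective PCIe bandwidth: raw rate minus ~20% header overhead and
    ~1 Gb/s for error correction protocols, as estimated in the text."""
    raw_gbps = speed_gt_per_s * lanes
    return raw_gbps * (1 - header_overhead) - protocol_cost_gbps

# PCIe Gen2 x8 used by the InfiniBand HCA: 5 GT/s * 8 lanes -> ~31 Gb/s,
# below the 40 Gb/s nominal InfiniBand rate
print(pcie_effective_bandwidth_gbps(5, 8))
```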

3.2.2 Distributed Deep Learning and Big Data Frameworks

Both machines in the cluster were running CentOS 7.2 and were installed with the following software and distributed deep learning frameworks:

• Open MPI: version 3.0.1

• CUDA Toolkit: version 9.0

• NCCL: version 2.1.15

• cuDNN: version 7.0

• Hops-YARN: version 2.7

• Spark: version 2.2.1

• TensorFlow: version 1.7


Figure 3.1: Two nodes with 10 GPUs each and a 40 Gb/s InfiniBand connection running between them. The InfiniBand cards use a PCIe Gen2 expansion bus with a speed of 5 GT/s and a width of x8. The GPUs are mounted on a PCIe Gen3 expansion bus with a speed of 8 GT/s and a width of x16.

• TensorFlowOnSpark: version 1.3.5 (hopshadoop fork, https://github.com/hopshadoop/TensorFlowOnSpark)

• horovod: version 0.12.1

3.3 Experiment Design

The run-time performance of the distributed deep learning frameworks is measured by the number of images the different models are able to process per second in distributed mode. Because the batch size can affect the number of images per second that can be processed, all models were tested with the same batch size. The scalability of the models is then compared with the ideal scaling (i.e. the images/sec for one worker multiplied by the number of workers) to give a performance metric that is independent of the model used in the experiment. This was done for the parameter server and all-reduce algorithms with the two chosen frameworks, TensorFlowOnSpark and Horovod.
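This comparison can be expressed as a simple scaling-efficiency ratio; the numbers in the example below are invented for illustration, not results from the experiments.

```python
def scaling_efficiency(single_worker_ips, measured_ips, workers):
    """Measured throughput (images/sec) divided by ideal linear scaling,
    i.e. the single-worker throughput multiplied by the number of workers."""
    ideal_ips = single_worker_ips * workers
    return measured_ips / ideal_ips

# e.g. 200 images/sec on one GPU, 1500 images/sec measured on 10 GPUs -> 0.75
print(scaling_efficiency(single_worker_ips=200, measured_ips=1500, workers=10))
```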


3.3.1 Batch Size

The mini-batch size used to train a model affects: (1) how fast a modelconverges or if it converges at all (2) the amount of time spent on out-put calculation by each replica of a model, and (3) the amount of datathat needs to be read from disk before each mini-batch is processed(only in the case of real data experiment).

None of these three properties were analyzed in this work, so a mini-batch size of 32 per worker was used for all experiments; 32 was the largest batch size that fit in memory for some of the big models.

3.3.2 Number of Workers

The number of workers in the experiments was limited by the number of GPUs available in the cluster. In the parameter server experiments all 20 GPUs could not be used, because of the low bandwidth available between the two nodes and because TFoS does not support InfiniBand, so all parameter server experiments were limited to 10 GPUs. An experiment involving two nodes is presented in section 4.1.3 to show the network bandwidth bottleneck. Given the limited number of GPUs available, only one and two parameter servers were tested. In the hops-hadoop version of TensorFlowOnSpark, GPUs are not assigned to parameter servers, which allows for 10-worker experiments.

3.3.3 System Monitoring

System usage statistics were collected using the NVIDIA System Management Interface (nvidia-smi) and Collectl. Collectl was used to collect usage data on CPU, network, and InfiniBand, while nvidia-smi was used to get GPU usage statistics. Both monitoring services were started on both nodes at the start of each experiment and stopped when the experiment finished.

3.4 Model Deployment

Model deployment and resource allocation are done through Spark with the YARN cluster manager, both for TensorFlowOnSpark and Horovod.


TensorFlowOnSpark is built on top of Spark and thus needs little modification to work with hops-hadoop YARN. Horovod, on the other hand, only works with MPI. A framework around MPI, which uses Spark for resource allocation with YARN, was built to allow MPI to spawn processes in the cluster and have access to resources. In the next sections the design of the deployment architecture for both frameworks is described.

3.4.1 Parameter Server with TensorFlowOnSpark

TensorFlowOnSpark's GPU support is not managed. After each executor is started by Spark, it checks the GPU utilization by running nvidia-smi (NVIDIA System Management Interface) and takes any GPU with low memory utilization. This clearly creates a race condition where two or more executors check the utilization at the same time and start using the same GPU. On hops-hadoop, YARN manages the GPUs, and executors only get access to the GPUs allocated exclusively to the container they are running in.

Resource allocation and model deployment for TFoS on hops-hadoop include the following steps:

1. Spark makes a resource request to YARN with the number of executors, CPU cores per executor, memory per executor, and number of GPUs per executor.

2. When the resource request is fulfilled, TFoS starts a coordination server on the driver and sends its address to all executors along with the client code. This server waits until all executors are registered with their respective address and port.

3. After all executors are registered, TFoS creates a cluster specification containing parameter server and worker addresses. This is then used by TensorFlow to create a session for each task. A sample cluster specification with one parameter server and three workers is shown in Listing 1.

4. Parameter updates and gradient broadcasting are then handled by distributed TensorFlow.


Cluster spec: {
    'ps': ['10.0.1.15:4287'],
    'worker': ['10.0.1.16:4062', '10.0.1.16:3426',
               '10.0.1.18:4129']
}

Listing 1: Sample cluster specification with one parameter server and three workers. The cluster specification is a Python dictionary used by TensorFlow to create a session for each task.
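For reference, distributed TensorFlow (1.x) consumes a cluster specification like the one in Listing 1 roughly as sketched below; the job name and task index would normally be supplied to each executor by TFoS, and the worker branch shown here is only a minimal illustration:

import tensorflow as tf  # TensorFlow 1.x API

# Cluster specification as in Listing 1.
cluster = tf.train.ClusterSpec({
    'ps': ['10.0.1.15:4287'],
    'worker': ['10.0.1.16:4062', '10.0.1.16:3426', '10.0.1.18:4129'],
})

# Each task starts an in-process server for its slot in the cluster.
job_name, task_index = 'worker', 0   # supplied per executor by TFoS
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == 'ps':
    server.join()   # parameter servers block and serve variables
else:
    # Variables are placed on the ps tasks, ops on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()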

3.4.2 Ring All-Reduce with Horovod

Horovod relies on MPI to spawn processes and NCCL (NVIDIA's library for collective communication) to handle communication between processes. After MPI has started all processes and assigned a rank to each process, Horovod uses the global ranks to create a communication link between neighboring workers. This creates the ring in the ring all-reduce collective communication algorithm. Horovod broadcasts variables from the process with rank 0 to ensure consistent initialization. Each process pins a single GPU using its local rank and adds it to TensorFlow's device list, which maps physical GPU ids to virtual GPU ids. This ensures that no two processes are assigned the same GPU.
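A minimal sketch of how a Horovod (0.12-era, TensorFlow 1.x) training script wires this together; the single-variable model is only a stand-in for a real network:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # MPI has already started the processes and assigned ranks

# Pin one GPU per process using the local rank (its position on this machine).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Trivial stand-in model: a single variable fitted to a constant target.
w = tf.get_variable('w', initializer=0.0)
loss = tf.square(w - 1.0)

opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())  # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)  # averages gradients with ring all-reduce (NCCL/MPI)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)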

MPI processes are created using a script similar to the one shown in Listing 3. The mpirun program creates processes on remote machines using ssh. A hostfile can be used to specify where to create the processes or to limit the number of processes that can be created on each host, as shown in Listing 2.

#------- Host file --------
node1 max_slots=20
node2 max_slots=20
#--------------------------

Listing 2: Sample mpirun hostfile with 2 nodes, allowing a maximum of 20 processes per node. The hostfile is a newline-separated text file with one node per line.


% mpirun \
    -hostfile hostfile \
    -np 2 \
    -H node1 \
    -wdir WORKING_DIR \
    -x CUDA_VISIBLE_DEVICES="0,1" \
    python main.py : \
    -np 2 \
    -H node2 \
    -wdir WORKING_DIR \
    -x CUDA_VISIBLE_DEVICES="2,4" \
    python main.py

Listing 3: Sample mpirun bash script to start 2 processes per node on 2 nodes. The -wdir argument changes the working directory of the process and CUDA_VISIBLE_DEVICES makes the specified GPU ids visible to the process. The hostfile is similar to the one shown in Listing 2 and is used to limit the number of processes that can be created on each node.

Monitoring MPI processes

MPI processes and their resource usage can be monitored from the machine that runs the mpirun program using ompi-top. This MPI utility program takes the process id of the mpirun program (as shown in Listing 4) and returns usage statistics for all processes started by it.

% ompi-top -pid <mpirun process id>
% nvidia-smi --query-compute-apps=gpu_uuid,pid \
      --format=csv

Listing 4: Bash commands used for system usage statistics collection. The first command is similar to the Linux top command and displays the system usage of all processes started by the mpirun process identified by its pid. The second command lists the gpu_uuid and pid for every GPU that is being used by a process.


Monitoring GPU Usage

The GPU usage on the cluster is monitored using the NVIDIA System Management Interface (nvidia-smi). The command shown in Listing 4 returns all GPUs that are in use along with the process id of the program using them.
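The output of the second command in Listing 4 is easy to consume programmatically; the sketch below assumes nvidia-smi is on the PATH and uses its machine-readable csv,noheader format:

import csv
import subprocess

def gpus_in_use():
    """Return (gpu_uuid, pid) pairs for every GPU currently used by some process."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-compute-apps=gpu_uuid,pid', '--format=csv,noheader'],
        universal_newlines=True)
    return [tuple(cell.strip() for cell in row)
            for row in csv.reader(out.splitlines()) if row]

for gpu_uuid, pid in gpus_in_use():
    print('GPU %s is used by process %s' % (gpu_uuid, pid))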

MPI Wrapper

In the next three sections the complete architecture of the deployment mechanism for a ring all-reduce (Horovod) training job is presented. To deploy an MPI job on a hadoop cluster:

1. Resources need to be allocated.

2. An MPI process needs to be launched with a system user that has ssh access to all nodes in the cluster.

3. The processes spawned by MPI need to be monitored.

Resource Allocation

Resource allocation is done through Spark on YARN. After Spark acquires all the required resources, the driver starts a simple socket server to coordinate all workers. The driver then launches the client code on every executor with the address of the coordination server and waits until all workers report back with the necessary information. Each executor's client code collects its working directory, host name or IP address, assigned GPU uuid (gpu_uuid: NVIDIA GPU universally unique identifier), and any environment variables set for the executor, and sends them to the coordination server. Finally, after the coordination server has received all the responses, it makes a call to the MPI wrapper application's REST endpoint with the information from the executors and a path to the user program that MPI should launch.
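A minimal sketch of the executor-side client described above, assuming the coordination server accepts one JSON message per executor over a plain TCP socket (the field names are illustrative, not the exact ones used in the prototype):

import json
import os
import socket
import subprocess

def report_to_coordinator(coord_host, coord_port):
    """Collect this executor's placement information and send it to the driver."""
    gpu_uuid = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=uuid', '--format=csv,noheader'],
        universal_newlines=True).strip()
    info = {
        'workdir': os.getcwd(),
        'host': socket.gethostname(),
        'gpu_uuid': gpu_uuid,
        'env': {k: v for k, v in os.environ.items() if k.startswith('CUDA_')},
    }
    with socket.create_connection((coord_host, coord_port)) as conn:
        conn.sendall(json.dumps(info).encode('utf-8'))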

Launching MPI Processes

The MPI wrapper is an application that can be deployed on any web server that supports Java EE applications. The wrapper exposes REST endpoints for:


• Starting an MPI process: given a valid YARN application id and a program to run, it will start the MPI process and return its process id.

• Stopping a running MPI process: given a valid YARN application id and MPI process id, it will stop the running MPI process.

• Getting the status of an MPI process: given a valid YARN application id and MPI process id, it will return the current status of the job (Running or Stopped).

• Getting the log of an MPI process: given a valid YARN application id and MPI process id, it will return the stdout and stderr of the MPI process.

• Getting all running MPI processes: given a sufficient access level (Administrator), it will return all running MPI jobs.

To access any of the endpoints listed above, a user needs to have a valid authentication token or session key. This is used to identify the user making the request and to do any necessary access control. In addition, the YARN application id is used to check whether the authenticated user has access to the application she/he is trying to run an MPI process for.

When a start-MPI-process request is received and all the access control checks have passed:

1. An mpirun script similar to the one in Listing 3 is constructed and executed.

2. The process id of the mpirun process is saved in a database along with the application id of the Spark job and the resources allocated to the application (i.e. CPU cores, GPUs, and memory).

3. A status OK with the process id is sent back as a response.

The Spark driver can subsequently query the MPI wrapper for the status of the running MPI process and retrieve logs using the process id and application id. If the Spark job is stopped properly, it can call the stop-MPI-process endpoint before exiting.
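The interaction between the Spark driver and the MPI wrapper can then be pictured roughly as below; the base URL, endpoint paths, and payload fields are hypothetical stand-ins for the prototype's actual REST API:

import requests

WRAPPER = 'https://hops-gateway.example.com/mpi-wrapper'  # hypothetical base URL
HEADERS = {'Authorization': 'Bearer <auth token>'}        # user must be authenticated
APP_ID = 'application_1525000000000_0042'                 # hypothetical YARN app id

# Start an MPI job for the YARN application and remember its mpirun pid.
start = requests.post(WRAPPER + '/start', headers=HEADERS,
                      json={'appId': APP_ID, 'program': 'main.py'}).json()
mpi_pid = start['pid']

# Poll status and fetch logs while the job is running.
status = requests.get(WRAPPER + '/status/%s' % mpi_pid,
                      headers=HEADERS, params={'appId': APP_ID}).json()
logs = requests.get(WRAPPER + '/logs/%s' % mpi_pid,
                    headers=HEADERS, params={'appId': APP_ID}).text

# Stop the MPI process explicitly when the Spark job shuts down cleanly.
requests.post(WRAPPER + '/stop/%s' % mpi_pid,
              headers=HEADERS, json={'appId': APP_ID})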


Monitoring MPI Processes

Monitoring the MPI process is necessary for two reasons: (1) because the MPI process runs outside a managed container, it needs to be monitored for malicious activities, and (2) if the Spark job is terminated suddenly without a proper cleanup, stray MPI processes will remain in the system consuming resources.

The monitoring thread is a timer Enterprise JavaBean (EJB) that runs periodically to check process resource usage and clean up stray MPI processes. On every run the timer thread:

1. reads all entries in the table containing the appid and pid of the MPI jobs. The appid is used to query YARN for the status of the application. If the application is in any of the states [FINISHED, FAILED, KILLED], the MPI process is killed, if still running, and the row is removed from the table.

2. for each row remaining in the table, the usage is compared with the assigned resources stored in the table. If the usage exceeds the assigned resources, the process is killed and the user is added to a table that can be used to block users.

The process of starting and monitoring MPI processes is shown in Figure 3.2.
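In Python-like form, one pass of that timer thread looks roughly as follows (the YARN and database helpers are passed in as callables, since their concrete implementations live in the Java EE wrapper and are not shown here):

import subprocess

END_STATES = {'FINISHED', 'FAILED', 'KILLED'}

def kill_if_exists(pid):
    """Best-effort kill of an mpirun process that may already have exited."""
    subprocess.call(['kill', '-9', str(pid)])

def cleanup_mpi_jobs(rows, get_app_state, check_usage, report_abuse, remove_row):
    """One pass of the monitoring timer over the MPI jobs table (cf. Figure 3.2)."""
    for row in list(rows):
        if get_app_state(row['appid']) in END_STATES:
            # The owning Spark application is gone: clean up the stray mpirun.
            kill_if_exists(row['mpi_pid'])
            remove_row(row)
        else:
            usage = check_usage(row['mpi_pid'])
            if usage > row['allocated']:
                # Process exceeds its allocation: kill it and flag the user.
                kill_if_exists(row['mpi_pid'])
                report_abuse(row['user'])
                remove_row(row)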

3.4.3 Evaluation of Model Deployment

The distributed deep learning model deployment method for ring all-reduce needs to fulfill three requirements to run on the hops-hadoop cluster:

1. The user should have access to program output.

2. The resource allocation should be managed by YARN.

3. Any process created by a user should be isolated and monitored.

These requirements need to be fulfilled by every user-submitted program on the hops-hadoop cluster.


Data: MPI jobs table
while rows in jobs table do
    read current row;
    endStates ← [FINISHED, FAILED, KILLED];
    if getAppState(row.appid) in endStates then
        killProcessIfExist(row.mpiPid);
        removeRow(row);
    else
        usage ← checkSystemUsage(row.mpiPid);
        if usage > row.allocated then
            killProcessIfExist(row.mpiPid);
            reportSystemAbuse(row.user);
            removeRow(row);
        end
    end
end

Figure 3.2: Starting and monitoring MPI processes using the MPI wrapper module. The Spark driver starts the executors on the cluster nodes (node1-node4), then calls the MPI wrapper, which launches the mpirun process, records it together with the Spark application id in a database, and periodically runs the monitoring loop shown above.

3.5 Data Collection

A run with a given number of batch iterations, mini-batch size, number of workers, number of parameter servers (only for TFoS), and a given model constitutes a single experiment. Every experiment was run three to five times and the average of the runs is taken as the result of the experiment. The cluster used for the experiments was isolated to remove any interference from other processes. This made the number of experiments needed to get a result with low variance manageable. The experiments were performed using the TensorFlow benchmarks [44] with the setup given in Table 3.2.


Table 3.2: Experiment setup

                        Parameter server (TFoS)              Ring all-reduce (Horovod)
                        Sync Update       Async Update
# workers               3 - 10            3 - 10             3 - 20
# parameter servers     1, 2              1                  N/A
Batch size              32/worker         32/worker          32/worker
ImageNet Dataset        synthetic         synthetic          synthetic
# batches               100-200           100-200            100-200


Chapter 4

Results

In this chapter the results of the experiments run with the two distributed deep learning architectures, parameter server and ring all-reduce, are presented. Furthermore, the model deployment system discussed in section 3.4.2 (Ring All-Reduce with Horovod) is assessed on the evaluation criteria given in section 3.4.3.

4.1 Parameter Server

For the parameter server experiments with TFoS, runs with one and two parameter servers were performed, with workers ranging between 3-10 and 4-10 respectively. The experiments were limited to 10 workers because distributed TensorFlow (version 1.7) only supports RDMA over Converged Ethernet (RoCE), which is not available on the cluster used for the experiments. However, some system monitoring results from a run on two nodes are reported in section 4.1.3 to show network and system usage.

4.1.1 Synchronous Update

Synchronous update, as discussed in section 2.4.1, requires all workers to update a global model before gradients can be broadcast to all workers. In the experiments with synchronous updating, one and two parameter servers were used to investigate the effect of the extra network bandwidth. Figure 4.1 shows images per second processed by different models with one and two parameter servers.


Here, the runs with two parameter servers showed slightly better performance than the runs with a single parameter server.

4.1.2 Asynchronous Update

The two models that scaled best in the previous experiment were used for comparison in the asynchronous gradient update experiments. Figure 4.2 shows images per second processed by the inception4 and resnet50 models with one parameter server in asynchronous parameter update mode. The asynchronous parameter update mode showed no significant performance improvement over the synchronous parameter update mode.

4.1.3 Multi-Node Synchronous Update

A multi-node experiment with four workers and a single parameter server is presented in this section to show the network bottleneck that prevented the parameter server experiments from using two nodes. In this experiment two workers were placed on one node and two more workers plus the parameter server on the other. The system usage of this experiment is shown in Figure 4.3. This run did not complete and had to be killed after running for 36 min (as can be seen on the x-axis of Figure 4.3). The main thing to notice in Figure 4.3 is the network I/O, which clearly shows the 1Gb/s network bottleneck.

All multi-node parameter server runs failed to complete. The remaining parameter server experiments were therefore forced to run on one node by decommissioning the other node (i.e. stopping the NodeManager on that node), which forced YARN to schedule all jobs on the remaining node.

4.2 Ring All-Reduce

The ring all-reduce (Horovod) experiments were deployed using an MPI script similar to the one shown in Listing 3, where workers 3 to 10 were placed on the first node and workers 11 to 20 on the second node. The ring formed on a single node has peer-to-peer connections between all workers, while the ring with 11 or more workers needs to traverse an InfiniBand interconnect (NET/IB), as can be seen in Figure 3.1.


Figure 4.1: Number of images processed per second (y-axis) versus number of GPUs (x-axis) for different models (vgg19, vgg16, resnet152v2, inception4, resnet50) in synchronous gradient update mode. (a) Results using one parameter server and (b) results using two parameter servers.


Figure 4.2: Number of images processed per second (y-axis) versus number of GPUs (x-axis) for the inception4 and resnet50 models with one parameter server in asynchronous gradient update mode.

The results from the ring all-reduce (Horovod) experiments are shown in Figure 4.4. All models except vgg16 and vgg19 show an improvement with each added worker, while the number of images processed per second for the same number of workers has almost doubled compared to parameter server.

4.3 Scalability

The speedup in the number of images per second processed with each added computational resource (GPU) is used here to show the scalability of the algorithms. Figure 4.5 shows the speedup in parameter server mode compared to the ideal scaling shown in green: Figure 4.5 (a) shows one parameter server with synchronous update, (b) one parameter server with asynchronous update, and (c) two parameter servers with synchronous update. The speedups for the ring all-reduce (Horovod) experiments are shown in Figure 4.6.


Figure 4.3: System usage (CPU usage in %, network I/O in MB, and GPU utilization in % over time) in the multi-node parameter server run. (a) Machine 1, running two workers; (b) Machine 2, running the parameter server along with two workers. The network usage in (a) shows how the messages from the two workers to the parameter server are choking the 1Gb/s Ethernet connection.


Figure 4.4: Number of images processed per second (y-axis) versus number of GPUs (x-axis, 1-20) for different models (vgg19, vgg16, inception4, resnet152v2, resnet50) with ring all-reduce (Horovod).

4.4 Resource Utilization

Resource utilization is directly affected by the time spent on communication, as can be seen in Figures 4.7 and 4.8. Figure 4.7 (a) shows GPU usage close to 100% for the entire run with ring all-reduce, while (b) shows a usage that bounces up and down with the network I/O. A similar trend is seen in Figure 4.8, where the GPU utilization increases when the network I/O is low.

4.5 Model Deployment

The prototype built for ring all-reduce model deployment fulfilled two of the requirements, but the third required adding an extra service. The extra service monitors the activity of the user program and performs cleanup of stray processes. The isolation requirement was met by running all processes as a system user with limited privileges.


Figure 4.5: Speedup of model training (y-axis) with each added worker (GPU, x-axis) in parameter server mode, compared to the ideal scaling. (a) Synchronous update mode with one parameter server, (b) asynchronous update mode with one parameter server, and (c) synchronous update mode with two parameter servers. Speedup values of zero are missing data where no experiment was done for that number of workers.


Figure 4.6: Speedup of model training (y-axis) with each added worker (GPU, x-axis, 1-20) with ring all-reduce (Horovod), compared to the ideal scaling.


Figure 4.7: System usage (CPU usage in %, network I/O in MB, and GPU utilization in % over time) comparison of (a) ring all-reduce and (b) parameter server on inception4 with 10 workers.


Figure 4.8: System usage (CPU usage in %, network I/O in MB, and GPU utilization in % over time) for (a) one and (b) two parameter servers on inception4 with 10 workers.


Chapter 5

Discussion and Conclusion

Training deep learning models requires big data, making it computationally intensive and time consuming. Distributing computation over multiple devices helps reduce the time needed to train deep learning models. In this paper two widely used distributed deep learning training algorithms are analyzed with respect to scalability, fault tolerance, resource utilization, and ease of deployment in a shared hadoop cluster environment.

The main focus of the study was data parallelism. Experiments with the parameter server and ring all-reduce algorithms, using the TensorFlowOnSpark and Horovod frameworks, were performed on a cluster with two nodes and 20 GPUs. The goal was to find limitations and bottlenecks in the algorithms when deployed on a hadoop cluster. A method for deploying distributed training using MPI was also explored, and a prototype was developed and tested.

In this chapter the results presented in the previous chapter are analyzed and discussed, and conclusions are given on the feasibility of running distributed deep learning algorithms as-a-service in a shared cluster environment.

5.1 Scalability

From the results shown in Figures 4.5 and 4.6 it is easy to see that ring all-reduce (Horovod) scales better on all models. As discussed in section 2.3, ring all-reduce is a bandwidth-optimal algorithm, and this played a role in its scalability. Figure 4.7 shows the CPU usage, network I/O, and GPU utilization of inception4 with 10 workers for ring all-reduce and parameter server.


In the case of ring all-reduce the workers have a high-bandwidth peer-to-peer PCIe link, so the network I/O is not relevant. Parameter server, on the other hand, uses gRPC, and the network I/O shows how much data is being transferred between the parameter server and the workers. Another noteworthy aspect of Figure 4.7 is the GPU utilization. In ring all-reduce the utilization stays at 100% for the majority of the execution, while in parameter server it bounces up and down with the network I/O. This also suggests that parameter server spends much more time on communication, during which the GPUs are not in use.

These experiments were only done using synthetic data. If real data were used, the GPU utilization of the ring all-reduce algorithm would have been lower. When all workers try to read data from disk, a bottleneck similar to the one in the parameter server model would be created on the disk I/O. By overlapping network and disk I/O, parameter server can probably maintain similar performance, which is not possible with ring all-reduce.

5.1.1 Number of Parameter Servers

The difference in speedup between one and two parameter servers is not that significant, as can be seen in Figures 4.1 and 4.5, and can also be explained by the extra network bandwidth available for workers to communicate with the parameter servers. Figure 4.8 shows a clear difference in the network I/O (in MB total) between one and two parameter servers with ten workers. The one-parameter-server run (Figure 4.8 a) has network I/O that is slightly higher than the two-parameter-server run (Figure 4.8 b) and that lasted for the entire run. The two-parameter-server run shows wider dips with no network I/O, which suggests that communication is taking less time.

The results here only show that there is a performance gain to be made by adding parameter servers, but more experiments are needed to establish a good ratio of parameter servers to workers. The added CPU core on the second parameter server might also account for some of the performance gain.


5.1.2 Asynchronous Update

Asynchronous update did not show any speedup in performance compared to synchronous update. For this experiment inception4 and resnet50 were used, and the results are shown in Figures 4.2 and 4.5b. One reason for the asynchronous update performing similarly to the synchronous one can be the fact that all workers were started at the same time and have the same computational resources. Thus, even without any synchronization constraint, all workers will finish and try to update the parameter server at the same time.

5.1.3 Ring All-Reduce

The ring all-reduce experiments scaled almost linearly on some models, but did not do as well for the bigger models like vgg16 and vgg19. The results for the ring all-reduce experiments are shown in Figures 4.4 and 4.6. There are two dips in the graph of Figure 4.6, when moving from 10 to 11 and from 5 to 6 workers (on the x-axis). The first is caused by the InfiniBand network in the ring formed across machines, and it is more visible on the bigger models vgg16 and vgg19. The second, between 5 and 6 on the x-axis, is only visible on the big models and is caused by the PCIe Host Bridge (typically the CPU) between GPUs in different groups (see section 3.2.1 for the GPU placement). As discussed in section 2.4.4, the speed of the ring all-reduce algorithm is limited by the slowest communication link between neighbors, which can be seen clearly in this experiment. This shows that for big models even a 40Gb/s (32Gb/s effective bandwidth) link can be inadequate. Figure 5.1 shows the data sent via InfiniBand for inception4, which was least affected by the multi-node ring, and vgg16, which was affected the most. For vgg16, between 1000MB and 1700MB of total traffic, both in and out, is registered every second for the entire run.

This is a clear sign that the processing power of the GPUs substantially outstrips the capabilities of the network I/O (InfiniBand + PCIe) for big models (i.e. models with a large number of parameters compared to the number of operations).
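A back-of-envelope check supports this, assuming 32-bit gradients and the standard ring all-reduce cost of roughly 2(N-1)/N times the model size sent (and received) by each worker per step; the parameter counts below are approximate published figures for these models:

def ring_allreduce_mb_per_step(num_params, num_workers, bytes_per_param=4):
    """Approximate megabytes each worker sends (and receives) per training step."""
    model_bytes = num_params * bytes_per_param
    return 2.0 * (num_workers - 1) / num_workers * model_bytes / (1024 ** 2)

for name, params in [('vgg16', 138 * 10**6), ('inception4', 43 * 10**6)]:
    print('%s: ~%.0f MB per worker per step (11-worker ring)'
          % (name, ring_allreduce_mb_per_step(params, num_workers=11)))
# vgg16 comes out around 950 MB per step versus roughly 300 MB for inception4,
# the same order of magnitude as the 1000-1700 MB/s seen for vgg16 in Figure 5.1.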


Figure 5.1: InfiniBand usage (in MB over time) for inception4 and vgg16 with 11 workers using the ring all-reduce algorithm (Horovod). (a) Machine 1, running workers 1 to 10, and (b) Machine 2, running worker 11.


5.2 Fault Tolerance

Both ring all-reduce and parameter server with synchronous update are not fault tolerant, i.e. if one worker fails the job cannot recover. On the other hand, parameter server with asynchronous update can cope with the loss of a worker because workers update model gradients independently. This makes it more suitable for a dynamic and shared cluster environment where workers can be killed or die due to resource starvation.

Most implementations of deep learning training checkpoint progress and can recover from a checkpoint after failure. YARN and Spark both have support for re-submitting failed jobs by setting a property that controls the number of retries when starting an application. So by checkpointing training progress and using the retry capability of YARN and Spark, some level of fault tolerance can be achieved even when synchronous training methods are used.
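As an illustration, in TensorFlow 1.x this checkpoint-and-recover pattern is commonly obtained with tf.train.MonitoredTrainingSession, which saves to and restores from a checkpoint directory automatically; combined with a retry setting such as spark.yarn.maxAppAttempts, a re-submitted job continues roughly where the failed attempt stopped (a minimal sketch with a stand-in training op):

import tensorflow as tf  # TensorFlow 1.x API

# Stand-in training op: real model and optimizer code would go here.
step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(step, 1)

# On (re)start the session restores the latest checkpoint found in checkpoint_dir.
with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/train_ckpt',
                                       save_checkpoint_secs=60) as sess:
    while not sess.should_stop():
        current_step = sess.run(train_op)
        if current_step >= 1000:  # stop condition instead of a StopAtStepHook
            break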

5.3 Resource Utilization

GPUs are a valuable resource on a cluster, but cannot be shared by multiple jobs; even when a GPU is underutilized, the job that reserved it might be using all available memory on the GPU. This makes underutilized GPUs wasted computational resources. As Figures 4.7 and 4.8 show, the available communication bandwidth affects the GPU utilization considerably, and should thus be the main consideration when training is distributed.

However, the communication bandwidth required for optimal utilization of the GPUs is not only determined by the size of the model to be trained. Other factors, like the number of floating point operations needed to compute the gradients and the computational power of the GPUs, also determine the amount of data communicated and consequently the bandwidth required. Therefore, when configuring a cluster for distributed training, the model parameter size and floating point operations, the GPU power, the network bandwidth, and the disk I/O need to be examined to maximize GPU utilization.


5.4 Model Deployment

The MPI wrapper prototype built and used to deploy deep learning training with Horovod works for the purpose it was built for, but it adds an extra point of failure and an extra service to maintain in a system that already contains many services and is complicated. Being able to run MPI processes inside a YARN container would eliminate the need for an extra system to monitor MPI processes. It would also give stronger process isolation.

5.5 Possible Sources of Error

The results reported in this paper do not account for all factors that can affect the performance of distributed training. Disk I/O is one such factor that was not considered, but it can have a considerable impact on the scalability of the architectures. In particular, disk I/O will have more impact on ring all-reduce, mainly because there are no other considerable bottlenecks in that architecture.

To eliminate any interference from other processes, all experiments were performed in an isolated cluster environment. This helped in getting results with low variance in fewer experiments (all results with standard deviation can be seen in Appendix A). However, it also made the results less representative of a real-world dynamic cluster environment.

5.6 Conclusion

Distributing deep learning training did improve performance in most of the experiments done for this work. Although there is a clear difference in performance between the algorithms, no one method is better than the others in all aspects. Ring all-reduce scales best in all tests, but parameter server is easier to deploy in a hadoop cluster, and parameter server with asynchronous update mode is more fault tolerant and thus well suited for a dynamic environment.

The choice of distributed deep learning algorithm should take into consideration:

• The size of the model to be trained and the resources available.


For example, to train a big model like vgg16 the network bandwidth, PCIe bus speed, and GPU placement need to be considered, but for a smaller model we might only need to consider the network bandwidth.

• The trade-off between scalability and fault tolerance. If the cluster in use is not shared, fault tolerance might not be as important to consider, and we can choose a scalable algorithm like ring all-reduce. On the other hand, in a shared dynamic environment workers might run slowly or even die because of resource starvation or network congestion. In these situations fault tolerance may be more desirable.

• When it comes to deployment, both algorithms use a well-known framework (Spark) that is relatively easy to use, but ring all-reduce involves extra services like MPI and the MPI wrapper. This might require additional effort for installation and maintenance.

These conclusions are based on the limited number of experiments that were performed on a single cluster configuration. Furthermore, not all variables that can affect the performance were included in the experiments (e.g. disk I/O). Despite these limitations, this work contributes to identifying possible bottlenecks and pointing out considerations to be made when training is distributed.

5.7 Future Work

This work only explored the effects of communication on distributed deep learning training; the effects of disk I/O have not been considered. In the future, disk I/O with different disk types, distributed file systems, and reading methods like Spark or TensorFlow can be investigated.

The ratio of parameter servers to workers is something that needs more experiments but was not considered in this work. It was shown in this paper that adding parameter servers can improve performance. As future work, this ratio can be studied for different network configurations.

Parameter server with asynchronous update was only tested with workers running on devices with the same computational power and that were started simultaneously.


It would be interesting to investigate the performance improvement if workers were to run on devices with different computational power and/or were started out of sync.

The experiments did not include model parallelism, which is interesting for big models that cannot fit in the memory of a single GPU. The effects of communication, disk I/O, device placement, and scheduling on model parallelism would be interesting to look into in the future.

Model deployment for MPI processes uses a simple resource monitoring on top of an existing, well-tested YARN resource manager. Thus, a method that allows MPI processes to run inside YARN containers is worth investigating.

Finally, by performing similar experiments on different hardware setups, clear guidelines can be developed. These guidelines could be used to predict the expected speedup and resource utilization of a distributed deep learning training job given as input a hardware setup, model size, and the operational complexity of a model.


Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems". In: Computing Research Repository abs/1603.04467 (2016). arXiv: 1603.04467. URL: http://arxiv.org/abs/1603.04467.

[2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J.Smola. “Scalable Inference in Latent Variable Models”. In: Pro-ceedings of the Fifth ACM International Conference on Web Searchand Data Mining. WSDM ’12. Seattle, Washington, USA: ACM,2012, pp. 123–132. ISBN: 978-1-4503-0747-5. DOI: 10.1145/2124295.2124312. URL: http://doi.acm.org/10.1145/2124295.2124312.

[3] T. Akiba, S. Suzuki, and K. Fukuda. “Extremely Large MinibatchSGD: Training ResNet-50 on ImageNet in 15 Minutes”. In: Com-puting Research Repository abs/1711.04325 (2017). arXiv: 1711.04325. URL: http://arxiv.org/abs/1711.04325.

[4] G. M. Amdahl. “Validity of the Single Processor Approach toAchieving Large Scale Computing Capabilities”. In: Proceedingsof the April 18-20, 1967, Spring Joint Computer Conference. AFIPS’67 (Spring). Atlantic City, New Jersey: ACM, 1967, pp. 483–485.DOI: 10.1145/1465482.1465560. URL: http://doi.acm.org/10.1145/1465482.1465560.


[5] R. Andersson. “GPU integration for Deep Learning on YARN”.MA thesis. KTH, School of Information and Communication Tech-nology (ICT), 2017.

[6] B. Baker, O. Gupta, N. Naik, and R. Raskar. “Designing NeuralNetwork Architectures using Reinforcement Learning”. In: Com-puting Research Repository abs/1611.02167 (2016). arXiv: 1611.02167. URL: http://arxiv.org/abs/1611.02167.

[7] L. Bottou. “Stochastic gradient learning in neural networks”. In:Proceedings of Neuro-Nımes 91.8 (1991), p. 0.

[8] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. “ProjectAdam: Building an Efficient and Scalable Deep Learning Train-ing System”. In: 11th USENIX Symposium on Operating SystemsDesign and Implementation (OSDI 14). Broomfield, CO: USENIXAssociation, 2014, pp. 571–582. ISBN: 978-1-931971-16-4. URL: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chilimbi.

[9] F. Chollet. keras. https://github.com/fchollet/keras.2015.

[10] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhu-ber. “Deep Big Simple Neural Nets Excel on Handwritten DigitRecognition”. In: Computing Research Repository abs/1003.0358(2010). arXiv: 1003.0358. URL: http://arxiv.org/abs/1003.0358.

[11] A. Coates, A. Ng, and H. Lee. “An analysis of single-layer net-works in unsupervised feature learning”. In: Proceedings of thefourteenth international conference on artificial intelligence and statis-tics. 2011, pp. 215–223.

[12] R. Collobert, K. Kavukcuoglu, and C. Farabet. “Torch7: A Matlab-like Environment for Machine Learning”. In: BigLearn, NIPS Work-shop. 2011, pp. 1–6. URL: http://publications.idiap.ch/downloads/papers/2011/Collobert_NIPSWORKSHOP_2011.pdf.

[13] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. "Large Scale Distributed Deep Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012, pp. 1223–1231. URL: http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf.

[14] J. Dean and S. Ghemawat. “MapReduce: Simplified Data Pro-cessing on Large Clusters”. In: Commun. ACM 51.1 (Jan. 2008),pp. 107–113. ISSN: 0001-0782. DOI: 10.1145/1327452.1327492.URL: http://doi.acm.org/10.1145/1327452.1327492.

[15] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J.M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine,R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall.“Open MPI: Goals, Concept, and Design of a Next GenerationMPI Implementation”. In: Proceedings, 11th European PVM/MPIUsers’ Group Meeting. Budapest, Hungary, 2004, pp. 97–104.

[16] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski,A. Kyrola, A. Tulloch, Y. Jia, and K. He. “Accurate, Large Mini-batch SGD: Training ImageNet in 1 Hour”. In: Computing Re-search Repository abs/1706.02677 (2017). arXiv: 1706.02677. URL:http://arxiv.org/abs/1706.02677.

[17] S. Gupta, W. Zhang, and F. Wang. “Model Accuracy and Run-time Tradeoff in Distributed Deep Learning: A Systematic Study”.In: Proceedings of the 26th International Joint Conference on ArtificialIntelligence. IJCAI’17. Melbourne, Australia: AAAI Press, 2017,pp. 4854–4858. ISBN: 978-0-9992411-0-3. URL: http://dl.acm.org/citation.cfm?id=3171837.3171972.

[18] A. Halevy, P. Norvig, and F. Pereira. “The Unreasonable Effec-tiveness of Data”. In: IEEE Intelligent Systems 24.2 (2009), pp. 8–12. ISSN: 1541-1672. DOI: 10.1109/MIS.2009.36.

[19] K. He, X. Zhang, S. Ren, and J. Sun. “Deep Residual Learning forImage Recognition”. In: Computing Research Repository abs/1512.03385(2015). arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.

[20] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. A. Gib-son, G. R. Ganger, and E. P. Xing. “More Effective DistributedML via a Stale Synchronous Parallel Parameter Server”. In: Pro-ceedings of the 26th International Conference on Neural InformationProcessing Systems - Volume 1. NIPS’13. Lake Tahoe, Nevada: Cur-ran Associates Inc., 2013, pp. 1223–1231. URL: http://dl.acm.org/citation.cfm?id=2999611.2999748.


[21] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger. Curran Associates, Inc., 2013, pp. 1223–1231. URL: http://papers.nips.cc/paper/4894-more-effective-distributed-ml-via-a-stale-synchronous-parallel-parameter-server.pdf.

[22] K. Hornik. “Approximation capabilities of multilayer feedfor-ward networks”. In: Neural Networks 4.2 (1991), pp. 251 –257.ISSN: 0893-6080. DOI: https://doi.org/10.1016/0893-6080(91)90009-T. URL: http://www.sciencedirect.com/science/article/pii/089360809190009T.

[23] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Don-ahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan,C. Fernando, and K. Kavukcuoglu. “Population Based Trainingof Neural Networks”. In: Computing Research Repository abs/1711.09846(2017). arXiv: 1711.09846. URL: http://arxiv.org/abs/1711.09846.

[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Gir-shick, S. Guadarrama, and T. Darrell. “Caffe: Convolutional Ar-chitecture for Fast Feature Embedding”. In: Computing ResearchRepository abs/1408.5093 (2014). arXiv: 1408.5093. URL: http://arxiv.org/abs/1408.5093.

[25] A. Krizhevsky. “One weird trick for parallelizing convolutionalneural networks”. In: Computing Research Repository abs/1404.5997(2014). arXiv: 1404.5997. URL: http://arxiv.org/abs/1404.5997.

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012, pp. 1097–1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.


[27] Q. V. Le, R. Monga, M. Devin, G. Corrado, K. Chen, M. Ran-zato, J. Dean, and A. Y. Ng. “Building high-level features us-ing large scale unsupervised learning”. In: Computing ResearchRepository abs/1112.6209 (2011). arXiv: 1112.6209. URL: http://arxiv.org/abs/1112.6209.

[28] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y.Ng. “On optimization methods for deep learning”. In: Proceed-ings of the 28th International Conference on International Conferenceon Machine Learning. Omnipress. 2011, pp. 265–272.

[29] Y. Lee, S. Jun, C. Bobbie, F. Andy, and Yahoo Big ML team. TensorFlowOnSpark. https://github.com/yahoo/TensorFlowOnSpark. 2016.

[30] M. Li, D. G. Anderson, J. W. Park, A. J. Smola, A. Ahmed, V. Josi-fovski, J. Long, E. J. Shekita, and B.-Y. Su. “Scaling DistributedMachine Learning with the Parameter Server”. In: Operating Sys-tems Design and Implementation (OSDI). 2014, pp. 583–598.

[31] H. Ma, F. Mao, and G. W. Taylor. “Theano-MPI: a Theano-basedDistributed Training Framework”. In: Computing Research Repos-itory abs/1605.08325 (2016). arXiv: 1605.08325. URL: http://arxiv.org/abs/1605.08325.

[32] R. Mayer, C. Mayer, and L. Laich. “The TensorFlow Partitioningand Scheduling Problem: It’s the Critical Path!” In: ComputingResearch Repository abs/1711.01912 (2017). arXiv: 1711.01912.URL: http://arxiv.org/abs/1711.01912.

[33] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou,N. Kumar, M. Norouzi, S. Bengio, and J. Dean. “Device Place-ment Optimization with Reinforcement Learning”. In: Comput-ing Research Repository abs/1706.04972 (2017). arXiv: 1706.04972.URL: http://arxiv.org/abs/1706.04972.

[34] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao,Z. Luo, A. K. H. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng.“SINGA: A Distributed Deep Learning Platform”. In: ACM Mul-timedia. 2015.


[35] F. Pellegrini. “Distillating knowledge about SCOTCH”. In: Com-binatorial Scientific Computing. Ed. by U. Naumann, O. Schenk, H.D. Simon, and S. Toledo. Dagstuhl Seminar Proceedings 09061.Dagstuhl, Germany: Schloss Dagstuhl - Leibniz-Zentrum fuerInformatik, Germany, 2009. URL: http://drops.dagstuhl.de/opus/volltexte/2009/2091.

[36] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “LearningInternal Representations by Error Propagation”. In: Parallel Dis-tributed Processing: Explorations in the Microstructure of Cognition,Vol. 1 (1986), pp. 318–362. URL: http://dl.acm.org/citation.cfm?id=104279.104293.

[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg,and F. Li. “ImageNet Large Scale Visual Recognition Challenge”.In: Computing Research Repository abs/1409.0575 (2014). arXiv:1409.0575. URL: http://arxiv.org/abs/1409.0575.

[38] A. Sergeev and M. D. Balso. “Horovod: fast and easy distributeddeep learning in TensorFlow”. In: CoRR abs/1802.05799 (2018).arXiv: 1802.05799. URL: http://arxiv.org/abs/1802.05799.

[39] S. Shams, R. Platania, K. Lee, and S. J. Park. “Evaluation of DeepLearning Frameworks Over Different HPC Architectures”. In:2017 IEEE 37th International Conference on Distributed ComputingSystems (ICDCS). 2017, pp. 1389–1396. DOI: 10.1109/ICDCS.2017.259.

[40] S. Shi and X. Chu. “Performance Modeling and Evaluation ofDistributed Deep Learning Frameworks on GPUs”. In: Comput-ing Research Repository abs/1711.05979 (2017). arXiv: 1711.05979.URL: http://arxiv.org/abs/1711.05979.

[41] K. Simonyan and A. Zisserman. “Very Deep Convolutional Net-works for Large-Scale Image Recognition”. In: Computing Re-search Repository abs/1409.1556 (2014). arXiv: 1409.1556. URL:http://arxiv.org/abs/1409.1556.

[42] A. Smola and S. Narayanamurthy. “An Architecture for ParallelTopic Models”. In: Proc. VLDB Endow. 3.1-2 (Sept. 2010), pp. 703–710. ISSN: 2150-8097. DOI: 10.14778/1920841.1920931. URL:http://dx.doi.org/10.14778/1920841.1920931.


[43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. “Re-thinking the Inception Architecture for Computer Vision”. In:CoRR abs/1512.00567 (2015). arXiv: 1512.00567. URL: http://arxiv.org/abs/1512.00567.

[44] TensorFlow benchmarks. https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks. 2018.

[45] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar,R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino,O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler. “ApacheHadoop YARN: Yet Another Resource Negotiator”. In: Proceed-ings of the 4th Annual Symposium on Cloud Computing. SOCC ’13.Santa Clara, California: ACM, 2013, 5:1–5:16. ISBN: 978-1-4503-2428-1. DOI: 10.1145/2523616.2523633. URL: http://doi.acm.org/10.1145/2523616.2523633.

[46] Welcome to ApacheTM Hadoop R©!. URL: http://hadoop.apache.org/ (visited on 01/30/2018).

[47] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave,X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi,J. Gonzalez, S. Shenker, and I. Stoica. “Apache Spark: A UnifiedEngine for Big Data Processing”. In: Commun. ACM 59.11 (Oct.2016), pp. 56–65. ISSN: 0001-0782. DOI: 10.1145/2934664. URL:http://doi.acm.org/10.1145/2934664.

[48] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z. Ma, and T. Liu.“Asynchronous Stochastic Gradient Descent with Delay Com-pensation for Distributed Deep Learning”. In: Computing ResearchRepository abs/1609.08326 (2016). arXiv: 1609.08326. URL: http://arxiv.org/abs/1609.08326.

[49] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. “ParallelizedStochastic Gradient Descent”. In: Proceedings of the 23rd Interna-tional Conference on Neural Information Processing Systems - Volume2. NIPS’10. Vancouver, British Columbia, Canada: Curran Asso-ciates Inc., 2010, pp. 2595–2603. URL: http://dl.acm.org/citation.cfm?id=2997046.2997185.

[50] L. Zou. Bringing HPC Techniques to Deep Learning. en-US. Feb.2017. URL: http://research.baidu.com/bringing-hpc-techniques-deep-learning/ (visited on 03/03/2018).


Appendix A

Complete Results

The average images/sec results shown in the histograms are presented here in tabular form together with the standard deviation. Models that were not included in the plots, because their high images/sec values made the plots hard to read, are also presented here.

Table A.1: Average images/sec of 3-5 runs with standard deviation for each model with parameter server in synchronous update mode (TFoS).

# workers   AlexNet          VGG16           VGG19           Resnet50        Resnet152       googlenet        inception4
            i/sec    std     i/sec   std     i/sec   std     i/sec   std     i/sec   std     i/sec    std     i/sec   std
1           1432.81  0.00    128.90  0.00    109.85  0.00    188.56  0.00    82.03   0.00    440.02   0.00    63.80   0.00
2           105.27   0.15    0.0     0.0     27.34   9.27    0.0     0.0     119.67  0.00    699.19   0.00    104.86  0.00
3           131.81   28.11   67.86   1.43    64.88   1.50    355.68  0.99    150.39  4.11    944.22   15.99   145.05  0.41
4           120.27   20.76   84.06   0.20    80.03   1.77    402.40  0.43    167.64  6.72    1123.82  28.62   178.42  0.34
5           136.27   22.35   95.79   0.09    91.41   1.47    432.12  3.67    179.95  8.67    1243.83  29.63   194.71  7.20
6           148.18   23.77   105.66  1.36    102.64  1.80    455.15  1.86    187.93  9.71    1306.78  45.10   211.75  8.94
7           155.85   24.44   107.85  1.71    106.27  3.59    461.55  5.68    196.67  11.76   1361.72  41.84   215.75  12.75
8           260.30   0.00    114.73  3.38    111.24  3.13    471.17  2.39    193.01  10.50   1393.88  52.32   227.42  11.03
9           153.94   23.26   116.74  1.66    114.03  1.63    457.71  6.06    185.67  8.43    1308.25  81.43   222.10  10.06
10          153.66   4.43    116.34  2.49    114.34  0.72    466.43  3.90    185.43  2.75    1370.13  7.31    221.94  1.66


Table A.2: Average Images/sec of 3-5 runs with standard deviation for each model with two parameter servers in synchronous update mode (TFoS).

# workers | AlexNet i/sec (std) | VGG16 i/sec (std) | VGG19 i/sec (std) | Resnet50 i/sec (std) | Resnet152 i/sec (std) | googlenet i/sec (std) | inception4 i/sec (std)
1  | 1432.81 (0.00) | 128.90 (0.00) | 109.85 (0.00) | 188.56 (0.00) | 82.03 (0.00)   | 440.02 (0.00)   | 63.80 (0.00)
3  | 164.08 (0.00)  | 0.0 (0.0)     | 64.80 (0.20)  | 0.0 (0.0)     | 178.81 (0.00)  | 1020.55 (0.00)  | 155.45 (0.00)
4  | 211.52 (2.84)  | 85.00 (1.08)  | 68.15 (3.24)  | 455.91 (3.13) | 194.71 (7.60)  | 1249.83 (0.00)  | 187.90 (1.58)
5  | 236.42 (4.22)  | 95.77 (1.29)  | 77.44 (4.11)  | 489.91 (4.18) | 207.11 (10.20) | 1343.26 (33.11) | 216.91 (3.06)
6  | 257.29 (3.79)  | 104.80 (0.61) | 85.99 (3.92)  | 512.80 (5.59) | 215.97 (10.74) | 1432.41 (32.33) | 238.84 (4.18)
7  | 266.63 (5.08)  | 110.76 (5.31) | 93.03 (3.56)  | 524.54 (2.43) | 217.64 (11.08) | 1482.79 (44.52) | 249.46 (7.26)
8  | 274.71 (3.98)  | 108.08 (0.71) | 105.37 (8.02) | 532.28 (0.59) | 221.90 (7.86)  | 1455.12 (67.11) | 263.29 (8.74)
9  | 0.0 (0.0)      | 96.72 (0.82)  | 114.39 (0.08) | 526.31 (5.90) | 216.96 (1.19)  | 1451.30 (3.70)  | 260.49 (0.31)
10 | 276.99 (4.77)  | 97.71 (0.02)  | 115.59 (0.09) | 529.85 (1.69) | 217.35 (0.17)  | 1458.97 (3.01)  | 264.38 (2.20)

Table A.3: Average Images/sec of 3-5 runs with standard deviation for each model with parameter servers in asynchronous update mode (TFoS).

# workers | AlexNet i/sec (std) | VGG16 i/sec (std) | Resnet50 i/sec (std) | inception4 i/sec (std)
1  | 1432.81 (0.00) | 128.90 (0.00) | 188.56 (0.00) | 63.80 (0.00)
3  | 166.77 (0.26)  | 70.71 (0.13)  | 349.65 (0.52) | 139.60 (0.10)
4  | 207.14 (0.30)  | 87.68 (0.18)  | 395.99 (0.88) | 172.60 (0.37)
5  | 217.33 (0.19)  | 96.33 (0.07)  | 449.37 (0.85) | 198.29 (0.30)
6  | 248.82 (0.30)  | 109.54 (0.02) | 455.33 (1.11) | 219.02 (0.35)
7  | 266.32 (0.24)  | 113.63 (0.11) | 470.66 (0.63) | 227.57 (0.36)
8  | 246.53 (0.19)  | 112.33 (0.15) | 463.88 (0.80) | 227.75 (0.29)
9  | 272.31 (0.33)  | 113.23 (0.22) | 481.72 (0.67) | 223.26 (0.36)
10 | 283.62 (0.27)  | 115.62 (0.25) | 484.82 (0.57) | 225.68 (0.23)


Table A.4: Average Images/sec of 3-5 runs with standard deviation for each model with ring all-reduce (horovod).

# workers | AlexNet i/sec (std) | VGG16 i/sec (std) | VGG19 i/sec (std) | Resnet50 i/sec (std) | Resnet152 i/sec (std) | googlenet i/sec (std) | inception4 i/sec (std)
1  | 1432.81 (0.00)   | 128.90 (0.00)  | 109.85 (0.00) | 188.56 (0.00)   | 82.03 (0.00)   | 440.02 (0.00)    | 63.80 (0.00)
2  | 0.0 (0.0)        | 0.0 (0.0)      | 0.0 (0.0)     | 0.0 (0.0)       | 0.0 (0.0)      | 0.0 (0.0)        | 0.0 (0.0)
3  | 2007.18 (1.89)   | 335.42 (0.41)  | 294.34 (0.62) | 504.38 (0.61)   | 224.90 (0.42)  | 1149.66 (6.22)   | 178.95 (0.25)
4  | 2414.63 (8.72)   | 439.02 (0.63)  | 387.88 (0.70) | 647.68 (0.71)   | 293.05 (0.50)  | 1497.01 (7.08)   | 231.73 (0.57)
5  | 3039.77 (5.68)   | 543.53 (0.95)  | 482.45 (0.61) | 784.26 (1.92)   | 358.87 (0.42)  | 1816.52 (4.24)   | 283.13 (0.32)
6  | 2717.42 (8.20)   | 499.80 (1.08)  | 436.19 (0.99) | 908.32 (2.95)   | 399.34 (0.63)  | 2113.93 (12.79)  | 330.07 (1.28)
7  | 3099.85 (5.31)   | 582.00 (1.60)  | 506.65 (1.70) | 1050.55 (1.62)  | 460.18 (0.56)  | 2481.44 (10.41)  | 383.24 (0.24)
8  | 3487.69 (4.58)   | 670.69 (1.16)  | 584.85 (2.46) | 1188.63 (1.98)  | 523.12 (1.93)  | 2826.91 (15.74)  | 433.26 (0.62)
9  | 3879.63 (13.60)  | 762.74 (0.61)  | 668.27 (0.47) | 1318.68 (4.28)  | 584.56 (0.87)  | 3125.01 (5.76)   | 481.81 (0.94)
10 | 4746.46 (168.26) | 848.39 (1.12)  | 751.36 (0.72) | 1443.92 (2.54)  | 639.42 (0.97)  | 3461.96 (14.83)  | 527.54 (1.05)
11 | 1234.02 (12.97)  | 459.20 (7.82)  | 453.44 (5.24) | 1379.67 (5.42)  | 650.78 (0.58)  | 3300.61 (47.66)  | 556.51 (2.84)
12 | 1276.70 (66.53)  | 497.18 (9.45)  | 489.14 (6.99) | 1520.42 (5.44)  | 709.47 (0.41)  | 3554.50 (31.63)  | 606.54 (2.02)
13 | 1434.08 (7.19)   | 535.77 (12.78) | 521.51 (8.08) | 1652.02 (6.88)  | 766.07 (1.53)  | 3705.32 (108.70) | 657.27 (1.30)
14 | 1485.50 (78.80)  | 587.30 (15.31) | 565.86 (3.83) | 1771.81 (3.32)  | 823.95 (3.07)  | 4052.64 (105.83) | 706.63 (2.52)
15 | 1616.53 (11.30)  | 618.42 (9.02)  | 587.61 (8.59) | 1890.97 (5.72)  | 875.53 (3.78)  | 4250.37 (116.31) | 753.88 (2.76)
16 | 1695.17 (18.24)  | 629.99 (8.87)  | 607.07 (4.36) | 2024.21 (4.60)  | 934.79 (2.33)  | 4489.39 (230.90) | 804.55 (2.91)
17 | 1805.59 (14.00)  | 640.09 (19.58) | 606.11 (6.86) | 2152.22 (3.95)  | 990.19 (1.73)  | 4826.58 (149.50) | 852.91 (1.67)
18 | 1915.58 (6.32)   | 672.51 (7.51)  | 625.13 (2.40) | 2278.12 (4.39)  | 1039.37 (3.50) | 5056.34 (156.32) | 899.75 (2.38)
19 | 1997.52 (6.99)   | 688.86 (6.19)  | 649.60 (3.51) | 2380.86 (5.29)  | 1095.00 (2.35) | 5257.19 (216.22) | 946.83 (1.63)
20 | 2397.04 (70.11)  | 725.06 (3.34)  | 668.09 (6.68) | 2491.28 (11.02) | 1145.63 (3.42) | 5718.07 (114.83) | 992.43 (2.98)
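To illustrate one way of reading these tables, the short sketch below computes speedup and scaling efficiency relative to the single-worker baseline. The two input values (188.56 and 1443.92 images/sec for Resnet50 with ring all-reduce at 1 and 10 workers) are taken directly from Table A.4; the helper function itself is only an illustrative example, not part of the benchmark tooling.

    def scaling_efficiency(single_worker_ips, n_workers, n_worker_ips):
        """Speedup and efficiency of an n-worker run vs. the 1-worker baseline."""
        speedup = n_worker_ips / single_worker_ips
        efficiency = speedup / n_workers
        return speedup, efficiency

    # Resnet50 with ring all-reduce (Table A.4):
    # 188.56 images/sec on 1 worker, 1443.92 images/sec on 10 workers.
    speedup, efficiency = scaling_efficiency(188.56, 10, 1443.92)
    print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.1%}")
    # roughly 7.66x speedup, i.e. about 77% of the ideal 10x

This kind of efficiency figure is what separates bandwidth-optimal communication schemes from many-to-one schemes in the comparison: the closer the value is to 100%, the less of the added hardware is lost to communication overhead.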
