
A General Distributed Deep Learning Platform

SINGA Team
September 4, 2015 @VLDB BOSS

Apache SINGA

Outline
• Part one
  – Background
  – SINGA
    • Overview
    • Programming Model
    • System Architecture
  – Experimental Study
  – Research Challenges
  – Conclusion
• Part two
  – Basic user guide
  – Advanced user guide


Background: Applications and Models

Source websites for each image can be found by clicking on it.

[Figure: model categories]
• Directed acyclic connection: Convolutional Neural Network (CNN), Multi-Layer Perceptron (MLP), Auto-Encoder (AE)
• Recurrent connection: Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM)
• Undirected connection: Restricted Boltzmann Machine (RBM), Deep Boltzmann Machine (DBM), Deep Belief Network (DBN)

[Figure: applications and the models used for them]
• Models: CNN, MLP, LSTM, Auto-Encoder, RBM
• Applications: image/video classification, acoustic modeling, natural language processing, machine translation, image caption generation, drug detection

Source websites for each image can be found by clicking on it.


Background: Parameter Training

• Objective function
  – Loss functions of deep learning models are (usually) non-linear and non-convex, so there is no closed-form solution.
• Mini-batch Stochastic Gradient Descent (SGD) (see the sketch below)
  – Compute gradients
  – Update parameters
  – Training to convergence can take a long time (e.g., 1 week)
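Below is a minimal NumPy sketch of one mini-batch SGD step for a toy linear model. It is only meant to make the compute-gradients / update-parameters loop concrete; it is not SINGA code, and the model, loss and learning rate are illustrative assumptions.

import numpy as np

def sgd_step(w, x_batch, y_batch, lr=0.01):
    # One mini-batch SGD step for a linear model with squared loss.
    pred = x_batch @ w                                          # forward pass
    grad = 2.0 * x_batch.T @ (pred - y_batch) / len(y_batch)    # compute gradients
    return w - lr * grad                                        # update parameters

# Toy usage: 10 mini-batches of 32 examples with 5 features each.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
w = np.zeros(5)
for _ in range(10):
    x = rng.normal(size=(32, 5))
    y = x @ true_w + rng.normal(scale=0.1, size=32)
    w = sgd_step(w, x, y)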


Background: Distributed Training Framework

• Synchronous frameworks
  – Sandblaster of Google Brain [1]
  – AllReduce in Baidu's DeepImage [2]
• Asynchronous frameworks
  – Downpour of Google Brain [1]
  – Distributed Hogwild in Caffe [4]

[Figure: topology diagrams (workers, servers, coordinator, worker groups) for each framework]

SINGA Design Goals

• General
  – Different categories of models (e.g., CNN, RNN, RBM, …)
  – Different training frameworks (e.g., Hogwild, AllReduce, …)
• Scalable
  – Scale to large models and training datasets
    • e.g., 1 billion parameters and 10M images
  – Model partition and data partition (see the sketch below)
• Easy to use
  – Simple programming model
  – Built-in models, Python binding, Web interface
  – Without much awareness of the underlying distributed platform
• Abstraction for extensibility, scalability and usability
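To make the partition goal concrete, here is a hedged NumPy sketch contrasting data partition and model partition for a single fully connected layer. The shapes and the 4-worker split are illustrative assumptions, not SINGA's implementation.

import numpy as np

# A fully connected layer computes: activations = batch @ W.
rng = np.random.default_rng(1)
batch = rng.normal(size=(256, 1024))    # mini-batch of 256 examples
W = rng.normal(size=(1024, 4096))       # layer parameters

# Data partition: each of 4 workers gets a slice of the mini-batch and a full
# replica of the parameters.
data_shards = np.array_split(batch, 4, axis=0)
partial_rows = [shard @ W for shard in data_shards]

# Model partition: each of 4 workers holds a slice of the parameters (columns
# of W) and computes the matching slice of every example's activations.
param_shards = np.array_split(W, 4, axis=1)
partial_cols = [batch @ shard for shard in param_shards]

# Stitching the pieces back together reproduces the full result either way.
assert np.allclose(np.vstack(partial_rows), np.hstack(partial_cols))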

Outline
• Part one
  – Background
  – SINGA
    • Overview
    • Programming Model
    • System Architecture
  – Experimental Study
  – Research Challenges
  – Conclusion
• Part two
  – Basic user guide
  – Advanced user guide


SINGA Overview

• Users (or programmers) submit a job configuration for 4 components.

Main Components

• ClusterTopology

• NeuralNet

• TrainOneBatch

• Updater


SINGA Overview

• Users (or programmers) submit a job configuration for 4 components.

• SINGA trains models over a worker-server architecture.
  ‒ Workers compute parameter gradients
  ‒ Servers perform parameter updates

Main Components

• ClusterTopology

• NeuralNet

• TrainOneBatch

• Updater
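The four components map naturally onto a job configuration. The snippet below is only an illustrative Python mock-up of such a configuration; SINGA's actual job configuration uses its own text format, and the field names here are assumptions for exposition.

# Illustrative mock-up of a SINGA-style job configuration (not real SINGA syntax).
job_config = {
    "cluster": {                 # ClusterTopology: how workers/servers are grouped
        "nworker_groups": 1,
        "nserver_groups": 1,
        "nworkers_per_group": 4,
        "nservers_per_group": 1,
    },
    "neuralnet": [               # NeuralNet: layers and their connections
        {"layer": "data"},
        {"layer": "convolution"},
        {"layer": "pooling"},
        {"layer": "softmax_loss"},
    ],
    "train_one_batch": {"algorithm": "BP"},              # TrainOneBatch: BP, BPTT, CD, ...
    "updater": {"type": "SGD", "learning_rate": 0.01},   # Updater: parameter update rule
}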


SINGA Programming Model

[Figures illustrating the programming model: NeuralNet, TrainOneBatch, Updater]

Note: BP, BPTT and CD have different data flows, so each is implemented as a separate TrainOneBatch variant (see the sketch below).
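As a rough illustration, the sketch below shows what a BP-style TrainOneBatch flow looks like conceptually: fetch fresh parameters, run a forward pass, run a backward pass, and hand parameter gradients to the servers for updating. The layer objects and the fetch_param/send_gradient helpers are hypothetical stand-ins, not SINGA's actual API; BPTT would unroll the net over time steps and CD would use Gibbs sampling instead.

def train_one_batch_bp(layers, batch, fetch_param, send_gradient):
    # Fetch fresh parameter values for every layer (from the servers).
    for layer in layers:
        layer.params = fetch_param(layer.name)

    # Forward pass: propagate the mini-batch through the net.
    data = batch
    for layer in layers:
        data = layer.forward(data)

    # Backward pass: propagate the loss gradient and ship each layer's
    # parameter gradient to the servers, which apply the Updater.
    grad = layers[-1].loss_gradient()
    for layer in reversed(layers):
        grad, param_grad = layer.backward(grad)
        send_gradient(layer.name, param_grad)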

Distributed Training

• Worker Group
  – Loads a subset of the training data and computes gradients for a model replica
  – Workers within a group run synchronously
  – Different worker groups run asynchronously
• Server Group (a toy ParamShard sketch follows)
  – Maintains one ParamShard
  – Handles requests from multiple worker groups for parameter updates
  – Synchronizes with neighboring groups
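A toy, in-memory sketch of what a server group's ParamShard conceptually does: keep one copy of each parameter and apply an update rule as gradient requests arrive from worker groups. Class and method names are illustrative assumptions, not SINGA code.

class ParamShard:
    # Toy in-memory parameter shard; one instance per server group.
    def __init__(self, params, lr=0.01):
        self.params = dict(params)   # parameter name -> array
        self.lr = lr

    def handle_update(self, name, gradient):
        # Apply the update rule (plain SGD here) for one parameter
        # on behalf of a requesting worker group.
        self.params[name] = self.params[name] - self.lr * gradient
        return self.params[name]     # reply with the fresh value

    def handle_get(self, name):
        return self.params[name]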

Synchronous Frameworks

• Configuration for AllReduce (Baidu's DeepImage)
  – 1 worker group
  – 1 server group
  – Co-locate workers and servers


Synchronous Frameworks

• Configuration for Sandblaster (see the sketch below)
  – Separate worker and server groups
  – 1 worker group
  – 1 server group
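A hedged sketch of how the two synchronous setups above differ, written as small Python dictionaries of cluster-topology settings. The field names are assumptions for exposition and do not reproduce SINGA's real configuration syntax.

# AllReduce-style (Baidu's DeepImage): one worker group, one server group,
# with workers and servers co-located in the same processes.
allreduce_cluster = {
    "nworker_groups": 1,
    "nserver_groups": 1,
    "server_worker_separate": False,
}

# Sandblaster-style (Google Brain): still one group of each kind,
# but the server group runs in processes separate from the workers.
sandblaster_cluster = {
    "nworker_groups": 1,
    "nserver_groups": 1,
    "server_worker_separate": True,
}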


Asynchronous Frameworks

• Configuration for Downpour
  – Separate worker and server groups
  – >=1 worker groups
  – 1 server group

Asynchronous Frameworks

• Configuration for distributed Hogwild (see the sketch below)
  – 1 server group per process
  – >=1 worker groups per process
  – Co-locate workers and servers
  – Share the ParamShard
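The same kind of illustrative mock-up for the two asynchronous setups; again the field names are assumptions, not SINGA's real configuration syntax.

# Downpour-style: several worker groups train asynchronously against
# a single, separate server group.
downpour_cluster = {
    "nworker_groups": 4,          # >= 1 asynchronous worker groups
    "nserver_groups": 1,
    "server_worker_separate": True,
}

# Distributed-Hogwild-style: each process hosts one server group and one or
# more worker groups, all sharing the same in-memory ParamShard.
hogwild_cluster = {
    "nworker_groups_per_process": 2,
    "nserver_groups_per_process": 1,
    "server_worker_separate": False,
    "share_param_shard": True,
}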


Outline
• Part one
  – Background
  – SINGA
    • Overview
    • Programming Model
    • System Architecture
  – Experimental Study
  – Research Challenges
  – Conclusion
• Part two
  – Basic user guide
  – Advanced user guide


Experiment: Training Performance (Synchronous)

• CIFAR10 dataset: 50K training and 10K test images
• Single-node setting: a 24-core server with 500GB memory
• "SINGA-dist": OpenBLAS multi-threading disabled, Sandblaster topology


Experiment: Training Performance (Synchronous)

• Multi-node setting: a 32-node cluster, each node with a quad-core Intel Xeon 3.1 GHz CPU and 8GB memory, AllReduce topology
• CIFAR10 dataset: 50K training and 10K test images
• Single-node setting: a 24-core server with 500GB memory
• "SINGA-dist": OpenBLAS multi-threading disabled, Sandblaster topology


Experiment: Training Performance (Asynchronous)

[Figure: training for 60,000 iterations, comparing Caffe on a single node with SINGA on a single node]

Experiment: Training Performance (Asynchronous)

[Figure: training for 60,000 iterations, comparing Caffe on a single node, SINGA on a single node, and SINGA on a cluster of 32 nodes]

Challenge: Scalability

• Problem
  – How to reduce the wall time to reach a certain accuracy by increasing the cluster size?
• Key factors
  – Efficiency of one training iteration, i.e., time per iteration
  – Convergence rate, i.e., number of training iterations
• Solutions
  – Efficiency
    • Distributed training over one worker group
    • Hardware, e.g., GPU
  – Convergence rate
    • Numeric optimization
    • Multiple groups
  – Overhead of distributed training
    • Communication would easily become the bottleneck
    • Parameter compression [8,9], elastic SGD [5], etc. (see the sketch below)
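As one concrete example of reducing communication overhead, here is a hedged NumPy sketch of 1-bit gradient quantization with error feedback in the spirit of [9]. It is a conceptual illustration only, not SINGA's or the cited paper's implementation.

import numpy as np

def one_bit_quantize(grad, residual):
    # Add back the error left over from the previous iteration (error feedback).
    adjusted = grad + residual
    scale = np.mean(np.abs(adjusted))    # one float per tensor
    signs = np.sign(adjusted)            # conceptually one bit per element
    reconstructed = scale * signs        # what the receiver will see
    new_residual = adjusted - reconstructed
    return signs, scale, new_residual

# Toy usage: the worker transmits only `signs` and `scale`; the receiver
# rebuilds an approximate gradient as scale * signs.
rng = np.random.default_rng(2)
residual = np.zeros(8)
grad = rng.normal(size=8)
signs, scale, residual = one_bit_quantize(grad, residual)
approx_grad = scale * signs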

Conclusion: Current status


Feature Plan


References
[1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, pages 1232–1240, 2012.
[2] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up image recognition. CoRR, abs/1501.02876, 2015.
[3] B. Recht, C. Re, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[5] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. CoRR, abs/1412.6651, 2014.
[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
[7] D. Jiang, G. Chen, B. C. Ooi, K. Tan, and S. Wu. epiC: An extensible and scalable system for processing big data. PVLDB, 7(7):541–552, 2014.
[8] M. Courbariaux, Y. Bengio, and J.-P. David. Low precision storage for deep learning. arXiv:1412.7024, 2014.
[9] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. 2014.
[10] C. Zhang and C. Re. DimmWitted: A study of main-memory statistical analytics. PVLDB, 7(12):1283–1294, 2014.
[11] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. 2015.