
Page 1

Feroz Zahid, Michael Riegler, and Tor Skeie
Simula Research Laboratory and University of Oslo

Distributed Deep Learning

Page 2

Myself

• Full professor, Department of Informatics, University of Oslo, Norway

• Adjunct research scientist, Simula Research Laboratory, Norway

• Fabriscale Technologies, CTO and co-founder

• Areas of expertise
  – HPC networking and management/middleware systems
  – Cloud computing and virtualization
  – Industrial Ethernet and wireless networking

Page 3

This presentation will introduce distributed deep learning, walk through prominent techniques, and identify existing challenges and future directions

Introduction and Motivation

Existing Techniques and Toolsets

Future Directions


Page 4

AI, machine learning, deep learning…


Page 5

What you can do…


Image classification

Text processing

Traffic classification

Predictions of characteristics

Climate prediction

Health care

many more….

Deep learning in Computer Vision, Fei-Fei Li et al., 2016

Page 6

Classification refers to identifying the category to which a new observation belongs, based on previously seen examples

Page 7

Classification refers to identifying the category to which a new observation belongs, based on previously seen examples

Typical ML flow: Input → Feature Extraction → Classification

Page 8

Deep learning is a machine learning technique in which features are identified and extracted directly from the dataset

Figure: A deep learning system learns features automatically from labelled training data (e.g., cats and dogs) and uses them for inference on new images

Page 9

Feature creation

• Deep learning automatically creates features from the data

• Features describe the data content in an abstract representation

• For example: facial features, edges, color distributions, time series, etc.

Page 10

Three pillars of deep learning


• Algorithms

• Data

• Computational power

Page 11

Supervised and unsupervised learning

• Supervised: labelled training data; the algorithm learns from the data

• Unsupervised: no labelled training data; the algorithm learns by itself

Page 12

Deep learning is predominantly based on artificial neural networks (ANNs). ANNs progressively learn using neurons arranged in many layers

Figure: A neural network with an input layer, hidden layers ("here the magic (maths) happens!"), and an output layer, connected by synapses

Page 13

In training neural networks, prediction errors are backpropagated to gradually tune the weights and biases toward values that predict better

Figure: A neuron with inputs $x_1, \dots, x_p$ and weights $w_1, \dots, w_p$ computes
$$f = \sum_{i=1}^{p} w_i x_i + b,$$
followed by a nonlinearity such as $\mathrm{ReLU}(f)$. The network's predictions $y_1, \dots, y_n$ are compared with the actual values $a_1, \dots, a_n$ by a cost function, and the weights are updated along the negative gradient $-\nabla_w$.

Image size = 100x100, Layers = 16x16, Output = 2 → 16322 parameters!
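Purely as an illustration of the computation on this slide, here is a minimal NumPy sketch of a single neuron (weighted sum plus bias, followed by ReLU) and one gradient-descent step on a squared-error cost; the function names, learning rate, and example values are illustrative assumptions, not anything taken from the slides.

```python
import numpy as np

def forward(w, x, b):
    # Single neuron: f = sum_i w_i * x_i + b, followed by ReLU
    f = np.dot(w, x) + b
    return f, max(f, 0.0)

def sgd_step(w, b, x, a, eta=0.01):
    # One gradient-descent step on a squared-error cost C = (y - a)^2
    f, y = forward(w, x, b)
    relu_grad = 1.0 if f > 0 else 0.0
    dC_df = 2.0 * (y - a) * relu_grad              # chain rule through ReLU
    return w - eta * dC_df * x, b - eta * dC_df    # step along the negative gradient

w, b = np.array([0.5, -0.3, 0.8]), 0.1
x, a = np.array([1.0, 2.0, 3.0]), 1.0              # input and target
for _ in range(10):
    w, b = sgd_step(w, b, x, a)
print(forward(w, x, b)[1])                         # prediction moves toward a = 1.0
```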

Page 14

Deep learning has greatly improved prediction accuracies in various fields and has become a backbone of artificial intelligence

• Improving prediction accuracies
  – Important in real-world applications such as disease identification and self-driving cars
• Fueling new applications
  – Science, finance, healthcare, automotive
  – Data modalities: image, voice, time series, text

Page 15

Availability of large amounts of data, improved processing capabilities, and advances in neural networks have contributed to the success of deep learning

• Availability of a lot of data
  – The digital universe is expected to grow to 163 ZB by 2025
• Availability of large datasets
  – ImageNet, MNIST
• Advancements in hardware capabilities made it possible to execute previously infeasible computational tasks
  – Co-processors / accelerators
  – Tesla V100 capable of 15 TFLOPS of single-precision operations
• Deep learning networks became more advanced
  – Convolutional networks, Recurrent Neural Networks, Long Short-Term Memory (LSTM)
  – Residual networks, Xception, VGG19, InceptionV3, InceptionResNetV2

ImageNet is a large database of (URLs of) about 14 million images with more than 20000 different classes. MNIST contains > 60k hand-written digits with labels.

Page 16

Large, complex models and training on large datasets improve the classification accuracy of deep learning algorithms

• State-of-the-art accuracies are achieved using very large models
  – Millions / billions of parameters, large number of layers and depth
• The neural network of Google Brain [1] has 137 billion parameters!

Model              Top-1 Accuracy %   No. of Parameters   Network Depth
Xception           79.0               22,910,480          126
VGG16              71.5               138,357,544         23
VGG19              72.7               143,667,240         26
ResNet50           75.9               25,636,712          168
InceptionV3        78.8               23,851,784          159
InceptionResNetV2  80.4               55,873,736          572
MobileNet          66.5               4,253,964           88

Size and depth of the models, Top-1 accuracy on the ImageNet dataset [2]

Note: Recently, Google's NASNet [3] reached an accuracy of 82.7% with 22.6 million parameters.
[1] Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538, 2017.
[2] https://keras.io/applications/
[3] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv preprint arXiv:1707.07012, 2017.
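The parameter counts in the table above can be reproduced directly from the Keras applications referenced in [2]. The following is a minimal sketch, assuming a Keras 2.x installation with `keras.applications` available; weights are not needed just to count parameters.

```python
# Minimal sketch: parameter counts of pre-defined Keras application models.
# Assumes Keras 2.x; weights=None builds the architecture without downloading anything.
from keras.applications import VGG16, ResNet50, InceptionV3, MobileNet

for builder in (VGG16, ResNet50, InceptionV3, MobileNet):
    model = builder(weights=None)
    print(builder.__name__, model.count_params())
```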

Page 17

Efficient distributed deep learning across a large number of nodes is critical for modern deep learning applications

• It takes hours, and even days, to train today's models on a single node
  – The models will only become bigger and more complex
• The amount of training data will only increase
  – Increases in GPU processing power cannot keep up
• Billions of parameters will not fit in GPU memory

Figure: Training time in hours for InceptionV3, ResNet-50, ResNet-152, and VGG16 (NVIDIA DGX-1, 1 GPU, Tesla P100, ImageNet dataset, TensorFlow)

Data: https://www.tensorflow.org/performance/benchmarks

Page 18

Distributed deep learning should be scalable, should not affect the accuracy of the classification, and should fully utilize available resources

• Benchmarking distributed deep learning systems
  – Same hardware and software stack
    • Some frameworks are more scalable than others
  – Trained to the same accuracy
    • A 0.01% change in accuracy could mean a substantial effect on speed-up!
  – Same models
    • Some models are more scalable than others, due to the computation / communication ratio
• Scaling efficiency and communication overhead (see the sketch below)
  – Ratio between the training time of one iteration on one node and the total training time when distributed over n nodes
  – Reporting is affected by using n nodes with 1 GPU each vs. n/8 nodes with 8 GPUs each
• Accuracy
  – Can be affected by the batch size; a larger batch size usually proves beneficial for distributed deep learning
  – Number of epochs required for convergence
  – Learning rate adaptation, trade-off between runtime and accuracy

Figure: Ideal vs. actual scalability (processing power vs. number of nodes)

Figure: Input data split into mini-batches 1..n, each processed in parallel before the parameters are updated

Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint arXiv:1706.02677, 2017.
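As a concrete illustration of the scaling metrics discussed in the bullets above, here is a minimal Python sketch; the function names and the example timings are illustrative assumptions rather than measurements from the slides.

```python
# Illustrative helpers for the scalability metrics discussed above.

def speedup(t_single_node, t_n_nodes):
    """Speed-up: training time on one node divided by training time on n nodes."""
    return t_single_node / t_n_nodes

def scaling_efficiency(t_single_node, t_n_nodes, n_nodes):
    """Fraction of the ideal (linear) speed-up actually achieved on n nodes."""
    return speedup(t_single_node, t_n_nodes) / n_nodes

# Hypothetical timings for one training run (hours):
t1, t16 = 40.0, 5.0
print(speedup(t1, t16))                  # 8.0x
print(scaling_efficiency(t1, t16, 16))   # 0.5 -> 50% scaling efficiency
```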

Page 19

This presentation will introduce distributed deep learning, walk through prominent techniques, and identify existing challenges and future directions

Introduction and Motivation

Existing Techniques and Toolsets

Future Directions


Page 20

Scaling efficiency of distributed deep learning can be improved at different levels, and a co-design approach is needed

A. Distribute Across Nodes
  • Over multiple GPUs on a single node
  • Over multiple nodes

B. Design Better Models
  • Computationally light models
  • Models that scale better
  • Models that adapt

C. Improve Hardware
  • Hardware with more processing capabilities
  • Interconnects with more bandwidth

Figure: Co-design stack, with deep learning frameworks on top, supported by deep learning acceleration libraries (NVIDIA cuDNN, MKL-DNN), single-node and multi-node communication libraries (NCCL, Gloo), and heterogeneous hardware (CPUs, GPUs, FPGAs, interconnect technologies)

Page 21


Distribute Across Nodes

Page 22

There are three parallelization methods employed in distributed implementations of deep learning applications

Model Parallelism
• The neural network is partitioned across machines, parallelizing mathematical operations across machines

Data Parallelism
• The data is partitioned across machines; each machine holds a local copy of the model; synchronization is required

Hybrid Approaches
• Data partitioning for some parts of the neural network, model partitioning for other parts; automatic selection is also possible

Page 23

Data Parallelism: In synchronous data parallelism, the algorithm includes two parts: summation of the local gradients and broadcast of the global weights to the workers

Figure: Each worker $j$ ($j = 1, \dots, p$) holds a copy of the model and a data shard, computes its local gradient $\nabla w_j$, and sends it to the master. The master updates the global weights,
$$\bar{w} = \bar{w} - \frac{\eta}{p} \sum_{j=1}^{p} \nabla w_j,$$
and broadcasts $\bar{w}$ back to all workers.
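A minimal NumPy sketch of this synchronous update, simulating the master and p workers in a single process; the helper names (compute_local_gradient, synchronous_step) and the least-squares loss are illustrative assumptions, and no real communication library is involved.

```python
import numpy as np

def compute_local_gradient(w, shard_x, shard_y):
    # Gradient of a least-squares loss on this worker's data shard
    # (a stand-in for a real backpropagation pass).
    pred = shard_x @ w
    return shard_x.T @ (pred - shard_y) / len(shard_y)

def synchronous_step(w, shards, eta=0.1):
    # Each worker computes a gradient on its shard (in parallel in reality);
    # the master averages them and broadcasts the new weights.
    grads = [compute_local_gradient(w, x, y) for x, y in shards]
    w = w - eta / len(shards) * np.sum(grads, axis=0)
    return w  # "broadcast": every worker receives the same updated w

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
p = 4  # number of workers
shards = list(zip(np.array_split(X, p), np.array_split(y, p)))
w = np.zeros(5)
for _ in range(100):
    w = synchronous_step(w, shards)
```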

Page 24

Data Parallelism: Barrier synchronization before each update, overlapping communication and computation, or asynchronous approaches are generally employed

• Barrier Synchronization
  – All workers wait after each mini-batch for global synchronization via the master (aka parameter server)
  – Communication can be overlapped with computation
  – Accuracy is maintained
• Asynchronous Approach
  – The master sends the global weights back to a worker as soon as it has finished its calculation after receiving that worker's local update
  – Performance is better than with barrier synchronization
  – Workers remain slightly out of sync, which affects accuracy

Dean, Jeffrey, et al. "Large scale distributed deep networks." Advances in neural information processing systems. 2012.

Page 25

Data Parallelism: Big data processing frameworks and resource management systems also make good candidates for distributed deep learning

• Spark-based Solutions
  – Integration of deep learning frameworks, e.g. TensorFlow, with Spark
    • The ClusterManager takes the role of the parameter server
    • Mesos is used as the resource manager
  – Performance issues
    • Iterative computations

Moritz, Philipp, et al. "Sparknet: Training deep networks in spark." arXiv preprint arXiv:1511.06051, 2015.
Kim, Hanjoo, et al. "Deepspark: Spark-based deep learning supporting asynchronous updates and caffe compatibility." CoRR, vol. abs/1602.08191, 2016.

Page 26

Model Parallelism: The neural network is divided into several mini-models and distributed across multiple nodes

• Highest accuracy possible
  – As if the model were calculated on a single machine
• Feasible when the number of features is huge
  – Or the computational capacity per machine is very low
• Frequent synchronization, scaling limitations
  – Problems with larger batch sizes

Dettmers, Tim. "8-bit approximations for parallelism in deep learning." arXiv preprint arXiv:1511.04561, 2015.
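To illustrate the idea (and not any particular framework's implementation), here is a minimal NumPy sketch that splits one fully connected layer column-wise across two simulated nodes, so each node stores and computes only its share of the weights; in a real system the two halves would live on different devices and gathering the partial outputs would be a communication step.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 100))            # a mini-batch of 32 inputs

# Full layer: 100 inputs -> 64 outputs, split column-wise over two "nodes"
W = rng.normal(size=(100, 64))
W_node0, W_node1 = W[:, :32], W[:, 32:]   # each node holds half of the neurons

# Each node computes only its part of the layer output (in parallel in reality)
out0 = np.maximum(x @ W_node0, 0.0)       # ReLU on node 0's half
out1 = np.maximum(x @ W_node1, 0.0)       # ReLU on node 1's half

# Gathering the partial outputs corresponds to communication in a real system
out = np.concatenate([out0, out1], axis=1)
assert np.allclose(out, np.maximum(x @ W, 0.0))
```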

Page 27

Hybrid Approach: Some parts of the neural network are divided using model parallelism while other parts employ data parallelism

Source: http://singa.incubator.apache.org/

Example: Apache Singa

Page 28

Hybrid Approach: Automatic selection of the best possible parallelism model has also been proposed in the literature

Source: Tofu – Parallelizing Deep Learning Systems with Automatic Tiling, Minjie Wang, 2017

Example: Tofu

Automatic generation of neural networks using evolutionary algorithms, ORNL.

Page 29

Communication efficiency plays an important role in the scalability of distributed deep learning algorithms

Communication overhead is the difference between the runtime of one iteration when distributed over n processing units and the runtime of one iteration on a single unit

Communication Algorithms and Libraries
• Topology-aware, adaptive to the configuration

Figure: Communication topology among multiple GPUs (GPU 0 to GPU 4)

Interconnect Bandwidth and Latency
• Inter-node communication, intra-node communication

Page 30

In large-scale systems, a hierarchy of interconnect technologies is used, giving different bandwidths, latencies, and connectivity

• NVLink: 60-80 GB/s with 4x
• PCIe Gen3: 12-16 GB/s with 16x
• IB EDR / IB HDR, 100G/40G Ethernet: 100 Gbit/s with 4x EDR

Page 31

In large-scale systems, a hierarchy of interconnect technologies is used, yielding different bandwidths, latencies, and connectivity

• In such distributed scenarios, topology and routing also play an important role
  – The communication algorithm must adapt

Page 32

The communication algorithm can optimize utilization of the available link capacity by avoiding contention in data transfers

• Collective operations (see the sketch below)
  – All-Reduce operations in synchronous data parallelism
  – All-Gather and broadcast in asynchronous data parallelism / model parallelism
• Communication Libraries
  – MPI is very mature, standardized, and provides excellent scale-out performance
    • Scale-up?
  – NCCL
    • Scales up well; scaling across multiple nodes is an issue even though it is supported now
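As an example of the All-Reduce collective mentioned above, here is a minimal sketch of averaging gradients across workers with mpi4py (assuming an MPI installation and mpi4py are available); the gradient array is a dummy stand-in for real backpropagation output.

```python
# Run under MPI, e.g.: mpirun -np 4 python <this_script>.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Dummy local gradient; in practice this comes from backpropagation
local_grad = np.full(10, float(rank), dtype=np.float64)

# All-Reduce: every worker ends up with the element-wise sum of all gradients
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size   # average, as in synchronous data parallelism

if rank == 0:
    print("averaged gradient:", global_grad)
```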

Page 33


Design Scalable Models

Page 34

Distributed deep learning can be accelerated through the use of computationally lighter & scalable models and define-by-run methodologies

Computationally Lighter Models
• Some models are inherently more computationally expensive than others, e.g. VGG vs. Inception

Scalable Models
• Models that are more scalable, tolerant to delayed parameter updates

Define-by-Run Paradigm
• Dynamic graph updates, potential for adaptation to changing data

Page 35

Models can also be designed in a way that makes them more scalable and better suited for distributed deep learning

Source: https://www.tensorflow.org/performance/benchmarks

Example: VGG16 vs ResNet-152 – Single node, ImageNet, TensorFlow

Page 36

In define-by-run frameworks, graphs are constructed on the fly, giving more flexibility and potential performance improvements

• Chainer
  – The first define-by-run deep learning framework
• TensorFlow
  – Eager execution

Source: Kenta Oono, Preferred Networks Inc.

Figure: Define-and-Run (static graph construction followed by the data feed) vs. Define-by-Run (dynamic graph construction interleaved with the data feed)
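A minimal define-by-run sketch using Chainer (the framework named above): the computational graph is recorded while the operations execute, so ordinary Python control flow can depend on the data itself. The specific operations chosen here are illustrative.

```python
import numpy as np
import chainer
import chainer.functions as F

x = chainer.Variable(np.array([[1.0, -2.0, 3.0]], dtype=np.float32))

# The graph is built as these lines run (define-by-run), so the control
# flow below can depend on intermediate results.
h = F.relu(x)
if float(F.sum(h).data) > 1.0:
    h = h * 2.0

y = F.sum(h)          # scalar output
y.backward()          # backpropagate through the graph recorded at runtime
print(x.grad)
```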

Page 37


Improve Hardware

Page 38

Hardware advances will still play an integral role in improving the efficiency of deep learning algorithms

Current Trends
• High-performance GPUs, off-the-shelf deep learning and AI systems, FPGAs, TPUs

Figure: Time in minutes to train ResNet-50 (90 epochs) on different hardware (Intel Skylake, KNL 7250, 4x Tesla P100, 8x Tesla P100), as reported by Preferred Networks (2017), Y. You et al. (2017), IBM Power DDL (2017), Facebook (2017), and Condreanu et al. (2017)

Currently big players with a lot of (advanced) hardware are winning!

Page 39

This presentation will introduce distributed deep learning, walk through prominent techniques, and identify existing challenges and future directions

Introduction and Motivation

Existing Techniques and Toolsets

Future Directions


Page 40

Future research directions include a holistic approach for a variety of workloads & heterogeneous environments, and NN-aware communication

Neural Network Aware Communication
• Pipelining communication with computation, topology awareness, offloading

Holistic Approach
• New workloads are emerging with different toolsets; in modern applications, workloads of different types will co-exist

Heterogeneous Environments
• Accelerators, adaptation to the configuration, dynamic resource allocations, clouds (challenges related to multi-tenancy)

Improved Algorithms and Models
• Lower communication requirements, good accuracy with large batch sizes, adaptive learning rates, automatic neural network generation

Page 41

References – Incomplete list


Dean, Jeffrey, et al. "Large scale distributed deep networks." Advances in neural information processing systems. 2012.

Cho, Minsik, et al. "PowerAI DDL." arXiv preprint arXiv:1708.02188 (2017).

Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint arXiv:1706.02677 (2017).

https://www.tensorflow.org/

Akiba, Takuya, Shuji Suzuki, and Keisuke Fukuda. "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes." arXiv preprint arXiv:1711.04325 (2017).

Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).

Chilimbi, Trishul M., et al. "Project Adam: Building an Efficient and Scalable Deep Learning Training System." OSDI. Vol. 14. 2014.

Abadi, Martín, et al. "Tensorflow: Large-scale machine learning on heterogeneous distributed systems." arXiv preprint arXiv:1603.04467 (2016).

Tokui, Seiya, et al. "Chainer: a next-generation open source framework for deep learning." Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS). Vol. 5. 2015.

Jozefowicz, Rafal, et al. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016).

https://www.nvidia.com/en-us/deep-learning-ai/

https://keras.io/

Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." Proceedings of COMPSTAT'2010. Physica-Verlag HD, 2010. 177-186.

Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv preprint arXiv:1707.07012 (2017).