Industrial-Level Deep Learning Training Infrastructure: Practice and Experience from SenseTime
Shengen Yan
SenseTime Group Limited.
The Success of Deep Learning
[Figure: Google Search interest in deep learning, 2006-01 through 2016-01; the curve takes off after AlexNet won ImageNet in 2012.]
What Led to the Success?
Model Capacity: The Key to High Performance
[Chart: network depth by model]

Model:     LeNet   AlexNet (2012)   GoogLeNet (2014)   ResNet (2016)   Ours
# Layers:  5       8                22                 169             1207
Computation Power
Training time: years → months → weeks → days
Accelerating training from several years down to several days!
01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.
02 DeepLink: a large-scale cluster platform designed for deep learning.
03 Applications: delivers many application models.
Deep Learning is Complicated
The deep learning community developed frameworks to make life easier.
[Figure: the GoogLeNet (2014) architecture.]
Deep Learning Training Frameworks
‣ SenseTime Deep Learning Training Package (Parrots)
• Memory efficient
• Computation efficient
• Both model parallelism & data parallelism (see the sketch after this list)
• Supports huge models
• Scalability
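To make the parallelism bullet concrete, here is a minimal NumPy sketch (illustrative only, not the package's implementation; all names are made up) contrasting data parallelism, which replicates the weights and shards the batch, with model parallelism, which shards the weights themselves and is what lets a model too large for one GPU be trained:

```python
import numpy as np

W = np.random.randn(512, 1024)   # one layer's weight matrix
x = np.random.randn(256, 512)    # a mini-batch of activations

# Data parallel: every "device" holds a full copy of W and sees
# only a shard of the batch; gradients would later be allreduced.
batch_shards = np.array_split(x, 4)
y_dp = np.concatenate([s @ W for s in batch_shards])

# Model parallel: each "device" owns a column slice of W and sees
# the full batch; partial outputs are concatenated feature-wise.
weight_slices = np.array_split(W, 4, axis=1)
y_mp = np.concatenate([x @ w for w in weight_slices], axis=1)

# Both decompositions reproduce the single-device result.
assert np.allclose(y_dp, x @ W)
assert np.allclose(y_mp, x @ W)
```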
Memory Footprint Optimization
High-level compiler-backend optimizations applied to an intermediate representation.
Optimizations: liveness analysis and computation-graph rewriting.
[Figure: generated graph with mirror (re-compute) nodes.]
Chen, T., Xu, B., Zhang, C., et al. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174, 2016.
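The mirror (re-compute) nodes trade extra forward computation for memory, the same idea as the gradient checkpointing of Chen et al. Below is a minimal PyTorch sketch of that technique (illustrative only; the slide indicates Parrots applies it at the intermediate-representation level rather than through this API):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all
# stay resident in memory until the backward pass completes.
blocks = nn.Sequential(
    *[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(16)]
)
x = torch.randn(32, 256, requires_grad=True)

# Split the stack into 4 segments: only segment-boundary activations
# are stored, and interior ones are re-computed ("mirrored") during
# backward -- sublinear memory at the cost of roughly one extra
# forward pass.
out = checkpoint_sequential(blocks, 4, x)
out.sum().backward()
```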
Memory Footprint Optimization
Model Capacity
[Chart: memory usage efficiency (higher is better) on VGG, ResNet50, ResNet152, Inception V4, ResNet269, and Inception-ResNet, comparing Ours against MXNet, TensorFlow, Chainer, Caffe, and Torch.]
Single-GPU Performance
Time per iteration in milliseconds (lower is better):

Framework     Batch-32   Batch-64   Batch-128
Caffe         497.5      1045       1965
Chainer       200        290        543
TensorFlow    178.6      315.7      587.2
Parrots       122.7      225.6      471
Communication Optimization
Supports multi-GPU and multi-node training
Three procedures: Copy, Allreduce, Copy
Optimizations:
• Master-slave threads to overlap communication with computation
• GPU direct communication
• Ring allreduce message passing (see the sketch below)
[Diagram: GPU0-GPU3 copy gradients to CPU memory, allreduce with other nodes, then copy the result back.]
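To show what the ring allreduce step does, here is a self-contained simulation (illustrative only, not SenseTime's communication library): N workers each hold a gradient array split into N chunks; a reduce-scatter phase accumulates partial sums around the ring, then an allgather phase circulates the finished chunks.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Simulate ring allreduce over n in-process 'workers'."""
    n = len(worker_grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in worker_grads]

    # Phase 1, reduce-scatter: in each of n-1 steps, worker r sends
    # chunk (r - step) % n to its right neighbour, which accumulates
    # it. Afterwards worker r owns the fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] += data

    # Phase 2, allgather: circulate the finished chunks for n-1 more
    # steps, overwriting instead of accumulating.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]

# Four workers, each holding a different gradient vector.
grads = [np.arange(8.0) * (r + 1) for r in range(4)]
reduced = ring_allreduce(grads)
# Every worker ends with the same element-wise sum of all gradients.
assert all(np.allclose(out, sum(grads)) for out in reduced)
```

Because each worker transmits only about 2(N-1)/N of the data in total, per-link traffic stays nearly constant as workers are added, which is what makes the ring pattern attractive at scale.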
Scalability
[Chart: milliseconds per iteration (left axis, 0-12000) and scale efficiency (right axis, 0-1.2) versus number of GPUs (1, 2, 3, 4, 8, 16, 24, 32), for single-node and multi-node configurations.]
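For reading the efficiency curve: assuming weak scaling with a fixed per-GPU batch size (the slide does not state its definition), scale efficiency at N GPUs is T_iter(1) / T_iter(N), so a flat milliseconds-per-iteration curve corresponds to an efficiency near 1.0.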
01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.
02 DeepLink: a large-scale cluster platform designed for deep learning.
03 Applications: delivers many application models.
The Role of the Supercomputer
It is like the highway system of a city: a key piece of AI infrastructure.
Supercomputing Centers for AI: the key infrastructure for AI research.
[Diagram: DeepLink at the center of DATA, COMPUTATION, and MODEL.]
Challenges
‣ Interconnects at multiple levels
• GPUs, Nodes, Sub-networks
‣ Distributed data
• Random access becomes particularly difficult
‣ Scale vs. stability
• Failures of individual nodes/links
‣ Human resources
• Engineers who understand both deep learning and HPC are difficult to come by
DeepLink Clusters: Designed for Deep Learning
[Diagram: software/hardware co-design, pairing high-performance hardware with customized middleware.]
Maximize respective strengths while ensuring optimal cooperation.
Hardware:
• High-speed interconnects
• High-performance GPU computing
• Efficient distributed storage

Middleware:
• Distributed storage & cache system (optimized for small files)
• Distributed deep learning framework
• Task scheduling & monitoring
Platform Overview
[Diagram: the DeepLink software platform stack]
• Heterogeneous deep learning supercomputer
• High-speed storage system
• Operation/maintenance/monitoring system
• Lightweight virtualization
• Task scheduling system
• Distributed training software
• Deep learning training visualization system
• Customized communication library for deep learning
• Computation library
• Distributed cache system
Training Visualization
DeepLink in SenseTime
>3000 GPUs
01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.
02 DeepLink: a large-scale cluster platform designed for deep learning.
03 Applications: delivers many application models.
THANK YOU