Industrial-Level Deep Learning Training Infrastructure: Practice and Experience from SenseTime
Shengen Yan
SenseTime Group Limited.
The Success of Deep Learning
[Figure: Google Search interest in deep learning, 2006-01 through 2016-01; the curve takes off after AlexNet won ImageNet in 2012.]
What Led to the Success?
Model Capacity: The Key to High Performance
[Chart: network depth by model]

Model:     LeNet   AlexNet (2012)   GoogLeNet (2014)   ResNet (2016)   Ours
# Layers:  5       8                22                 169             1207
Computation Power
Training time: years → months → weeks → days
Accelerating training from several years down to several days!
01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.
02 DeepLink: a large-scale cluster platform designed for deep learning.
03 Applications: delivers many application models.
Deep Learning is Complicated
The deep learning community developed frameworks to make life easier.
[Figure: the GoogLeNet (2014) architecture.]
Deep Learning Training Frameworks
‣ SenseTime Deep Learning Training Package (Parrots)
• Memory efficient
• Computation efficient
• Both model parallelism & data parallelism (see the sketch after this list)
• Supports huge models
• Scalability
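To make the parallelism bullet concrete, here is a minimal NumPy sketch (illustrative only, not the package's implementation; all names are made up) contrasting data parallelism, which replicates the weights and shards the batch, with model parallelism, which shards the weights themselves and is what lets a model too large for one GPU be trained:

```python
import numpy as np

W = np.random.randn(512, 1024)   # one layer's weight matrix
x = np.random.randn(256, 512)    # a mini-batch of activations

# Data parallel: every "device" holds a full copy of W and sees
# only a shard of the batch; gradients would later be allreduced.
batch_shards = np.array_split(x, 4)
y_dp = np.concatenate([s @ W for s in batch_shards])

# Model parallel: each "device" owns a column slice of W and sees
# the full batch; partial outputs are concatenated feature-wise.
weight_slices = np.array_split(W, 4, axis=1)
y_mp = np.concatenate([x @ w for w in weight_slices], axis=1)

# Both decompositions reproduce the single-device result.
assert np.allclose(y_dp, x @ W)
assert np.allclose(y_mp, x @ W)
```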
Memory Footprint Optimization
High-level compiler-backend optimizations applied to an intermediate representation.
Optimizations: liveness analysis and computation-graph rewriting.
[Figure: generated graph with mirror (re-compute) nodes.]
Chen, T., Xu, B., Zhang, C., et al. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174, 2016.
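The mirror (re-compute) nodes trade extra forward computation for memory, the same idea as the gradient checkpointing of Chen et al. Below is a minimal PyTorch sketch of that technique (illustrative only; the slide indicates Parrots applies it at the intermediate-representation level rather than through this API):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all
# stay resident in memory until the backward pass completes.
blocks = nn.Sequential(
    *[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(16)]
)
x = torch.randn(32, 256, requires_grad=True)

# Split the stack into 4 segments: only segment-boundary activations
# are stored, and interior ones are re-computed ("mirrored") during
# backward -- sublinear memory at the cost of roughly one extra
# forward pass.
out = checkpoint_sequential(blocks, 4, x)
out.sum().backward()
```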
Memory Footprint Optimization
Model Capacity
[Chart: memory usage efficiency (higher is better) on VGG, ResNet50, ResNet152, Inception V4, ResNet269, and Inception-ResNet, comparing Ours against MXNet, TensorFlow, Chainer, Caffe, and Torch.]
Single-GPU Performance
Time per iteration in milliseconds (lower is better):

Framework     Batch-32   Batch-64   Batch-128
Caffe         497.5      1045       1965
Chainer       200        290        543
TensorFlow    178.6      315.7      587.2
Parrots       122.7      225.6      471
Communication Optimization
Supports multi-GPU and multi-node training
Three procedures: Copy, Allreduce, Copy
Optimizations:
• Master-slave threads to overlap communication with computation
• GPU direct communication
• Ring allreduce message passing (see the sketch below)
[Diagram: GPU0-GPU3 copy gradients to CPU memory, allreduce with other nodes, then copy the result back.]
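To show what the ring allreduce step does, here is a self-contained simulation (illustrative only, not SenseTime's communication library): N workers each hold a gradient array split into N chunks; a reduce-scatter phase accumulates partial sums around the ring, then an allgather phase circulates the finished chunks.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Simulate ring allreduce over n in-process 'workers'."""
    n = len(worker_grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in worker_grads]

    # Phase 1, reduce-scatter: in each of n-1 steps, worker r sends
    # chunk (r - step) % n to its right neighbour, which accumulates
    # it. Afterwards worker r owns the fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] += data

    # Phase 2, allgather: circulate the finished chunks for n-1 more
    # steps, overwriting instead of accumulating.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]

# Four workers, each holding a different gradient vector.
grads = [np.arange(8.0) * (r + 1) for r in range(4)]
reduced = ring_allreduce(grads)
# Every worker ends with the same element-wise sum of all gradients.
assert all(np.allclose(out, sum(grads)) for out in reduced)
```

Because each worker transmits only about 2(N-1)/N of the data in total, per-link traffic stays nearly constant as workers are added, which is what makes the ring pattern attractive at scale.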
Scalability
[Chart: milliseconds per iteration (left axis, 0-12000) and scale efficiency (right axis, 0-1.2) versus number of GPUs (1, 2, 3, 4, 8, 16, 24, 32), for single-node and multi-node configurations.]
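For reading the efficiency curve: assuming weak scaling with a fixed per-GPU batch size (the slide does not state its definition), scale efficiency at N GPUs is T_iter(1) / T_iter(N), so a flat milliseconds-per-iteration curve corresponds to an efficiency near 1.0.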
01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.
02 DeepLink: a large-scale cluster platform designed for deep learning.
03 Applications: delivers many application models.
The Role of the Supercomputer
It is like the highway system of a city: a key piece of AI infrastructure.
Supercomputing Centers for AI: the key infrastructure for AI research.
[Diagram: DeepLink at the center of DATA, COMPUTATION, and MODEL.]
Challenges
‣ Interconnects at multiple levels
• GPUs, Nodes, Sub-networks
‣ Distributed data
• Random access becomes particularly difficult
‣ Scale vs. stability
• Failures of individual nodes/links
‣ Human resources
• Engineers who understand both deep learning and HPC are difficult to come by
DeepLink Clusters: Designed for Deep Learning
[Diagram: software/hardware co-design, pairing high-performance hardware with customized middleware.]
Maximize respective strengths while ensuring optimal cooperation.
Hardware:
• High-speed interconnects
• High-performance GPU computing
• Efficient distributed storage

Middleware:
• Distributed storage & cache system (optimized for small files)
• Distributed deep learning framework
• Task scheduling & monitoring
Platform Overview
[Diagram: the DeepLink software platform stack]
• Heterogeneous deep learning supercomputer
• High-speed storage system
• Operation/maintenance/monitoring system
• Lightweight virtualization
• Task scheduling system
• Distributed training software
• Deep learning training visualization system
• Customized communication library for deep learning
• Computation library
• Distributed cache system
Training Visualization
DeepLink in SenseTime
>3000 GPUs
01 Deep Learning Package: a deep learning framework that is efficient, scalable, and flexible.
02 DeepLink: a large-scale cluster platform designed for deep learning.
03 Applications: delivers many application models.
THANK YOU