SEOUL | Oct.7, 2016
Han Hee Song, PhD
Soft On Net
SCALABLE DISTRIBUTED DEEP LEARNING
BATCH PROCESSING FRAMEWORKS FOR DL
• Data parallelism provides efficient big data processing: data collection, feeding, and cleaning
• Easy integration with distributed data storage solutions: HBase, Hive, HDFS
• Fault tolerance provides reliable, fail-safe job scheduling
Additionally, for Apache Spark…
• In-memory data processing without storing intermediate results to disk
A nice single framework for deep learning. However...
CHALLENGES OF USING SPARK FOR DL
TRADITIONAL BATCH PROCESSING TASKS
• Each task is independently mappable to mappers
• Synchronously updatable for global updates among workers
SHARDS OF DEEP LEARNING TASKS
• Gradient descent used in DL incurs lots of communication among tasks
• DL parameter updating is asynchronous
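To make the communication challenge concrete, here is a back-of-the-envelope estimate of per-step gradient traffic in data-parallel training. The parameter count is an assumption for illustration (GoogLeNet, used later in the deck, has roughly 7M parameters); the formula and figures are not from the slides.

```python
# Illustrative estimate of per-step gradient traffic in data-parallel
# training: each worker pushes its gradients and pulls updated
# parameters, so num_params float32 values move each way, per worker.

def gradient_traffic_mb(num_params, num_workers, bytes_per_value=4):
    """Megabytes moved across the network per training step."""
    per_worker = 2 * num_params * bytes_per_value  # push + pull
    return num_workers * per_worker / 1e6

# 16 workers on a ~7M-parameter model: roughly 896 MB of traffic
# every single step, which is why naive synchronous updates hurt.
print(gradient_traffic_mb(7_000_000, 16))
```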
DISTRIBUTED DEEP LEARNING
Use multiple model replicas to process different examples concurrently
Data parallel Deep Learning* on Spark
* Jeff Dean, NIPS, 2013
[Diagram] Spark driver coordinating Spark workers, with training data on HDFS
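The data-parallel scheme above can be sketched in a few lines: each model replica computes a gradient on its own data shard, and the driver averages the gradients and updates the shared weights. This is a minimal plain-Python sketch of the idea (a toy 1-D linear model instead of a real network, sequential instead of Spark), not the actual implementation.

```python
# Minimal sketch of data-parallel training: model replicas compute
# gradients on their own shards; the driver averages them and updates
# the shared weight. Toy model y = w * x with squared-error loss.

def shard_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, w=0.0, lr=0.01, steps=100):
    for _ in range(steps):
        # Each replica would run on its own worker; simulated serially here.
        grads = [shard_gradient(w, s) for s in shards]
        w -= lr * sum(grads) / len(grads)  # driver averages and updates
    return w

# Data generated from y = 3x, split round-robin across 4 "workers".
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
print(round(train(shards), 3))  # converges toward 3.0
```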
DISTRIBUTED TRAINING ARCHITECTURE
[Architecture diagram]
• Training data on HDFS; trained model written back to HDFS
• Hadoop YARN monitors CPU and main memory usage
• Apache Oozie schedules jobs when new training is needed
• Spark driver coordinates the job (control flow and data flow)
• Parameter server holds the shared model parameters
• Spark executors, each running a model trainer on a CPU with shared virtual GPUs
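The parameter server in this architecture can be sketched with a push/pull interface: trainers asynchronously pull the latest weights, compute a gradient, and push it back without waiting for the other workers. The interface and locking scheme below are illustrative assumptions, not Soft On Net's implementation; the toy gradient just pulls every weight toward 1.0.

```python
import threading

# Sketch of an asynchronous parameter server: model trainers pull the
# current weights and push gradients independently (asynchronous SGD).

class ParameterServer:
    def __init__(self, num_params, lr=0.1):
        self.weights = [0.0] * num_params
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return list(self.weights)

    def push(self, gradients):
        # Apply each gradient as soon as it arrives; no global barrier.
        with self.lock:
            for i, g in enumerate(gradients):
                self.weights[i] -= self.lr * g

def trainer(ps, steps):
    for _ in range(steps):
        w = ps.pull()                  # pull latest (possibly stale) weights
        grad = [wi - 1.0 for wi in w]  # toy gradient: pull weights toward 1.0
        ps.push(grad)                  # push without waiting for peers

ps = ParameterServer(num_params=2)
workers = [threading.Thread(target=trainer, args=(ps, 50)) for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
print([round(w, 2) for w in ps.weights])  # both weights end up near 1.0
```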
TRAINING PERFORMANCE
[Figure] Training GoogLeNet with Batch Normalization: normalized error vs. learning time in hours (lower error is better). Curves: Baseline 1 GPU, UC Berkeley SparkNet (4x4 GPU), Deep Datamining / Soft On Net (4x4 GPU), Yahoo CaffeOnSpark (4x4 GPU).
Parallelizing training with 16 GPUs yields 5x faster convergence
DISTRIBUTED INFERENCE
Problems
• Significant performance degradation due to IPC communication delay
• Low utilization of the GPU
Solution
• Within each GPU card, improve GPU occupancy
• Across the cluster, promote fine-grained resource allocation among machines
For an operational system, efficiency in inference matters much more than efficiency in training:
real-time processing of 1,000s of videos
INFERENCE PERFORMANCE
[Figure] Time taken to process k 60-second videos on a single GPU: execution time in seconds vs. number of tasks (k = 1…11).
Each GPU can only process 2 videos in real-time
EFFICIENT USE OF CORES IN EACH GPU
Observation
• Inference on a deep network does not use all 2,500 GPU cores
• By default, the GPU dedicates all cores to one task at a time
Solution: Nvidia Multi-Process Service (MPS)
• Launches multiple kernels concurrently
• Concurrently processes multiple tasks, each partially occupying GPU cores
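A toy model makes the benefit visible: if each inference task occupies only a fraction of the GPU's cores, the default mode still serializes tasks, while an MPS-style mode runs tasks concurrently until the GPU is full. The 25% per-task occupancy below is an assumed figure for illustration, not a measurement from the slides.

```python
import math

# Toy makespan model for GPU sharing: default mode runs one task at a
# time; MPS-style sharing runs as many tasks as fit by occupancy.

def makespan(num_tasks, task_time, occupancy, mps=False):
    """Total time units to finish num_tasks, each taking task_time."""
    concurrent = math.floor(1.0 / occupancy) if mps else 1
    waves = math.ceil(num_tasks / concurrent)
    return waves * task_time

# 8 tasks, 1 time unit each, each using 25% of the GPU's cores:
print(makespan(8, 1, 0.25))            # default mode: 8 (fully serialized)
print(makespan(8, 1, 0.25, mps=True))  # MPS mode: 2 (4 tasks at a time)
```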
EFFICIENT USE OF CORES IN EACH GPU
Micro-benchmark using Nvidia Visual Profiler
[Screenshots] Left: default single-process mode. Right: Multi-Process Service (MPS) mode.
Source: Priyanka, Improving GPU utilization with MPS, Nvidia GTC 2015
INFERENCE PERFORMANCE WITH MPS
[Figure] Time taken to process k 60-second videos on a single GPU: execution time in seconds vs. number of tasks (k = 1…11), comparing default single-process mode with Multi-Process Service (MPS).
Each GPU can now process as many as 8 videos in real-time
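Combining the two measurements, the cluster-sizing impact is easy to work out: at 2 real-time videos per GPU without MPS versus 8 with MPS, the GPU count for a given workload drops 4x. The 1,000-video target comes from the earlier slide; the derived GPU counts are my arithmetic, not figures quoted in the deck.

```python
import math

# Back-of-the-envelope cluster sizing from the measured per-GPU
# real-time capacity, with and without MPS.

def gpus_needed(num_videos, videos_per_gpu):
    return math.ceil(num_videos / videos_per_gpu)

print(gpus_needed(1000, 2))  # without MPS: 500 GPUs for 1,000 videos
print(gpus_needed(1000, 8))  # with MPS: 125 GPUs for the same workload
```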
DISTRIBUTED INFERENCE ARCHITECTURE
[Architecture diagram]
• Video on HDFS, converted from video to images; trained model loaded from HDFS
• Hadoop YARN monitors GPUs, main memory, and CPU
• Apache Oozie schedules jobs when a new video arrives
• Spark driver coordinates the job (control flow and data flow)
• Spark executors, each running object detection and trajectory tracking on a CPU with a partial GPU
• YARN labeling assigns tasks: up to 4 x 8 on one node group, up to 1 x 8 on another
• Outputs: discovered object metadata on HBASE, annotated video on HDFS
SMART EYE
A cost-effective, reliable, and mobile solution for both small and large consumers
Deep-learning based video surveillance on the cloud
SmartEye serves small-scale home users, mid-size enterprises, and large, multi-site enterprises, government, etc.
Copyright © 2016 deep datamining, Inc.
THANK YOU