SEOUL | Oct.7, 2016
Han Hee Song, PhD
Soft On Net
SCALABLE DISTRIBUTED DEEP LEARNING
BATCH PROCESSING FRAMEWORKS FOR DL
• Data parallelism provides efficient big data processing: data collection, feeding, and cleaning
• Easy integration with distributed data storage solutions: HBase, Hive, HDFS
• Fault tolerance provides reliable, fail-safe job scheduling
Additionally, for Apache Spark…
• In-memory data processing without storing intermediate results to disk
A nice single framework for deep learning. However...
CHALLENGES OF USING SPARK FOR DL
TRADITIONAL BATCH PROCESSING TASKS
• Each task is independently mappable to mappers
• Synchronously updatable for global updates among workers
SHARDS OF DEEP LEARNING TASKS
• Gradient descent used in DL incurs lots of communication among tasks
• DL parameter updating is asynchronous
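To make the communication challenge concrete, here is a back-of-the-envelope estimate of per-step gradient traffic in data-parallel training. The parameter count is an assumption for illustration (GoogLeNet, used later in the deck, has roughly 7M parameters); the formula and figures are not from the slides.

```python
# Illustrative estimate of per-step gradient traffic in data-parallel
# training: each worker pushes its gradients and pulls updated
# parameters, so num_params float32 values move each way, per worker.

def gradient_traffic_mb(num_params, num_workers, bytes_per_value=4):
    """Megabytes moved across the network per training step."""
    per_worker = 2 * num_params * bytes_per_value  # push + pull
    return num_workers * per_worker / 1e6

# 16 workers on a ~7M-parameter model: roughly 896 MB of traffic
# every single step, which is why naive synchronous updates hurt.
print(gradient_traffic_mb(7_000_000, 16))
```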
DISTRIBUTED DEEP LEARNING
Use multiple model replicas to process different examples concurrently
Data parallel Deep Learning* on Spark
* Jeff Dean, NIPS, 2013
[Diagram] Spark driver coordinating Spark workers, with training data on HDFS
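The data-parallel scheme above can be sketched in a few lines: each model replica computes a gradient on its own data shard, and the driver averages the gradients and updates the shared weights. This is a minimal plain-Python sketch of the idea (a toy 1-D linear model instead of a real network, sequential instead of Spark), not the actual implementation.

```python
# Minimal sketch of data-parallel training: model replicas compute
# gradients on their own shards; the driver averages them and updates
# the shared weight. Toy model y = w * x with squared-error loss.

def shard_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, w=0.0, lr=0.01, steps=100):
    for _ in range(steps):
        # Each replica would run on its own worker; simulated serially here.
        grads = [shard_gradient(w, s) for s in shards]
        w -= lr * sum(grads) / len(grads)  # driver averages and updates
    return w

# Data generated from y = 3x, split round-robin across 4 "workers".
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
print(round(train(shards), 3))  # converges toward 3.0
```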
DISTRIBUTED TRAINING ARCHITECTURE
[Architecture diagram]
• Training data on HDFS; trained model written back to HDFS
• Hadoop YARN monitors CPU and main memory usage
• Apache Oozie schedules jobs when new training is needed
• Spark driver coordinates the job (control flow and data flow)
• Parameter server holds the shared model parameters
• Spark executors, each running a model trainer on a CPU with shared virtual GPUs
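The parameter server in this architecture can be sketched with a push/pull interface: trainers asynchronously pull the latest weights, compute a gradient, and push it back without waiting for the other workers. The interface and locking scheme below are illustrative assumptions, not Soft On Net's implementation; the toy gradient just pulls every weight toward 1.0.

```python
import threading

# Sketch of an asynchronous parameter server: model trainers pull the
# current weights and push gradients independently (asynchronous SGD).

class ParameterServer:
    def __init__(self, num_params, lr=0.1):
        self.weights = [0.0] * num_params
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return list(self.weights)

    def push(self, gradients):
        # Apply each gradient as soon as it arrives; no global barrier.
        with self.lock:
            for i, g in enumerate(gradients):
                self.weights[i] -= self.lr * g

def trainer(ps, steps):
    for _ in range(steps):
        w = ps.pull()                  # pull latest (possibly stale) weights
        grad = [wi - 1.0 for wi in w]  # toy gradient: pull weights toward 1.0
        ps.push(grad)                  # push without waiting for peers

ps = ParameterServer(num_params=2)
workers = [threading.Thread(target=trainer, args=(ps, 50)) for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
print([round(w, 2) for w in ps.weights])  # both weights end up near 1.0
```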
TRAINING PERFORMANCE
[Figure] Training GoogLeNet with Batch Normalization: normalized error vs. learning time in hours (lower error is better). Curves: Baseline 1 GPU, UC Berkeley SparkNet (4x4 GPU), Deep Datamining / Soft On Net (4x4 GPU), Yahoo CaffeOnSpark (4x4 GPU).
Parallelizing training with 16 GPUs yields 5x faster convergence
DISTRIBUTED INFERENCE
Problems
• Significant performance degradation due to IPC communication delay
• Low utilization of the GPU
Solution
• Within each GPU card, improve GPU occupancy
• Across the cluster, promote fine-grained resource allocation among machines
For an operational system, efficiency in inference matters much more than efficiency in training:
real-time processing of 1,000s of videos
INFERENCE PERFORMANCE
[Figure] Time taken to process k 60-second videos on a single GPU: execution time in seconds vs. number of tasks (k = 1…11).
Each GPU can only process 2 videos in real-time
EFFICIENT USE OF CORES IN EACH GPU
Observation
• Inference on a deep network does not use all 2,500 GPU cores
• By default, the GPU dedicates all cores to one task at a time
Solution: Nvidia Multi-Process Service (MPS)
• Launches multiple kernels concurrently
• Concurrently processes multiple tasks, each partially occupying GPU cores
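A toy model makes the benefit visible: if each inference task occupies only a fraction of the GPU's cores, the default mode still serializes tasks, while an MPS-style mode runs tasks concurrently until the GPU is full. The 25% per-task occupancy below is an assumed figure for illustration, not a measurement from the slides.

```python
import math

# Toy makespan model for GPU sharing: default mode runs one task at a
# time; MPS-style sharing runs as many tasks as fit by occupancy.

def makespan(num_tasks, task_time, occupancy, mps=False):
    """Total time units to finish num_tasks, each taking task_time."""
    concurrent = math.floor(1.0 / occupancy) if mps else 1
    waves = math.ceil(num_tasks / concurrent)
    return waves * task_time

# 8 tasks, 1 time unit each, each using 25% of the GPU's cores:
print(makespan(8, 1, 0.25))            # default mode: 8 (fully serialized)
print(makespan(8, 1, 0.25, mps=True))  # MPS mode: 2 (4 tasks at a time)
```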
EFFICIENT USE OF CORES IN EACH GPU
Micro-benchmark using Nvidia Visual Profiler
[Screenshots] Left: default single-process mode. Right: Multi-Process Service (MPS) mode.
Source: Priyanka, Improving GPU utilization with MPS, Nvidia GTC 2015
INFERENCE PERFORMANCE WITH MPS
[Figure] Time taken to process k 60-second videos on a single GPU: execution time in seconds vs. number of tasks (k = 1…11), comparing default single-process mode with Multi-Process Service (MPS).
Each GPU can now process as many as 8 videos in real-time
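Combining the two measurements, the cluster-sizing impact is easy to work out: at 2 real-time videos per GPU without MPS versus 8 with MPS, the GPU count for a given workload drops 4x. The 1,000-video target comes from the earlier slide; the derived GPU counts are my arithmetic, not figures quoted in the deck.

```python
import math

# Back-of-the-envelope cluster sizing from the measured per-GPU
# real-time capacity, with and without MPS.

def gpus_needed(num_videos, videos_per_gpu):
    return math.ceil(num_videos / videos_per_gpu)

print(gpus_needed(1000, 2))  # without MPS: 500 GPUs for 1,000 videos
print(gpus_needed(1000, 8))  # with MPS: 125 GPUs for the same workload
```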
DISTRIBUTED INFERENCE ARCHITECTURE
[Architecture diagram]
• Video on HDFS, converted from video to images; trained model loaded from HDFS
• Hadoop YARN monitors GPUs, main memory, and CPU
• Apache Oozie schedules jobs when a new video arrives
• Spark driver coordinates the job (control flow and data flow)
• Spark executors, each running object detection and trajectory tracking on a CPU with a partial GPU
• YARN labeling assigns tasks: up to 4 x 8 on one node group, up to 1 x 8 on another
• Outputs: discovered object metadata on HBASE, annotated video on HDFS
SMART EYE
A cost-effective, reliable, and mobile solution for both small and large consumers
Deep-learning based video surveillance on the cloud
SmartEye serves small-scale home users, mid-size enterprises, and large, multi-site enterprises, government, etc.
Copyright © 2016 deep datamining, Inc.
THANK YOU