DLBooster: Boosting End-to-End Deep Learning Workflows with
Offloading Data Preprocessing Pipelines
Yang Cheng, Dan Li, Zhiyuan Guo, Binyao Jiang, Jiaxin Lin, Xi Fan, Xinyi Yu, Wei Bai, Lei Qu, Ran Shu, Peng Cheng, Yongqiang Xiong, and Jianping Wu
Aug. 8th, 2019 | Kyoto, ICPP 2019
Outline
1. Background & Motivation
2. Design & Implementation
3. Evaluation
4. Limitation & Discussion
5. Conclusion & Future work
1.1 Deep Learning Ecosystems
Software support for deep learning
Hardware support for deep learning:
GPU@NVIDIA, FPGA@Intel, RDMA NIC@Mellanox, TPU@Google
*Note: pictures used on this slide are borrowed from the Internet.
1.1 Deep Learning Ecosystems (cont.)
Architecture support for deep learning
Computation & communication optimizations for deep learning:
Parameter Server (PS, TensorFlow), Ring (Horovod), Hybrid communication (BML, NeurIPS'18)
Replacing float32 with float16; Poseidon (USENIX ATC'17)
*Note: pictures used on this slide are borrowed from the Internet.
1.2 Deep Learning accelerations
Teams | Batch | Hardware | Software | Top-1 accuracy | Time
Microsoft Research | 256 | Tesla P100 ×8 | Caffe | 75.3% | 29 h
Facebook | 8K | Tesla P100 ×256 | Caffe2 | 76.3% | 1 h
SURFsara | 32K | KNL ×1024 | Intel Caffe | 75.3% | 42 min
Google | 16K | full TPU pod | TensorFlow | 76.1% | 30 min
Preferred Networks | 32K | KNL ×2048 | Chainer | 74.9% | 15 min
Tencent | 64K | Tesla P40 ×2048 | TensorFlow | 75.8% | 6.6 min
Sony | 68K | Tesla V100 ×2176 | NNL | 75.03% | 224 s
Table: A record* of training ResNet-50 from different teams
DL applications have been sped up to a great extent.
* Last updated on Nov. 13th, 2018
1.2 Deep Learning accelerations (cont.)
Most existing accelerations for DL focus on the computation of NNs (forward/backward passes, synchronization):
[Figure: input preprocessing (original picture @20×12 → resized @5×5 → cropped @3×3) feeding the Conv and FC layers for training/inference]
• Distributed training with large batch sizes
• Overlapping computation and communication
• Compressing gradients to reduce communication cost
• …
1.3 An end-to-end view of the DL workflow
[Figure: the full stack of the DL workflow, taking an image DL application as an example — input preprocessing (original picture @20×12 → resized @5×5 → cropped @3×3), then Conv and FC layers for training/inference]
Data preprocessing can significantly affect the overall performance of DL applications, particularly in the cloud.
Data preprocessing: preparing data for the computation of NNs
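For an image workload, this preprocessing boils down to decode → resize → crop, the pipeline shown in the figure above. Below is a minimal C++ sketch of those three steps using OpenCV; the sizes and the function name are illustrative, not the paper's code:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Decode -> resize -> crop, the same three steps as in the figure.
cv::Mat preprocess(const std::vector<unsigned char>& jpeg_bytes) {
    // Decode the compressed JPEG into a BGR pixel matrix.
    cv::Mat decoded = cv::imdecode(jpeg_bytes, cv::IMREAD_COLOR);

    // Resize to a fixed edge length (256x256 is a common choice, assumed here).
    cv::Mat resized;
    cv::resize(decoded, resized, cv::Size(256, 256));

    // Center-crop to the network input size (224x224, as ResNet-50 expects).
    const int crop = 224;
    cv::Rect roi((resized.cols - crop) / 2, (resized.rows - crop) / 2, crop, crop);
    return resized(roi).clone();
}
```

Every training sample passes through this path, which is why the backend that runs it can dominate end-to-end performance.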
1.3 An end-to-end view of the DL workflow (cont.)
Data preprocessing can significantly affect the overall performance of DL applications, particularly in the cloud.
Training AlexNet on ground-truth data (ILSVRC2012) using NVCaffe on a server with 2 P100 GPUs; the batch size is set to 256 images/GPU.
LMDB: an offline backend
• decodes data first, and reloads it from memory or disk when it is used
• We spent 2 hours processing the training data, while the full training takes 8 hours for 100 epochs
CPU-based: an online backend
• decodes data online with CPU cores
• By default, NVCaffe launches 4 decoding threads for each GPU engine
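To make the offline style concrete, here is a minimal sketch (ours, not the paper's code) of reading pre-decoded records back with the standard LMDB C API; the database path is a placeholder and error handling is reduced to asserts:

```cpp
#include <lmdb.h>
#include <cassert>

// Iterate all records of an LMDB database opened read-only. Each value is
// an already-decoded sample, so no CPU decoding happens here: the 2-hour
// decoding cost has been paid offline.
void read_all(const char* db_path) {
    MDB_env* env;
    assert(mdb_env_create(&env) == 0);
    assert(mdb_env_open(env, db_path, MDB_RDONLY, 0664) == 0);

    MDB_txn* txn;
    MDB_dbi dbi;
    MDB_cursor* cur;
    assert(mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn) == 0);
    assert(mdb_dbi_open(txn, nullptr, 0, &dbi) == 0);
    assert(mdb_cursor_open(txn, dbi, &cur) == 0);

    MDB_val key, val;
    while (mdb_cursor_get(cur, &key, &val, MDB_NEXT) == 0) {
        // consume val.mv_data / val.mv_size (one decoded sample)
    }
    mdb_cursor_close(cur);
    mdb_txn_abort(txn);
    mdb_env_close(env);
}
```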
1.4 Data preprocessing in practice
• Offline backends (LMDB, TFRecord, RecordIO)
  • significant effort (time cost) to process the data first
  • not suitable for online inference
• Online backends
  • Burning CPUs to decode at runtime is inefficient
  • Burning GPUs is expensive
[Chart: throughput for inference on ResNet-50 (latency @ 8 ms), in images processed per second — CPU: 100; P100 (TensorFlow): 610; P100 (TensorRT): 1320; V100 (TensorRT): 5100]
• Each CPU core (E5-2630 @ 2.4 GHz) can decode (including resizing) no more than 300 images per second using the OpenCV suite.
• Both GPUs and CPUs are charged at high prices in the cloud
Performance of ResNet-50 inference on typical GPUs deployed in the cloud
In the cloud (where > 90% of TF workloads are deployed):
1.5 Motivation
The limitations of existing data preprocessing backends in DL systems motivate us to design DLBooster, which aims at:
• offering high-performance decoding services at runtime;
• suiting different DL workflows (offline & online);
• being cost-effective (benefiting both users and cloud service providers)
DLBooster selectively offloads data preprocessing workloads to FPGAs
2 DLBooster Design: Arch overview
DLBooster co-designs software and hardware
An overview of DLBooster architecture. There are 4 control planes in the logical view. The data flow is: disk/NIC → FPGA decoder → host memory → device memory
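The last hop of this data flow (host memory → device memory) is performed by the asynchronous dispatcher described in Section 2.2. Below is a minimal sketch with the CUDA runtime API, assuming the decoded batch sits in a pinned host buffer (our illustration, not the paper's code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Copy one decoded batch to a GPU without blocking the next batch: each GPU
// engine gets its own stream, so copies overlap with FPGA decoding and with
// compute on other streams.
void dispatch_batch(const void* host_batch, void* dev_batch,
                    std::size_t bytes, int gpu_id, cudaStream_t stream) {
    cudaSetDevice(gpu_id);
    cudaMemcpyAsync(dev_batch, host_batch, bytes,
                    cudaMemcpyHostToDevice, stream);
    // The caller synchronizes on `stream` (or records an event) just before
    // the training engine consumes `dev_batch`.
}
```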
2.1 FPGA decoder design
DataReader & MMU:
• parsing commands
• fetching data from devices
• tracking the status of each data block
K-way decoding kernel:
• decoding data at runtime
K-way resizer:
• resizing the decoded data to the fixed size
The decoder in the FPGA device, taking an image DL application as an example
Both the decoding kernel and the resizer scale freely, according to the constraints of the FPGA resources and the task being processed
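The slides do not give the command layout; the backup slide only shows cmds(meta[blks]). The struct below is a hypothetical illustration of the per-block metadata the DataReader might parse before feeding the K-way kernels:

```cpp
#include <cstdint>

// Hypothetical descriptor for one file block queued to the FPGA decoder.
struct BlockCmd {
    uint64_t phys_addr;   // physical address of the block in the memory pool
    uint32_t length;      // bytes of compressed data in the block
    uint32_t sample_id;   // lets the MMU track the status of each sample
    uint16_t out_width;   // fixed output width produced by the K-way resizer
    uint16_t out_height;  // fixed output height produced by the K-way resizer
};
```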
Selectively offload data preprocessing workloads to FPGAs
2.2 DLBooster software design
Components of DLBooster:
• DLBoosterMgr:
  • isolates the DLBooster backend from other backends
  • contains one or more FPGAReaders
  • globally shared by multiple computation engines
• FPGAReader:
  • bound to one FPGA device
  • drives the decoder in the FPGA
Software components of DLBooster
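A structural sketch of the two components as we read them; only the names DLBoosterMgr and FPGAReader come from the slides, the member functions are assumptions:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Bound to exactly one FPGA device; drives its decoder.
class FPGAReader {
public:
    explicit FPGAReader(int fpga_id) : fpga_id_(fpga_id) {}
    void submit_batch(/* block metadata */) { /* drive the FPGA decoder */ }
private:
    int fpga_id_;
};

// One manager is globally shared by all computation engines; it isolates the
// DLBooster backend from the other data preprocessing backends and owns one
// or more readers.
class DLBoosterMgr {
public:
    FPGAReader& reader(std::size_t i) { return *readers_.at(i); }
private:
    std::vector<std::unique_ptr<FPGAReader>> readers_;
};
```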
2.2 DLBooster software design (cont.)
More implementation details:
• Submission pipelines:
  • pipelining multiple batches to overlap FPGA computation and data synchronization
• Memory pool:
  • allocating and managing memory (1 GB) in contiguous space with HugePages, instead of plain mmap
  • mapping memory between physical addresses (for the FPGA decoder) and virtual addresses (for the dispatcher); see the sketch after this list
• Asynchronous data dispatcher:
  • synchronizing processed data from host memory to GPU memory for multi-GPU engines
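A sketch of the memory-pool idea under our own assumptions: a contiguous buffer backed by HugePages (which the administrator must reserve beforehand) and a virtual-to-physical translation via /proc/self/pagemap (requires root), giving the FPGA decoder the physical addresses it needs. Illustrative, not the paper's code:

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Reserve a buffer from HugePages: large physically contiguous chunks and
// fewer TLB misses than ordinary mmap pages.
void* alloc_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Translate a virtual address to a physical one via /proc/self/pagemap:
// each 8-byte entry stores the physical frame number in bits 0-54.
uint64_t virt_to_phys(const void* virt) {
    const long page = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;
    uint64_t entry = 0;
    off_t off = ((uintptr_t)virt / page) * sizeof(entry);
    pread(fd, &entry, sizeof(entry), off);
    close(fd);
    uint64_t pfn = entry & ((1ULL << 55) - 1);
    return pfn * page + (uintptr_t)virt % page;
}
```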
3.1 Configuration
Item | Offline training | Online inference
Hardware (shared) | GPU: 2× Tesla P100; CPU: 2× Intel Xeon E5-2630 v3 @ 2.4 GHz (32 cores in all); memory: 64 GB DDR4 @ 2133 MHz; disk: Intel NVMe Optane 900P; FPGA: Intel Arria 10 AX; network: 40 Gbps Ethernet
Software (shared) | OS: CentOS 7; CUDA v9.0; cuDNN v7.1.3; NCCL v2.2.13
Models | LeNet-5, AlexNet, ResNet-18 (fp16) | GoogLeNet, VGG-16, ResNet-50
Dataset | MNIST, ILSVRC2012 | ILSVRC2012, sent at runtime over TCP
DL engine | NVCaffe v0.17 | TensorRT v4.0
Baselines | LMDB, CPU-based | CPU-based, nvJPEG*
We evaluate DLBooster end-to-end with ground-truth data in different DL workflows, comparing it against the other backends.
Configuration list for evaluation
*nvJPEG is a GPU-based decoder for image DL applications
3.2 Results of offline training
Throughput when training DL models with NVCaffe using different data preprocessing backends:
• When training LeNet-5, all training data are cached in memory, but the overhead of locks and copies (in the baselines) is non-negligible
• When training AlexNet and ResNet-18 (fp16), the overhead of locks between multiple threads (in the baselines) still hurts the overall performance
3.2 Results of offline training (cont.)
CPU cost when training DL models with NVCaffe using different data preprocessing backends:
• The CPU-based data preprocessing backend burns more than 10 cores to supply one GPU when training at runtime
• DLBooster consumes only 0.3 CPU core per GPU when offering online decoding services
3.3 Results of online inference
Latency of online inference of DL models with TensorRT, based on different backends
Throughput of online inference of DL models with TensorRT, based on different backends
3.3 Results of online inference (cont.)
CPU cost of online inference of DL models, using TensorRT with different data preprocessing backends:
• The CPU-based data preprocessing backend burns more CPU cores to supply one GPU when serving online inference
• nvJPEG and DLBooster consume an acceptable number of CPU cores when offering online decoding services
4 Limitation and Discussion
Trade-off: DLBooster uses extra FPGAs to handle part of the data preprocessing workload
• boosting the end-to-end performance of DL workflows
• low power consumption by reducing CPU/GPU cost
• potential for deploying DL applications in the cloud
Concerns about programming FPGAs:
• The FPGA ecosystem is getting better (OpenCL)
• extending more decoding kernels to different DL applications (video, audio, NLP, etc.) [future work]
• further optimizations of the FPGA decoder [future work]
5 Conclusion and Future work
Data preprocessing is becoming a bottleneck in end-to-end DL workflows with GPUs (in the cloud):
• performance degradation
• non-negligible CPU/GPU cost
DLBooster is designed as an accelerator to speed up end-to-end DL workflows:
• offloading data preprocessing to FPGAs
• co-designing software and hardware
Results:
• offering online decoding services
• reducing CPU cost and improving the end-to-end performance of DL workflows
5 Conclusion and Future work (cont.)
Extensions for more DL applications:
• implementing decoding kernels for video, audio, NLP, etc.
Further improvements with CPU bypass:
• directly fetching data from NIC memory*
• directly writing processed data to GPU memory
* Directly fetching data from the NVMe disk is already done in this work
Thanks
Backup
File blocks translation
Adapting to the FPGA decoder
Input: feed-in samples (organized in the format of blocks).
Output: the processed data (at physical addresses).
[Figure: file-block translation. Samples ("Hello,", "World") are laid out as blocks with recorded start offsets (Blk.start_1, Blk.start_2); the available blocks are translated (Trans, HdParm) and submitted to the FPGA driver as commands carrying block metadata (cmds(meta[blks])); after FPGA processing, the results are written to physical addresses backed by HugePages, which are mapped to virtual addresses for the host.]