DLBooster: Boosting End-to-End Deep Learning Workflows with
Offloading Data Preprocessing Pipelines
Yang Cheng, Dan Li, Zhiyuan Guo, Binyao Jiang, Jiaxin Lin, Xi Fan, Xinyi Yu, Wei Bai, Lei Qu, Ran Shu, Peng Cheng, Yongqiang Xiong, and Jianping Wu
Aug. 8th, 2019 | Kyoto, ICPP 2019
Outline
1. Background & Motivation
2. Design & Implementation
3. Evaluation
4. Limitation & Discussion
5. Conclusion & Future work
1.1 Deep Learning Ecosystems
Software support for deep learning
Hardware support for deep learning:
GPU@NVIDIA, FPGA@Intel, RDMA NIC@Mellanox, TPU@Google
*Note: pictures used on this slide are borrowed from the Internet.
1.1 Deep Learning Ecosystems (cont.)
Architecture support for deep learning
Computation & communication optimizations for deep learning:
Parameter Server (PS, TensorFlow), Ring (Horovod), Hybrid communication (BML, NeurIPS'18)
Replacing float32 with float16; Poseidon (USENIX ATC'17)
*Note: pictures used on this slide are borrowed from the Internet.
1.2 Deep Learning accelerations
Teams | Batch | Hardware | Software | Top-1 accuracy | Time
Microsoft Research | 256 | Tesla P100 ×8 | Caffe | 75.3% | 29 h
Facebook | 8K | Tesla P100 ×256 | Caffe2 | 76.3% | 1 h
SURFsara | 32K | KNL ×1024 | Intel Caffe | 75.3% | 42 min
Google | 16K | full TPU pod | TensorFlow | 76.1% | 30 min
Preferred Networks | 32K | KNL ×2048 | Chainer | 74.9% | 15 min
Tencent | 64K | Tesla P40 ×2048 | TensorFlow | 75.8% | 6.6 min
Sony | 68K | Tesla V100 ×2176 | NNL | 75.03% | 224 s
Table: A record* of training ResNet-50 from different teams
DL applications have been sped up to a great extent.
* Last updated on Nov. 13th, 2018
1.2 Deep Learning accelerations (cont.)
Most existing accelerations for DL focus on the computation of NNs (forward/backward passes, synchronization):
[Figure: input preprocessing (original picture @20×12 → resized @5×5 → cropped @3×3) feeding the Conv and FC layers for training/inference]
• Distributed training with large batch sizes
• Overlapping computation and communication
• Compressing gradients to reduce communication cost
• …
1.3 An end-to-end view of the DL workflow
[Figure: the full stack of the DL workflow, taking an image DL application as an example — input preprocessing (original picture @20×12 → resized @5×5 → cropped @3×3), then Conv and FC layers for training/inference]
Data preprocessing can significantly affect the overall performance of DL applications, particularly in the cloud.
Data preprocessing: preparing data for the computation of NNs
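For an image workload, this preprocessing boils down to decode → resize → crop, the pipeline shown in the figure above. Below is a minimal C++ sketch of those three steps using OpenCV; the sizes and the function name are illustrative, not the paper's code:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Decode -> resize -> crop, the same three steps as in the figure.
cv::Mat preprocess(const std::vector<unsigned char>& jpeg_bytes) {
    // Decode the compressed JPEG into a BGR pixel matrix.
    cv::Mat decoded = cv::imdecode(jpeg_bytes, cv::IMREAD_COLOR);

    // Resize to a fixed edge length (256x256 is a common choice, assumed here).
    cv::Mat resized;
    cv::resize(decoded, resized, cv::Size(256, 256));

    // Center-crop to the network input size (224x224, as ResNet-50 expects).
    const int crop = 224;
    cv::Rect roi((resized.cols - crop) / 2, (resized.rows - crop) / 2, crop, crop);
    return resized(roi).clone();
}
```

Every training sample passes through this path, which is why the backend that runs it can dominate end-to-end performance.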
1.3 An end-to-end view of the DL workflow (cont.)
Data preprocessing can significantly affect the overall performance of DL applications, particularly in the cloud.
Training AlexNet on ground-truth data (ILSVRC2012) using NVCaffe on a server with 2 P100 GPUs; the batch size is set to 256 images/GPU.
LMDB: an offline backend
• decodes data first, and reloads it from memory or disk when it is used
• We spent 2 hours processing the training data, while the full training takes 8 hours for 100 epochs
CPU-based: an online backend
• decodes data online with CPU cores
• By default, NVCaffe launches 4 decoding threads for each GPU engine
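To make the offline style concrete, here is a minimal sketch (ours, not the paper's code) of reading pre-decoded records back with the standard LMDB C API; the database path is a placeholder and error handling is reduced to asserts:

```cpp
#include <lmdb.h>
#include <cassert>

// Iterate all records of an LMDB database opened read-only. Each value is
// an already-decoded sample, so no CPU decoding happens here: the 2-hour
// decoding cost has been paid offline.
void read_all(const char* db_path) {
    MDB_env* env;
    assert(mdb_env_create(&env) == 0);
    assert(mdb_env_open(env, db_path, MDB_RDONLY, 0664) == 0);

    MDB_txn* txn;
    MDB_dbi dbi;
    MDB_cursor* cur;
    assert(mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn) == 0);
    assert(mdb_dbi_open(txn, nullptr, 0, &dbi) == 0);
    assert(mdb_cursor_open(txn, dbi, &cur) == 0);

    MDB_val key, val;
    while (mdb_cursor_get(cur, &key, &val, MDB_NEXT) == 0) {
        // consume val.mv_data / val.mv_size (one decoded sample)
    }
    mdb_cursor_close(cur);
    mdb_txn_abort(txn);
    mdb_env_close(env);
}
```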
1.4 Data preprocessing in practice
• Offline backends (LMDB, TFRecord, RecordIO)
  • significant effort (time cost) to process the data first
  • not suitable for online inference
• Online backends
  • Burning CPUs to decode at runtime is inefficient
  • Burning GPUs is expensive
[Chart: throughput for inference on ResNet-50 (latency @ 8 ms), in images processed per second — CPU: 100; P100 (TensorFlow): 610; P100 (TensorRT): 1320; V100 (TensorRT): 5100]
• Each CPU core (E5-2630 @ 2.4 GHz) can decode (including resizing) no more than 300 images per second using the OpenCV suite.
• Both GPUs and CPUs are charged at high prices in the cloud
Performance of ResNet-50 inference on typical GPUs deployed in the cloud
In the cloud (where > 90% of TF workloads are deployed):
1.5 Motivation
The limitations of existing data preprocessing backends in DL systems motivate us to design DLBooster, which aims at:
• offering high-performance decoding services at runtime;
• suiting different DL workflows (offline & online);
• being cost-effective (benefiting both users and cloud service providers)
DLBooster selectively offloads data preprocessing workloads to FPGAs
2 DLBooster Design: Arch overview
DLBooster co-designs software and hardware
An overview of DLBooster architecture. There are 4 control planes in the logical view. The data flow is: disk/NIC → FPGA decoder → host memory → device memory
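The last hop of this data flow (host memory → device memory) is performed by the asynchronous dispatcher described in Section 2.2. Below is a minimal sketch with the CUDA runtime API, assuming the decoded batch sits in a pinned host buffer (our illustration, not the paper's code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Copy one decoded batch to a GPU without blocking the next batch: each GPU
// engine gets its own stream, so copies overlap with FPGA decoding and with
// compute on other streams.
void dispatch_batch(const void* host_batch, void* dev_batch,
                    std::size_t bytes, int gpu_id, cudaStream_t stream) {
    cudaSetDevice(gpu_id);
    cudaMemcpyAsync(dev_batch, host_batch, bytes,
                    cudaMemcpyHostToDevice, stream);
    // The caller synchronizes on `stream` (or records an event) just before
    // the training engine consumes `dev_batch`.
}
```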
2.1 FPGA decoder design
DataReader & MMU:
• parsing commands
• fetching data from devices
• tracking the status of each data block
K-way decoding kernel:
• decoding data at runtime
K-way resizer:
• resizing the decoded data to the fixed size
The decoder in the FPGA device, taking an image DL application as an example
Both the decoding kernel and the resizer scale freely, according to the constraints of the FPGA resources and the task being processed
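The slides do not give the command layout; the backup slide only shows cmds(meta[blks]). The struct below is a hypothetical illustration of the per-block metadata the DataReader might parse before feeding the K-way kernels:

```cpp
#include <cstdint>

// Hypothetical descriptor for one file block queued to the FPGA decoder.
struct BlockCmd {
    uint64_t phys_addr;   // physical address of the block in the memory pool
    uint32_t length;      // bytes of compressed data in the block
    uint32_t sample_id;   // lets the MMU track the status of each sample
    uint16_t out_width;   // fixed output width produced by the K-way resizer
    uint16_t out_height;  // fixed output height produced by the K-way resizer
};
```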
Selectively offload data preprocessing workloads to FPGAs
2.2 DLBooster software design
Components of DLBooster:
• DLBoosterMgr:
  • isolates the DLBooster backend from other backends
  • contains one or more FPGAReaders
  • globally shared by multiple computation engines
• FPGAReader:
  • bound to one FPGA device
  • drives the decoder in the FPGA
Software components of DLBooster
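A structural sketch of the two components as we read them; only the names DLBoosterMgr and FPGAReader come from the slides, the member functions are assumptions:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Bound to exactly one FPGA device; drives its decoder.
class FPGAReader {
public:
    explicit FPGAReader(int fpga_id) : fpga_id_(fpga_id) {}
    void submit_batch(/* block metadata */) { /* drive the FPGA decoder */ }
private:
    int fpga_id_;
};

// One manager is globally shared by all computation engines; it isolates the
// DLBooster backend from the other data preprocessing backends and owns one
// or more readers.
class DLBoosterMgr {
public:
    FPGAReader& reader(std::size_t i) { return *readers_.at(i); }
private:
    std::vector<std::unique_ptr<FPGAReader>> readers_;
};
```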
2.2 DLBooster software design (cont.)
More implementation details:
• Submission pipelines:
  • pipelining multiple batches to overlap FPGA computation and data synchronization
• Memory pool:
  • allocating and managing memory (1 GB) in contiguous space with HugePages, instead of plain mmap
  • mapping memory between physical addresses (for the FPGA decoder) and virtual addresses (for the dispatcher); see the sketch after this list
• Asynchronous data dispatcher:
  • synchronizing processed data from host memory to GPU memory for multi-GPU engines
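A sketch of the memory-pool idea under our own assumptions: a contiguous buffer backed by HugePages (which the administrator must reserve beforehand) and a virtual-to-physical translation via /proc/self/pagemap (requires root), giving the FPGA decoder the physical addresses it needs. Illustrative, not the paper's code:

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Reserve a buffer from HugePages: large physically contiguous chunks and
// fewer TLB misses than ordinary mmap pages.
void* alloc_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Translate a virtual address to a physical one via /proc/self/pagemap:
// each 8-byte entry stores the physical frame number in bits 0-54.
uint64_t virt_to_phys(const void* virt) {
    const long page = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;
    uint64_t entry = 0;
    off_t off = ((uintptr_t)virt / page) * sizeof(entry);
    pread(fd, &entry, sizeof(entry), off);
    close(fd);
    uint64_t pfn = entry & ((1ULL << 55) - 1);
    return pfn * page + (uintptr_t)virt % page;
}
```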
3.1 Configuration
Item | Offline training | Online inference
Hardware (shared) | GPU: 2× Tesla P100; CPU: 2× Intel Xeon E5-2630 v3 @ 2.4 GHz (32 cores in all); memory: 64 GB DDR4 @ 2133 MHz; disk: Intel NVMe Optane 900P; FPGA: Intel Arria 10 AX; network: 40 Gbps Ethernet
Software (shared) | OS: CentOS 7; CUDA v9.0; cuDNN v7.1.3; NCCL v2.2.13
Models | LeNet-5, AlexNet, ResNet-18 (fp16) | GoogLeNet, VGG-16, ResNet-50
Dataset | MNIST, ILSVRC2012 | ILSVRC2012, sent at runtime over TCP
DL engine | NVCaffe v0.17 | TensorRT v4.0
Baselines | LMDB, CPU-based | CPU-based, nvJPEG*
We evaluate DLBooster end-to-end with ground-truth data in different DL workflows, comparing it against the other backends.
Configuration list for evaluation
*nvJPEG is a GPU-based decoder for image DL applications
3.2 Results of offline training
Throughput when training DL models with NVCaffe using different data preprocessing backends:
• When training LeNet-5, all training data are cached in memory, but the overhead of locks and copies (in the baselines) is non-negligible
• When training AlexNet and ResNet-18 (fp16), the overhead of locks between multiple threads (in the baselines) still hurts the overall performance
3.2 Results of offline training (cont.)
CPU cost when training DL models with NVCaffe using different data preprocessing backends:
• The CPU-based data preprocessing backend burns more than 10 cores to supply one GPU when training at runtime
• DLBooster consumes only 0.3 CPU core per GPU when offering online decoding services
3.3 Results of online inference
Latency of online inference of DL models with TensorRT, based on different backends
Throughput of online inference of DL models with TensorRT, based on different backends
3.3 Results of online inference (cont.)
CPU cost of online inference of DL models, using TensorRT with different data preprocessing backends:
• The CPU-based data preprocessing backend burns more CPU cores to supply one GPU when serving online inference
• nvJPEG and DLBooster consume an acceptable number of CPU cores when offering online decoding services
4 Limitation and Discussion
Trade-off: DLBooster uses extra FPGAs to handle part of the data preprocessing workload
• boosting the end-to-end performance of DL workflows
• low power consumption by reducing CPU/GPU cost
• potential for deploying DL applications in the cloud
Concerns about programming FPGAs:
• The FPGA ecosystem is getting better (OpenCL)
• extending more decoding kernels to different DL applications (video, audio, NLP, etc.) [future work]
• further optimizations of the FPGA decoder [future work]
5 Conclusion and Future work
Data preprocessing is becoming a bottleneck in end-to-end DL workflows with GPUs (in the cloud):
• performance degradation
• non-negligible CPU/GPU cost
DLBooster is designed as an accelerator to speed up end-to-end DL workflows:
• offloading data preprocessing to FPGAs
• co-designing software and hardware
Results:
• offering online decoding services
• reducing CPU cost and improving the end-to-end performance of DL workflows
5 Conclusion and Future work (cont.)
Extensions for more DL applications:
• implementing decoding kernels for video, audio, NLP, etc.
Further improvements with CPU bypass:
• directly fetching data from NIC memory*
• directly writing processed data to GPU memory
* Directly fetching data from the NVMe disk is already done in this work
Thanks
Backup
File blocks translation
Adapting to the FPGA decoder
Input: feed-in samples (organized in the format of blocks).
Output: the processed data (at physical addresses).
[Figure: file-block translation. Samples ("Hello,", "World") are laid out as blocks with recorded start offsets (Blk.start_1, Blk.start_2); the available blocks are translated (Trans, HdParm) and submitted to the FPGA driver as commands carrying block metadata (cmds(meta[blks])); after FPGA processing, the results are written to physical addresses backed by HugePages, which are mapped to virtual addresses for the host.]