GPU Resource Pooling and the benefit of deploying CUPTI

Lingling Jin, Lingjie Xu
Alibaba Group
Alibaba is an AI company

• E-commerce
• Search
• Advertisement
• Financing & payment
• Video
• Logistics

All these services are backed by Alibaba Cloud.
GPU pool

• IDC data center
• GPU accelerator powered data center
• GPU servers are managed elastically (Kubernetes, KVM)

However, GPU utilization is still low.
Agenda

• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over Ethernet
GPU inference service pool

Model Zoo
• GPUs are allocated at single-card granularity
• Capacity is designed for worst-case traffic
• Is mixing different inference workloads on one GPU possible?

(Figure: two Model1 instances and one Model2 instance, each deployed on its own GPU)
Experiment setup

• Single-GPU V100 system
• Model: resnet50
• Start the inference server and capture NVML GPU utilization at the backend
• Client performance test:
  • multiple concurrent requests
  • max allowed latency
  • desired batch size
• Calculate server throughput (a client-side sketch follows this list)
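A minimal sketch of such a client-side test, assuming a hypothetical send_inference_request() (not part of the talk) in place of the real inference-server RPC:

```cpp
// Client-side throughput test sketch: N concurrent workers issue batched
// requests for a fixed duration; only replies within the latency bound count.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<long> g_completed{0};

// Hypothetical stand-in for one batched inference RPC; the sleep
// substitutes for real server latency.
void send_inference_request(int /*batch_size*/) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

void client_worker(double max_latency_ms, int batch_size,
                   std::chrono::steady_clock::time_point deadline) {
    while (std::chrono::steady_clock::now() < deadline) {
        auto t0 = std::chrono::steady_clock::now();
        send_inference_request(batch_size);
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        if (ms <= max_latency_ms) g_completed += batch_size;  // in-SLA only
    }
}

int main() {
    const int concurrency = 8, batch_size = 4, duration_s = 60;
    const double max_latency_ms = 50.0;
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::seconds(duration_s);

    std::vector<std::thread> workers;
    for (int i = 0; i < concurrency; ++i)
        workers.emplace_back(client_worker, max_latency_ms, batch_size, deadline);
    for (auto& w : workers) w.join();

    // Throughput = in-SLA samples served / wall-clock seconds.
    std::printf("throughput: %.1f samples/s\n",
                g_completed.load() / (double)duration_s);
}
```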
NVML GPU utilization

GPU_utilization: percent of time over the past sample period during which one or more kernels was executing on the GPU.
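For reference, a minimal sketch of that backend sampling with the NVML C API (compile against nvml.h, link with -lnvidia-ml); it polls exactly the counter defined above:

```cpp
// Poll NVML GPU utilization for device 0 every 100 ms.
#include <chrono>
#include <cstdio>
#include <thread>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // first GPU in the system
    for (int i = 0; i < 100; ++i) {
        nvmlUtilization_t util;
        // util.gpu is the percent of time one or more kernels was
        // executing over the past sample period (the definition above).
        nvmlDeviceGetUtilizationRates(dev, &util);
        std::printf("GPU util: %u%%  memory util: %u%%\n", util.gpu, util.memory);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    nvmlShutdown();
}
```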
CUPTI

• The CUDA Profiling Tools Interface (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications.
• All events/metrics available in NVPROF:
  • sm_efficiency
  • achieved_occupancy
  • dram_read/write
  • single/double/half_precision_fu_utilization
• Collects GPU metrics during runtime
• 1%-10% overhead
SM efficiency: the percentage of time at least one warp is active on a multiprocessor, averaged over all multiprocessors on the GPU.

CUPTI is what we want.
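A minimal sketch of runtime collection with the legacy CUPTI Event API (link with -lcuda -lcupti), assuming an event named active_cycles; event names vary by GPU generation, continuous collection mode is restricted to Tesla-class devices, and error handling is omitted:

```cpp
// Sample the active_cycles event once per 60 ms window while other
// threads/processes keep the GPU busy, as in the experiments below.
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda.h>
#include <cupti.h>

int main() {
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Continuous mode reads the counter while kernels run, instead of
    // the default per-kernel collection.
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);

    CUpti_EventID event;
    cuptiEventGetIdFromName(dev, "active_cycles", &event);

    CUpti_EventGroup group;
    cuptiEventGroupCreate(ctx, &group, 0);
    cuptiEventGroupAddEvent(group, event);
    cuptiEventGroupEnable(group);

    for (int i = 0; i < 100; ++i) {  // one sample per 60 ms time stamp
        std::this_thread::sleep_for(std::chrono::milliseconds(60));
        uint64_t value = 0;
        size_t bytes = sizeof(value);
        cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                                 event, &bytes, &value);
        cuptiEventGroupResetAllEvents(group);  // make each sample a delta
        std::printf("active_cycles in window %d: %llu\n", i,
                    (unsigned long long)value);
    }
    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
}
```

Pairing this with an elapsed-cycles counter gives the active/elapsed ratio plotted in the next slides.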
CUPTI SM Efficiency
Exploring two-process GPU sharing
(Figure: "One Alexnet program running"; y-axis: GPU elapsed/active cycle value, x-axis: time stamp (60 ms); series: SM_active_cycles, GPU_elapsed_cycle)
• We used CUPTI to capture the GPU elapsed cycles and SM active cycles.
Exploring multi-process GPU sharing
(Figure: "2 concurrent TensorFlow AlexNet inference applications sharing the GPU"; y-axis: GPU elapsed/active cycle value, x-axis: time stamp (60 ms); series: Task1_elapsed_cycle, Task2_elapsed_cycle)
The two processes time-share the GPU resource; they are NOT concurrent.
(Figure: "2 concurrent TensorFlow AlexNet inference applications sharing the GPU"; y-axis: GPU elapsed/active cycle value, x-axis: time stamp (60 ms); series: Task1_active_cycle, Task1_elapsed_cycle, Task2_active_cycle, Task2_elapsed_cycle)
Exploring multi-process GPU sharing

The GPU active cycles don't overlap, so there is no performance impact when the two run together.
Running two inference servers

• Single-GPU V100 system
• Models: resnet50, Inception
• Batch size 4
Performance with two inference servers

• No QoS control
• Performance overhead
NVIDIA inference server

• Each model is implemented as a GPU stream and can run concurrently on the GPU
• Better control with the scheduler (a minimal streams sketch follows below)
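A minimal sketch of the underlying mechanism, not the inference server's actual code: two kernels standing in for two models are launched on separate CUDA streams and may overlap on one GPU if SM resources allow:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-work kernel standing in for one model's inference computation.
__global__ void fake_model(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < iters; ++k)
            data[i] = data[i] * 0.999f + 0.001f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;  // one stream per "model"
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both launches return immediately; the hardware scheduler can
    // execute them concurrently, unlike the time-shared two-process case.
    fake_model<<<(n + 255) / 256, 256, 0, s1>>>(a, n, 1000);
    fake_model<<<(n + 255) / 256, 256, 0, s2>>>(b, n, 1000);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    std::printf("both streams done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
}
```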
1 server vs. 2 servers performance comparison
Agenda

• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over Ethernet
Why

• Each AI application has a different CPU/GPU ratio, which leads to many types of GPU servers (DGX-2, DGX-1, 1U GPU servers)
• Single GPU cards and GPU servers are becoming so powerful that they cannot be fully utilized
Distributed storage systems: a big success

• Separated CPU (client) and storage (chunk server)
• Stable
• High performance
• Easy to maintain
• Low cost
Distributed acceleration

(Diagram: physical machines, virtual machines, and containers in a CPU pool without GPUs connect over Ethernet (HTTP/gRPC/RDMA/IB) to a distributed computing pool, i.e. an acceleration pool of GPUs, and to a distributed storage pool of HDD/SSD/NVMe.)
GPU pool over Ethernet

(Architecture: on the client, a CUDA application such as TensorFlow (GPU version) calls proxies for the CUDA Runtime, the CUDA libraries (cuDNN/cuBLAS), and the CUDA Driver; a communicator forwards the calls over TCP or RDMA to the server-side communicator and API service, which invoke the real CUDA runtime/library/driver on the GPU.)

• Components can be shared by GPU/FPGA/etc.
• Components can be generated automatically with minimal manual work (a forwarding-stub sketch follows below)
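A minimal sketch of the proxy pattern, with a direct function call standing in for the TCP/RDMA communicator and a hand-written stub standing in for the auto-generated ones; device pointers returned to the client are opaque handles that only mean something on the server:

```cpp
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

enum Op : uint32_t { OP_CUDA_MALLOC = 1 };

struct Request  { Op op; uint64_t size; };
struct Response { int32_t err; uint64_t dev_ptr; };

// Server side: unpack the request and invoke the real CUDA runtime.
Response server_dispatch(const Request& req) {
    Response resp{};
    if (req.op == OP_CUDA_MALLOC) {
        void* p = nullptr;
        resp.err = cudaMalloc(&p, (size_t)req.size);
        resp.dev_ptr = (uint64_t)(uintptr_t)p;
    }
    return resp;
}

// Client side: the proxy's cudaMalloc stub serializes the arguments,
// ships them over the communicator, and unpacks the result.
cudaError_t proxy_cudaMalloc(void** devPtr, size_t size) {
    Request req{OP_CUDA_MALLOC, (uint64_t)size};
    Response resp = server_dispatch(req);  // real system: TCP/RDMA round trip
    *devPtr = (void*)(uintptr_t)resp.dev_ptr;
    return (cudaError_t)resp.err;
}

int main() {
    void* d = nullptr;
    if (proxy_cudaMalloc(&d, 1 << 20) == cudaSuccess)
        std::printf("remote device pointer handle: %p\n", d);
}
```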
Challenge 1: a lot of CUDA APIs

• An intermediate, auto-generated YAML file defines the APIs that need to be forwarded to the remote server
• Some information about pointer arguments needs to be added manually
Challenge 2: host memory

• With a remote GPU, we need to detect host memory (e.g., allocated with cudaMallocHost) in CUDA kernel parameters to ensure functional correctness (a tracking sketch follows below)
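A minimal sketch of one way to do this detection, with hypothetical helper names: the client-side proxy records every cudaMallocHost range so that kernel pointer arguments can later be recognized as host memory whose contents must be shipped to the remote server:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <map>

// Start address -> size of every pinned allocation made through the proxy.
static std::map<const void*, size_t> g_pinned_ranges;

cudaError_t proxy_cudaMallocHost(void** ptr, size_t size) {
    cudaError_t err = cudaMallocHost(ptr, size);  // stays a local allocation
    if (err == cudaSuccess)
        g_pinned_ranges[*ptr] = size;
    return err;
}

// Does p point into a tracked pinned range? Used when scanning kernel
// parameters: such pointers reference client memory the remote GPU
// cannot see, so the bytes must be copied across before the launch.
bool is_pinned_host_ptr(const void* p) {
    auto it = g_pinned_ranges.upper_bound(p);
    if (it == g_pinned_ranges.begin()) return false;
    --it;
    return (const char*)p < (const char*)it->first + it->second;
}

int main() {
    void* h = nullptr;
    proxy_cudaMallocHost(&h, 4096);
    std::printf("tracked: %d\n", (int)is_pinned_host_ptr((char*)h + 100));
}
```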
Challenge 3: hide API forwarding latency

• Changing blocking API calls to asynchronous calls (sketched below)
• Benefit: 2x speedup

(Figure: before vs. after; API run time in the server vs. API run time in the client)
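A minimal sketch of the latency-hiding idea, with a hypothetical AsyncForwarder in place of the generated client stubs: calls whose results the client does not need immediately are queued and return right away, and explicit synchronization points drain the queue:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class AsyncForwarder {
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    bool in_flight_ = false;
    std::thread worker_{[this] { run(); }};

    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
            if (q_.empty()) return;  // stop requested and fully drained
            auto call = std::move(q_.front());
            q_.pop();
            in_flight_ = true;
            lk.unlock();
            call();  // the actual network round trip to the server
            lk.lock();
            in_flight_ = false;
            cv_.notify_all();  // wake anyone blocked in sync()
        }
    }

public:
    // Enqueue a forwarded API call and return immediately (the "after" case).
    void async_call(std::function<void()> call) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(std::move(call));
        cv_.notify_all();
    }

    // Block until every queued call has reached the server, e.g. when the
    // application synchronizes or a call's return value is actually needed.
    void sync() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return q_.empty() && !in_flight_; });
    }

    ~AsyncForwarder() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
};
```

The client then pays the forwarding round trip only at synchronization points rather than on every call, which is where the reported 2x speedup comes from.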
CNN Training Performance

• Website: https://aimatrix.ai
• Code repo: https://github.com/alibaba/ai-matrix
Summary

• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over Ethernet
• All results are preliminary and we are actively working on them
• Discussions and collaborations are welcome
• Email: [email protected]
Thanks! Q&A