GPU Resource Pooling and the benefit of deploying CUPTI

Lingling Jin, Lingjie Xu
Alibaba Group
Alibaba is an AI company

• E-commerce
• Search
• Advertisement
• Financing & payment
• Video
• Logistics

All these services are backed by Alibaba Cloud.
GPU pool

• IDC data center
• GPU accelerator powered data center
• GPU servers are managed elastically (Kubernetes, KVM)

However, GPU utilization is still low.
Agenda

• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over Ethernet
GPU inference service pool

Model Zoo
• GPUs are allocated at single-card granularity
• Capacity is designed for worst-case traffic
• Is mixing different inference workloads on one GPU possible?

(Figure: two Model1 instances and one Model2 instance, each deployed on its own GPU)
Experiment setup

• Single-GPU V100 system
• Model: resnet50
• Start the inference server and capture NVML GPU utilization at the backend
• Client performance test:
  • multiple concurrent requests
  • max allowed latency
  • desired batch size
• Calculate server throughput (a client-side sketch follows this list)
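A minimal sketch of such a client-side test, assuming a hypothetical send_inference_request() (not part of the talk) in place of the real inference-server RPC:

```cpp
// Client-side throughput test sketch: N concurrent workers issue batched
// requests for a fixed duration; only replies within the latency bound count.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<long> g_completed{0};

// Hypothetical stand-in for one batched inference RPC; the sleep
// substitutes for real server latency.
void send_inference_request(int /*batch_size*/) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

void client_worker(double max_latency_ms, int batch_size,
                   std::chrono::steady_clock::time_point deadline) {
    while (std::chrono::steady_clock::now() < deadline) {
        auto t0 = std::chrono::steady_clock::now();
        send_inference_request(batch_size);
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        if (ms <= max_latency_ms) g_completed += batch_size;  // in-SLA only
    }
}

int main() {
    const int concurrency = 8, batch_size = 4, duration_s = 60;
    const double max_latency_ms = 50.0;
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::seconds(duration_s);

    std::vector<std::thread> workers;
    for (int i = 0; i < concurrency; ++i)
        workers.emplace_back(client_worker, max_latency_ms, batch_size, deadline);
    for (auto& w : workers) w.join();

    // Throughput = in-SLA samples served / wall-clock seconds.
    std::printf("throughput: %.1f samples/s\n",
                g_completed.load() / (double)duration_s);
}
```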
NVML GPU utilization

GPU_utilization: percent of time over the past sample period during which one or more kernels was executing on the GPU.
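For reference, a minimal sketch of that backend sampling with the NVML C API (compile against nvml.h, link with -lnvidia-ml); it polls exactly the counter defined above:

```cpp
// Poll NVML GPU utilization for device 0 every 100 ms.
#include <chrono>
#include <cstdio>
#include <thread>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // first GPU in the system
    for (int i = 0; i < 100; ++i) {
        nvmlUtilization_t util;
        // util.gpu is the percent of time one or more kernels was
        // executing over the past sample period (the definition above).
        nvmlDeviceGetUtilizationRates(dev, &util);
        std::printf("GPU util: %u%%  memory util: %u%%\n", util.gpu, util.memory);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    nvmlShutdown();
}
```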
CUPTI

• The CUDA Profiling Tools Interface (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications.
• All events/metrics available in NVPROF:
  • sm_efficiency
  • achieved_occupancy
  • dram_read/write
  • single/double/half_precision_fu_utilization
• Collects GPU metrics during runtime
• 1%-10% overhead
SM efficiency: the percentage of time at least one warp is active on a multiprocessor, averaged over all multiprocessors on the GPU.

CUPTI is what we want.
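A minimal sketch of runtime collection with the legacy CUPTI Event API (link with -lcuda -lcupti), assuming an event named active_cycles; event names vary by GPU generation, continuous collection mode is restricted to Tesla-class devices, and error handling is omitted:

```cpp
// Sample the active_cycles event once per 60 ms window while other
// threads/processes keep the GPU busy, as in the experiments below.
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda.h>
#include <cupti.h>

int main() {
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Continuous mode reads the counter while kernels run, instead of
    // the default per-kernel collection.
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);

    CUpti_EventID event;
    cuptiEventGetIdFromName(dev, "active_cycles", &event);

    CUpti_EventGroup group;
    cuptiEventGroupCreate(ctx, &group, 0);
    cuptiEventGroupAddEvent(group, event);
    cuptiEventGroupEnable(group);

    for (int i = 0; i < 100; ++i) {  // one sample per 60 ms time stamp
        std::this_thread::sleep_for(std::chrono::milliseconds(60));
        uint64_t value = 0;
        size_t bytes = sizeof(value);
        cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                                 event, &bytes, &value);
        cuptiEventGroupResetAllEvents(group);  // make each sample a delta
        std::printf("active_cycles in window %d: %llu\n", i,
                    (unsigned long long)value);
    }
    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
}
```

Pairing this with an elapsed-cycles counter gives the active/elapsed ratio plotted in the next slides.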
CUPTI SM Efficiency
Exploring two-process GPU sharing
(Figure: "One Alexnet program running"; y-axis: GPU elapsed/active cycle value, x-axis: time stamp (60 ms); series: SM_active_cycles, GPU_elapsed_cycle)
• We used CUPTI to capture the GPU elapsed cycles and SM active cycles.
Exploring multi-process GPU sharing
(Figure: "2 concurrent TensorFlow AlexNet inference applications sharing the GPU"; y-axis: GPU elapsed/active cycle value, x-axis: time stamp (60 ms); series: Task1_elapsed_cycle, Task2_elapsed_cycle)
The two processes time-share the GPU resource; they are NOT concurrent.
(Figure: "2 concurrent TensorFlow AlexNet inference applications sharing the GPU"; y-axis: GPU elapsed/active cycle value, x-axis: time stamp (60 ms); series: Task1_active_cycle, Task1_elapsed_cycle, Task2_active_cycle, Task2_elapsed_cycle)
Exploring multi-process GPU sharing

The GPU active cycles don't overlap, so there is no performance impact when the two run together.
Running two inference servers

• Single-GPU V100 system
• Models: resnet50, Inception
• Batch size 4
Performance with two inference servers

• No QoS control
• Performance overhead
NVIDIA inference server

• Each model is implemented as a GPU stream and can run concurrently on the GPU
• Better control with the scheduler (a minimal streams sketch follows below)
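A minimal sketch of the underlying mechanism, not the inference server's actual code: two kernels standing in for two models are launched on separate CUDA streams and may overlap on one GPU if SM resources allow:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-work kernel standing in for one model's inference computation.
__global__ void fake_model(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < iters; ++k)
            data[i] = data[i] * 0.999f + 0.001f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;  // one stream per "model"
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both launches return immediately; the hardware scheduler can
    // execute them concurrently, unlike the time-shared two-process case.
    fake_model<<<(n + 255) / 256, 256, 0, s1>>>(a, n, 1000);
    fake_model<<<(n + 255) / 256, 256, 0, s2>>>(b, n, 1000);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    std::printf("both streams done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
}
```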
1 server vs. 2 servers performance comparison
Agenda

• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over Ethernet
Why

• Each AI application has a different CPU/GPU ratio, which leads to many types of GPU servers (DGX-2, DGX-1, 1U GPU servers)
• Single GPU cards and GPU servers are becoming so powerful that they cannot be fully utilized
Distributed storage systems: a big success

• Separated CPU (client) and storage (chunk server)
• Stable
• High performance
• Easy to maintain
• Low cost
Distributed acceleration

(Diagram: physical machines, virtual machines, and containers in a CPU pool without GPUs connect over Ethernet (HTTP/gRPC/RDMA/IB) to a distributed computing pool, i.e. an acceleration pool of GPUs, and to a distributed storage pool of HDD/SSD/NVMe.)
GPU pool over Ethernet

(Architecture: on the client, a CUDA application such as TensorFlow (GPU version) calls proxies for the CUDA Runtime, the CUDA libraries (cuDNN/cuBLAS), and the CUDA Driver; a communicator forwards the calls over TCP or RDMA to the server-side communicator and API service, which invoke the real CUDA runtime/library/driver on the GPU.)

• Components can be shared by GPU/FPGA/etc.
• Components can be generated automatically with minimal manual work (a forwarding-stub sketch follows below)
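A minimal sketch of the proxy pattern, with a direct function call standing in for the TCP/RDMA communicator and a hand-written stub standing in for the auto-generated ones; device pointers returned to the client are opaque handles that only mean something on the server:

```cpp
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

enum Op : uint32_t { OP_CUDA_MALLOC = 1 };

struct Request  { Op op; uint64_t size; };
struct Response { int32_t err; uint64_t dev_ptr; };

// Server side: unpack the request and invoke the real CUDA runtime.
Response server_dispatch(const Request& req) {
    Response resp{};
    if (req.op == OP_CUDA_MALLOC) {
        void* p = nullptr;
        resp.err = cudaMalloc(&p, (size_t)req.size);
        resp.dev_ptr = (uint64_t)(uintptr_t)p;
    }
    return resp;
}

// Client side: the proxy's cudaMalloc stub serializes the arguments,
// ships them over the communicator, and unpacks the result.
cudaError_t proxy_cudaMalloc(void** devPtr, size_t size) {
    Request req{OP_CUDA_MALLOC, (uint64_t)size};
    Response resp = server_dispatch(req);  // real system: TCP/RDMA round trip
    *devPtr = (void*)(uintptr_t)resp.dev_ptr;
    return (cudaError_t)resp.err;
}

int main() {
    void* d = nullptr;
    if (proxy_cudaMalloc(&d, 1 << 20) == cudaSuccess)
        std::printf("remote device pointer handle: %p\n", d);
}
```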
Challenge 1: a lot of CUDA APIs

• An intermediate, auto-generated YAML file defines the APIs that need to be forwarded to the remote server
• Some information about pointer arguments needs to be added manually
Challenge 2: host memory

• With a remote GPU, we need to detect host memory (e.g., allocated with cudaMallocHost) in CUDA kernel parameters to ensure functional correctness (a tracking sketch follows below)
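A minimal sketch of one way to do this detection, with hypothetical helper names: the client-side proxy records every cudaMallocHost range so that kernel pointer arguments can later be recognized as host memory whose contents must be shipped to the remote server:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <map>

// Start address -> size of every pinned allocation made through the proxy.
static std::map<const void*, size_t> g_pinned_ranges;

cudaError_t proxy_cudaMallocHost(void** ptr, size_t size) {
    cudaError_t err = cudaMallocHost(ptr, size);  // stays a local allocation
    if (err == cudaSuccess)
        g_pinned_ranges[*ptr] = size;
    return err;
}

// Does p point into a tracked pinned range? Used when scanning kernel
// parameters: such pointers reference client memory the remote GPU
// cannot see, so the bytes must be copied across before the launch.
bool is_pinned_host_ptr(const void* p) {
    auto it = g_pinned_ranges.upper_bound(p);
    if (it == g_pinned_ranges.begin()) return false;
    --it;
    return (const char*)p < (const char*)it->first + it->second;
}

int main() {
    void* h = nullptr;
    proxy_cudaMallocHost(&h, 4096);
    std::printf("tracked: %d\n", (int)is_pinned_host_ptr((char*)h + 100));
}
```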
Challenge 3: hide API forwarding latency

• Changing blocking API calls to asynchronous calls (sketched below)
• Benefit: 2x speedup

(Figure: before vs. after; API run time in the server vs. API run time in the client)
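A minimal sketch of the latency-hiding idea, with a hypothetical AsyncForwarder in place of the generated client stubs: calls whose results the client does not need immediately are queued and return right away, and explicit synchronization points drain the queue:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class AsyncForwarder {
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    bool in_flight_ = false;
    std::thread worker_{[this] { run(); }};

    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
            if (q_.empty()) return;  // stop requested and fully drained
            auto call = std::move(q_.front());
            q_.pop();
            in_flight_ = true;
            lk.unlock();
            call();  // the actual network round trip to the server
            lk.lock();
            in_flight_ = false;
            cv_.notify_all();  // wake anyone blocked in sync()
        }
    }

public:
    // Enqueue a forwarded API call and return immediately (the "after" case).
    void async_call(std::function<void()> call) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(std::move(call));
        cv_.notify_all();
    }

    // Block until every queued call has reached the server, e.g. when the
    // application synchronizes or a call's return value is actually needed.
    void sync() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return q_.empty() && !in_flight_; });
    }

    ~AsyncForwarder() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
};
```

The client then pays the forwarding round trip only at synchronization points rather than on every call, which is where the reported 2x speedup comes from.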
CNN Training Performance

• Website: https://aimatrix.ai
• Code repo: https://github.com/alibaba/ai-matrix
Summary

• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over Ethernet
• All results are preliminary and we are actively working on them
• Discussions and collaborations are welcome
• Email: [email protected]
Thanks! Q&A