GPU Resource Pooling and the benefit of deploying CUPTI
Alibaba Group
Lingling Jin, Lingjie Xu
© 2019 Alibaba Group



Page 1:

GPU Resource Pooling and the benefit of deploying CUPTI

Alibaba Group

Lingling Jin, Lingjie Xu

© 2019 Alibaba Group

Page 2:

Alibaba is an AI company


• E-commerce
• Search
• Advertisement
• Financing & payment
• Video
• Logistics

All these services are backed by Alibaba Cloud.

Page 3:

GPU pool

IDC data center

• GPU-accelerator-powered data centers

Page 4:

GPU servers are managed elastically

Kubernetes

KVM

However, GPU utilization is still low.

Page 5:

Agenda

• GPU-accelerated inference service pool: multi-tenant

• GPU resource pool over Ethernet

Page 6:

GPU inference service pool


Model Zoo

• GPUs are allocated at single-card granularity

• Capacity is designed for worst-case traffic

• Is mixing different inference workloads possible?

[Diagram: cards serving Model1, Model1, Model2]

Page 7:

Experiment setup

• Single GPU V100 system

• Model: resnet50

• Start the inference server; capture NVML GPU utilization on the backend

• Client performance test:
  • multiple concurrent requests
  • max allowed latency
  • desired batch size

• Calculate server throughput

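The throughput calculation in the last step can be sketched as a back-of-envelope model. This is our own illustration, not the deck's benchmark code: the function name and the assumption that each concurrent client keeps exactly one batch in flight are ours.

```python
def estimate_throughput(concurrent_requests, batch_size, latency_s):
    """Rough server throughput (samples/sec) when each of the
    concurrent clients keeps one batch of `batch_size` in flight
    and a batch completes every `latency_s` seconds."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return concurrent_requests * batch_size / latency_s

# e.g. 4 concurrent clients, batch size 8, 250 ms max allowed latency
print(estimate_throughput(4, 8, 0.25))  # 128.0 samples/sec
```

Raising concurrency or batch size increases throughput only while the GPU has idle capacity; the latency term is what the max-allowed-latency constraint in the setup bounds.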

Page 8:

NVML GPU utilization

GPU utilization: the percent of time over the past sample period during which one or more kernels was executing on the GPU.
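NVML's definition can be made concrete with a small sketch. This is a simulation over recorded kernel start/end times, not a real NVML call; the function name and interval representation are our assumptions.

```python
def gpu_utilization_percent(kernel_intervals, period_start, period_end):
    """NVML-style GPU utilization: percent of the sample period during
    which one or more kernels was executing (intervals may overlap)."""
    # Clip each (start, end) kernel interval to the sample period.
    clipped = sorted(
        (max(s, period_start), min(e, period_end))
        for s, e in kernel_intervals
        if e > period_start and s < period_end
    )
    # Merge overlapping intervals and sum the busy time.
    busy, cur_start, cur_end = 0.0, None, None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start
    return 100.0 * busy / (period_end - period_start)

# Two overlapping kernels plus one later kernel in a 100 ms period
print(gpu_utilization_percent([(0, 10), (5, 20), (40, 50)], 0, 100))  # 30.0
```

Note what the metric does NOT capture: a kernel that keeps only one SM busy still counts the whole GPU as "executing", which is exactly the blind spot the next slides address.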

Page 9:

CUPTI

• The CUDA Profiling Tools Interface (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications.

• All events/metrics available in nvprof:
  • sm_efficiency
  • achieved_occupancy
  • dram_read/write
  • single/double/half_precision_fu_utilization

• Collects GPU metrics at runtime

• 1%-10% overhead

Page 10:

CUPTI SM Efficiency

SM efficiency: the percentage of time at least one warp is active on a multiprocessor, averaged over all multiprocessors on the GPU.

CUPTI is what we want.
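A toy numerical illustration of why sm_efficiency is the better signal than NVML utilization. The cycle counts are made up, and both functions are our simplified models of the two definitions, not CUPTI or NVML calls.

```python
def sm_efficiency(active_cycles_per_sm, elapsed_cycles):
    """CUPTI-style sm_efficiency: fraction of time at least one warp
    is active on each SM, averaged over all SMs."""
    per_sm = [min(c / elapsed_cycles, 1.0) for c in active_cycles_per_sm]
    return 100.0 * sum(per_sm) / len(per_sm)

def nvml_style_utilization(active_cycles_per_sm, elapsed_cycles):
    """NVML counts the GPU as busy whenever ANY kernel is running,
    so one busy SM out of many still reads as 100%."""
    return 100.0 if any(active_cycles_per_sm) else 0.0

# One SM fully busy, three idle: NVML says 100%, sm_efficiency says 25%
busy = [100, 0, 0, 0]
print(nvml_style_utilization(busy, 100))  # 100.0
print(sm_efficiency(busy, 100))           # 25.0
```

The gap between the two numbers is the headroom that multi-tenant sharing can reclaim.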

Page 11:

Explore two-process GPU sharing

[Chart: one AlexNet program running; SM_active_cycles and GPU_elapsed_cycle (y-axis: GPU elapsed/active cycles) vs. time stamp (60 ms intervals)]

• We used CUPTI to capture GPU_elapsed_cycle and SM_active_cycle

Page 12:

Explore multi-process GPU sharing

[Chart: two concurrent TensorFlow AlexNet inference applications sharing the GPU; Task1_elapsed_cycle and Task2_elapsed_cycle (y-axis: GPU elapsed/active cycles) vs. time stamp (60 ms intervals)]

The two processes time-share the GPU resource; they are NOT concurrent.
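The time-sharing claim can be checked mechanically from the captured windows: if the two tasks' active windows never overlap, they are time-sliced rather than concurrent. This sketch is ours; the window values below are illustrative, not the measured trace.

```python
def shared_busy_time(task1_windows, task2_windows):
    """Total time during which BOTH tasks had active GPU cycles.
    Zero means the processes time-share the GPU rather than run
    concurrently."""
    total = 0.0
    for s1, e1 in task1_windows:
        for s2, e2 in task2_windows:
            # Length of the intersection of the two windows, if any.
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

# Alternating 60 ms windows, as in the plot: no concurrency
task1 = [(0, 60), (120, 180)]
task2 = [(60, 120), (180, 240)]
print(shared_busy_time(task1, task2))  # 0.0
```

A nonzero result would instead indicate genuine kernel-level concurrency, which plain multi-process sharing does not provide without something like MPS or per-model streams (next slides).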

Page 13:

Explore multi-process GPU sharing

[Chart: two concurrent TensorFlow AlexNet inference applications sharing the GPU; Task1_active_cycle, Task1_elapsed_cycle, Task2_active_cycle, and Task2_elapsed_cycle (y-axis: GPU elapsed/active cycle value) vs. time stamp (60 ms intervals)]

The GPU active cycles don't overlap, so there is no performance impact when running together.

Page 14:

Running two inference servers

• Single-GPU V100 system

• Models:
  • resnet50
  • inception

• Batch size 4

Page 15:

Performance with two inference servers

[Chart annotations: no QoS control; performance overhead]

Page 16:

NVIDIA inference server

• Each model is implemented as a GPU stream and can run concurrently on the GPU

• Better control with a scheduler

Page 17:

NVIDIA inference server

Page 18:

1-server vs. 2-server performance comparison

Page 19:

Agenda

• GPU-accelerated inference service pool: multi-tenant

• GPU resource pool over Ethernet

Page 20:

Why?

[Images: DGX-2, DGX-1, 1U GPU server]

• Each AI application has a different CPU/GPU ratio

• Many types of GPU servers

• Single GPU cards and GPU servers are becoming so powerful that they cannot be fully utilized

Page 21:

Distributed storage systems: a big success

• Separated CPU (client) and storage (chunk server)

• Stable
• High performance
• Easy to maintain
• Low cost

Page 22:

Distributed acceleration

[Diagram: physical machines, virtual machines, and containers in a CPU pool (without GPUs) connect over Ethernet (HTTP/gRPC/RDMA/IB) to an acceleration pool (GPU pool, distributed computing pool) and a storage pool (HDD/SSD/NVMe, distributed storage pool)]

Page 23:

GPU pool over Ethernet

[Diagram: on the client, a CUDA application such as TensorFlow (GPU version) calls proxies for the CUDA Runtime, CUDA Libraries (cuDNN/cuBLAS), and CUDA Driver; a communicator forwards the calls over TCP/RDMA to the server's communicator and API service, which drives the real CUDA Runtime/Library/Driver on the GPU]

• Components can be shared by GPU/FPGA/etc.

• Components can be generated automatically with minimal manual work

Page 24:

Challenge 1: a lot of CUDA APIs

• An intermediate, auto-generated YAML file defines the APIs that need to be forwarded to the remote server

• Some information on pointer arguments must be added manually
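The spec-driven forwarding idea can be sketched in a few lines. This is hypothetical: a plain dict stands in for the auto-generated YAML file, and a local function stands in for the TCP/RDMA transport to the remote GPU server; every name here (API_SPEC, make_stub, toy_server) is ours.

```python
# A stand-in for the auto-generated YAML API specification.
API_SPEC = {
    "cudaMalloc": {"args": ["size"]},
    "cudaFree":   {"args": ["ptr"]},
}

def make_stub(api_name, transport):
    """Generate one client-side proxy function from its spec entry."""
    spec = API_SPEC[api_name]

    def stub(*args):
        if len(args) != len(spec["args"]):
            raise TypeError(f"{api_name} expects {spec['args']}")
        # Serialize the call and forward it to the server side.
        return transport({"api": api_name, "args": list(args)})

    stub.__name__ = api_name
    return stub

def toy_server(request):
    """Stand-in for the remote API service backed by the real CUDA stack."""
    if request["api"] == "cudaMalloc":
        return {"status": 0, "ptr": 0x1000}
    return {"status": 0}

cudaMalloc = make_stub("cudaMalloc", toy_server)
print(cudaMalloc(4096))
```

Because the stubs are generated from the spec rather than hand-written, covering the large CUDA API surface reduces to maintaining the spec file, which is exactly why the pointer-argument annotations are the remaining manual work.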

Page 25:

Challenge 2: host memory

• With a remote GPU, we need to detect host memory (e.g., allocated with cudaMallocHost) in CUDA kernel parameters to ensure functional correctness
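One plausible way to do that detection, sketched here as a hypothetical client-side registry (the class and method names are ours): record every region returned by cudaMallocHost, so that at kernel launch the proxy can recognize host-memory arguments and ship their contents to the remote server instead of forwarding a pointer that is meaningless there.

```python
class HostMemRegistry:
    """Track host allocations so kernel arguments can be classified."""

    def __init__(self):
        self._regions = {}  # base address -> size in bytes

    def on_malloc_host(self, base, size):
        # Called when the cudaMallocHost proxy returns a region.
        self._regions[base] = size

    def on_free_host(self, base):
        # Called when the cudaFreeHost proxy releases it.
        self._regions.pop(base, None)

    def find(self, addr):
        """Return (base, size) of the host allocation containing addr,
        or None if addr is not registered host memory."""
        for base, size in self._regions.items():
            if base <= addr < base + size:
                return (base, size)
        return None

reg = HostMemRegistry()
reg.on_malloc_host(0x1000, 256)
print(reg.find(0x1010))  # (4096, 256) -- an address inside the region
```

A lookup miss means the argument is a device pointer (or a scalar) and can be forwarded as-is; a hit means the region's bytes must travel with the call.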

Page 26:

Challenge 3: hide API forwarding latency

• Change blocking API calls to asynchronous calls

• Benefit: 2x speedup

[Diagram: before vs. after; API run time in the client overlapping API run time in the server]
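A simplified latency model (our own, not from the slides) shows why overlapping forwarding with execution approaches the stated 2x: when send time and execution time are comparable, pipelining hides one of the two.

```python
def blocking_total(calls):
    """Client sends each call, waits for the server to finish,
    then issues the next: send and execute times add up."""
    return sum(send + execute for send, execute in calls)

def async_total(calls):
    """Asynchronous forwarding: the client keeps issuing calls while
    the server executes earlier ones, so only the first send sits on
    the critical path (assumes each send fully overlaps with the
    previous execution)."""
    first_send = calls[0][0]
    return first_send + sum(execute for _, execute in calls)

# 10 calls, each 1 ms to forward and 1 ms to execute on the GPU
calls = [(1.0, 1.0)] * 10
print(blocking_total(calls))  # 20.0
print(async_total(calls))     # 11.0 -> close to the claimed 2x
```

When execution dominates (or sending dominates), the achievable speedup shrinks, since the pipeline is bottlenecked by the longer stage; 2x is the balanced best case in this model.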

Page 27:

• Website: https://aimatrix.ai

• Code repo: https://github.com/alibaba/ai-matrix

Page 28:

CNN Training Performance


Page 29:

Summary

• GPU-accelerated inference service pool: multi-tenant

• GPU resource pool over Ethernet

• All results are preliminary and we are actively working on them

• Discussions and collaborations are welcome

• Email: [email protected]

Page 30:

Thanks! Q&A