Kube-Knots: Resource Harvesting through Dynamic Container Orchestration
in GPU-based Datacenters
Prashanth Thinakaran, Jashwant Raj Gunasekaran, Bikash Sharma, Chita Das, Mahmut Kandemir
September 25th, IEEE CLUSTER’19
Motivation

[Figure: Compute used for AI training over time,¹ spanning the pre-GPU era, sub-petaflop (sub-PF) GPU training, and the era of algorithmic parallelism & TPUs.]

¹ https://openai.com/blog/ai-and-compute/
² Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019).
• Compute demands for DNN training are increasing rapidly.
• Modern GPGPUs bridge the compute gap, delivering ~10 TFLOPS each.
• Yet average GPU utilization efficiency is only about 33%.

Most prior effort has gone into improving accuracy, not resource efficiency!

Kube-Knots focuses on Green AI (efficiency)² instead of Red AI (accuracy).
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Energy Proportionality
Need for GPU bin-packing
• CPUs operate at peak efficiency under average-load conditions.
• GPUs, in contrast, scale performance per watt almost linearly with load.
• It is therefore crucial to bin-pack GPUs and drive them to 100% utilization.
• Consider a real datacenter scenario!
Alibaba: Study of Over-commitment
• Average CPU utilization: ~47%
• Average memory utilization: ~76%
• Half of the scheduled containers consume less than 45% of their requested memory.
• Containers are provisioned for peak utilization in datacenters.
• An under-utilization epidemic!
Harvesting spare compute and memory
Under-utilization calls for resource harvesting at the cluster scheduler level
CPUs vs GPUs
• CPUs have mature Docker/hypervisor layers for efficient resource management.
• Enforcing bin-packing is the well-known solution.
• GPUs have limited support for virtualization.
• Context-switch overheads (VIPT vs. VIVT caches).
• Utilization-agnostic scheduling leads to QoS violations.
• Energy-proportional scheduling calls for a novel approach.
Workload heterogeneity
• Two different types of workloads run in GPU-based datacenters:
• Batch workloads: HPC, DL training, etc.
• Long-running: typically hours to days
• Latency-sensitive workloads: DL inference, etc.
• Short-lived: milliseconds to a few seconds
How to Harvest Spare Cycles
We can conservatively provision for only the average-case utilization, about 80% of what pods ask for!
But when peaks occur, how do we resize pods back up?
Are there any early markers that let us harvest spare cycles?
Correlation of resource metrics: Alibaba
[Figure: Correlation of resource-utilization metrics in the Alibaba trace. Batch/long-running workloads show predictable load over time and tightly correlated metrics; latency-sensitive workloads give no solid leads.]
Opportunities for harvesting in batch
• Phase changes are predictable.
• I/O peaks are followed by memory peaks.
• Average consumption is low compared to the peaks.
• Provisioning for the peak leads to over-commitment.
TensorFlow Inference on GPUs
[Figure: % GPU memory used by TensorFlow inference across batch sizes 1-128 for the Djinn & Tonic applications.]
• Inference queries are latency-sensitive, ~200 ms.
• Each consumes < 10% of the GPU.
• With batching, this can be pushed up to 30%.
• When run inside TensorFlow, the GPU memory it reserves usually cannot be harvested (a sketch follows).
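One reason the memory cannot be harvested is that TensorFlow's default allocator reserves nearly all GPU memory at startup, regardless of what the model actually uses, so the GPU looks "full" to any external observer even at under 10% real use. A minimal sketch of the standard tf.config knob that relaxes this; it is shown only to illustrate the problem, not as part of Kube-Knots itself:

```python
# Sketch: TensorFlow grabs (almost) all GPU memory up front by default, so a
# scheduler sees the GPU as fully occupied even for a small inference model.
# Memory-growth mode makes the reservation grow on demand instead.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    # Allocate GPU memory incrementally rather than reserving it all at startup.
    tf.config.experimental.set_memory_growth(gpu, True)
```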
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Cluster-level workload setup
• Eight Rodinia (HPC) GPU applications
• Batch and long-running tasks
• Djinn and Tonic suite's DNN inference queries
• Face recognition, keypoint detection, speech recognition
• We characterize the applications and group them into three bins (see the sketch after the figure).
• We plot the coefficient of variation (COV) of GPU utilization:
• COV <= 1: static load with little variation
• COV > 1: heavy-tailed, highly varying load
[Figure: COV of GPU utilization for App-Mix-1, App-Mix-2, and App-Mix-3.]
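As a rough illustration of the binning step, a minimal sketch; the sample traces and the threshold placement on exactly 1.0 follow the slide, while everything else (function name, windowing) is an assumption for illustration:

```python
# Sketch: classify an application by the coefficient of variation (COV) of
# its sampled GPU utilization. COV = std / mean; COV <= 1 indicates a fairly
# static load, COV > 1 a heavy-tailed, highly varying one.
import numpy as np

def cov_bin(gpu_util_samples):
    samples = np.asarray(gpu_util_samples, dtype=float)
    cov = samples.std() / samples.mean()
    return ("static" if cov <= 1.0 else "heavy-tailed"), cov

# Example: a steady HPC kernel vs. a bursty inference trace.
print(cov_bin([70, 72, 69, 71, 70]))        # ('static', ~0.01)
print(cov_bin([2, 1, 95, 2, 1, 90, 2, 1]))  # ('heavy-tailed', ~1.6)
```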
Baseline GPU Agnostic Scheduler
• An ideal scheduler would strive to improve GPU utilization across all percentiles.
• With high-COV mixes, cluster utilization is not stable.
• Applications have varying resource needs throughout their execution.
• Keeping a GPU cluster busy throughout depends on the COV mix.
• A GPU-agnostic scheduler leads to QoS violations due to load imbalance.

[Figure: Baseline utilization under the GPU-agnostic scheduler for App-Mix-1, App-Mix-2, and App-Mix-3.]
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Kube-Knots Design
[Figure: Kube-Knots architecture. Knots exposes real-time GPU utilization from each node to the Kubernetes scheduler.]
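The design slide is figure-only. As a loose illustration of the kind of per-GPU telemetry a Knots-like agent could expose (the paper describes Knots surfacing real-time GPU utilization to Kubernetes; the use of pynvml here is an assumption, not Kube-Knots' actual collection path):

```python
# Sketch: gather per-GPU utilization that a node agent could export to the
# cluster scheduler. pynvml (NVIDIA Management Library bindings) is assumed
# for illustration only.
import pynvml

def snapshot_gpu_stats():
    pynvml.nvmlInit()
    stats = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % SM busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            stats.append({
                "gpu": i,
                "sm_util_pct": util.gpu,
                "mem_used_pct": 100.0 * mem.used / mem.total,
            })
    finally:
        pynvml.nvmlShutdown()
    return stats
```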
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Correlation Based Provisioning
• The correlation between utilization metrics is considered for application placement.
• Two pods whose memory utilizations are positively correlated are not co-located on the same GPU.
• Pods are always resized to their average utilization, not their peak utilization.
• GPUs remain under-utilized due to static provisioning.
• QoS violations arise from pending pods, since most of them contend for the same resource (positive correlation). A placement sketch follows.
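A minimal sketch of the correlation check at the heart of CBP. Only the rule itself (do not co-locate pods whose memory-utilization histories are positively correlated) comes from the slide; the helper names and sample traces are assumptions:

```python
# Sketch: Correlation Based Provisioning's placement rule. A candidate pod is
# placed on a GPU only if its memory-utilization history is NOT positively
# correlated with any pod already running there.
import numpy as np

def can_colocate(candidate_history, resident_histories):
    """Reject the GPU if the candidate's memory utilization positively
    correlates with any pod already resident on it."""
    for resident in resident_histories:
        r = np.corrcoef(candidate_history, resident)[0, 1]  # Pearson r
        if r > 0:
            return False
    return True

# Example: two pods that peak together should land on different GPUs.
a = [10, 20, 80, 20, 10, 75, 15]
b = [12, 25, 85, 22, 12, 70, 18]   # peaks with `a`: positively correlated
c = [80, 70, 10, 70, 85, 12, 75]   # anti-correlated with `a`
print(can_colocate(a, [b]))  # False: same-time peaks, keep apart
print(can_colocate(a, [c]))  # True: peaks interleave, safe to pack
```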
Peak Prediction Scheduler
• PP allows two positively correlating pods to be placed on the same GPU.
• PP is built on the first principle that resource peaks do not occur at the same time for all co-located apps.
• PP uses ARIMA to predict peak utilization and resize the pods accordingly.
• The autocorrelation function predicts the subsequent resource-demand trend:

r_k = \frac{\sum_{i=1}^{n-k} (y_i - \bar{y})(y_{i+k} - \bar{y})}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

• where n is the total number of events and ȳ is the moving average.
• When the r value is > 0, we use ARIMA to forecast the resource utilization. A sketch of this loop follows.
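A minimal sketch of that loop under stated assumptions: statsmodels supplies the ACF and ARIMA implementations, and the history, forecast horizon, and ARIMA order here are illustrative rather than Kube-Knots' tuned values:

```python
# Sketch of the Peak Prediction (PP) idea: if a pod's recent GPU utilization
# is positively autocorrelated (r > 0), forecast the next window with ARIMA
# and resize the pod to the predicted peak instead of its static ask.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf

def predicted_peak(util_history, horizon=6, order=(2, 1, 0)):
    """Return the forecast peak utilization over the next `horizon` samples,
    or None if the series shows no positive autocorrelation to exploit."""
    y = np.asarray(util_history, dtype=float)
    r = acf(y, nlags=1, fft=False)[1]   # lag-1 autocorrelation r
    if r <= 0:                          # no solid lead: keep the static ask
        return None
    forecast = ARIMA(y, order=order).fit().forecast(steps=horizon)
    return float(np.max(forecast))      # provision the pod for this peak

# Example: a pod oscillating between idle and bursty phases.
history = [10, 12, 11, 35, 60, 58, 20, 12, 11, 34, 62, 59, 21, 13]
peak = predicted_peak(history)
if peak is not None:
    print(f"Resize pod GPU limit to ~{peak:.0f}% ahead of the predicted peak")
```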
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
CBP+PP Utilization Improvements
• CBP+PP consolidates load effectively under high and medium loads compared to the GPU-agnostic scheduler:
• 62% improvement in average utilization
• 80% improvement at the median and 99th percentile
• Under low, sporadic load, CBP+PP effectively consolidates load onto the active GPUs.
• GPU nodes 1, 4, 8, and 10 are minimally used, for power efficiency.

[Figure: Cluster-wide GPU utilization under CBP+PP for App-Mix-1, App-Mix-2, and App-Mix-3.]
GPU Utilization Breakdown
• CBP+PP consistently improved utilization in all cases.
• By up to 80% at the median and the tail.
• Under low-load scenarios the scope for improvement is small.
• Even so, CBP+PP improved the average case.

[Figure: GPU utilization breakdown for App-Mix-1, App-Mix-2, and App-Mix-3.]
Power & QoS Improvements
• Res-Ag consumes the least power, with average savings of 33%.
• But it violates QoS for 53% of requests.
• PP consumes 10% more power than Res-Ag.
• Yet it ensures QoS for almost 100% of requests.
• CBP+PP can ensure QoS by predicting GPU resource peaks.
• Further power savings come from consolidation onto the active GPUs.
Scalability of CBP+PP in case of DL
• Deep learning training (DLT) and inference (DLI) workload mixes.
• 60% faster median JCT compared to DL-aware schedulers:
• 30% better than Gandiva
• 11% better than Tiresias
• QoS guarantees for DLI in the presence of DLT.
• Reduced QoS violations thanks to GPU-utilization-aware placement.
Conclusion
• There is a clear need for resource harvesting in GPU datacenters.
• Knots exposes real-time GPU utilization to Kubernetes.
• The CBP+PP scheduler improved GPU utilization by up to 80% for both average and tail-case utilization.
• QoS-aware workload consolidation led to 33% energy savings.
• Trace-driven scalability experiments show that Kube-Knots performs 36% better in terms of JCT compared to DLT schedulers.
• Kube-Knots also reduced overall QoS violations by up to 53%.
http://www.cse.psu.edu/hpcl/index.html
September 25th, IEEE CLUSTER’19
“Workload Setup: Docker TensorFlow / HPC experiments used in the evaluation of Kube-Knots,”
https://hub.docker.com/r/prashanth5192/gpu
Backup-1: Cluster Status COV
• COV of load across different GPUs.
• Now in the 0 to 0.2 range, effectively reduced from 0.1 to 0.7.
• PP performs load balancing even under high-load scenarios.
• PP also harvests and consolidates under low load by keeping idle GPUs in the low-power P-state P12.
Difference Table
• Uniform (Kubernetes default scheduler): GPUs cannot be shared; low PPW (performance per watt) and no QoS guarantees.
• Resource Agnostic Sharing (First-Fit Decreasing bin-packing): high PPW; poor QoS and high queueing delays.
• Correlation Based Provisioning (utilization-metric-based bin-packing): high PPW; assured QoS but high queueing delays due to affinity constraints.
• Peak Prediction (predicts the resource peaks of co-scheduled apps via the autocorrelation factor): high PPW and assured QoS guarantees.