Kube-Knots: Resource Harvesting through Dynamic Container Orchestration
in GPU-based Datacenters
Prashanth Thinakaran, Jashwant Raj Gunasekaran, Bikash Sharma, Chita Das, Mahmut Kandemir
September 25th, IEEE CLUSTER’19
Motivation

[Figure: Compute used for AI training over time,¹ spanning the pre-GPU era, sub-petaflop (sub-PF) GPU training, and the era of algorithmic parallelism & TPUs.]

¹ https://openai.com/blog/ai-and-compute/
² Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019).
• Compute demands for DNN training are increasing rapidly.
• Modern GPGPUs bridge the compute gap, delivering ~10 TFLOPS each.
• Yet average GPU utilization efficiency is only about 33%.

Most prior effort has gone into improving accuracy, not resource efficiency!

Kube-Knots focuses on Green AI (efficiency)² instead of Red AI (accuracy).
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Energy Proportionality
Need for GPU bin-packing
• CPUs operate at peak efficiency under average-load conditions.
• GPUs, in contrast, scale performance per watt almost linearly with load.
• It is therefore crucial to bin-pack GPUs and drive them to 100% utilization.
• Consider a real datacenter scenario!
Alibaba: Study of Over-commitment
• Average CPU utilization: ~47%
• Average memory utilization: ~76%
• Half of the scheduled containers consume less than 45% of their requested memory.
• Containers are provisioned for peak utilization in datacenters.
• An under-utilization epidemic!
Harvesting spare compute and memory
Under-utilization calls for resource harvesting at the cluster scheduler level
CPUs vs GPUs
• CPUs have mature Docker/hypervisor layers for efficient resource management.
• Enforcing bin-packing is the well-known solution.
• GPUs have limited support for virtualization.
• Context-switch overheads (VIPT vs. VIVT caches).
• Utilization-agnostic scheduling leads to QoS violations.
• Energy-proportional scheduling calls for a novel approach.
Workload heterogeneity
• Two different types of workloads run in GPU-based datacenters:
• Batch workloads: HPC, DL training, etc.
• Long-running: typically hours to days
• Latency-sensitive workloads: DL inference, etc.
• Short-lived: milliseconds to a few seconds
How to Harvest Spare Cycles
We can conservatively provision for only the average-case utilization, about 80% of what pods ask for!
But when peaks occur, how do we resize pods back up?
Are there any early markers that let us harvest spare cycles?
Correlation of resource metrics: Alibaba
[Figure: Correlation of resource-utilization metrics in the Alibaba trace. Batch/long-running workloads show predictable load over time and tightly correlated metrics; latency-sensitive workloads give no solid leads.]
Opportunities for harvesting in batch
• Phase changes are predictable.
• I/O peaks are followed by memory peaks.
• Average consumption is low compared to the peaks.
• Provisioning for the peak leads to over-commitment.
TensorFlow Inference on GPUs
[Figure: % GPU memory used by TensorFlow inference across batch sizes 1-128 for the Djinn & Tonic applications.]
• Inference queries are latency-sensitive, ~200 ms.
• Each consumes < 10% of the GPU.
• With batching, this can be pushed up to 30%.
• When run inside TensorFlow, the GPU memory it reserves usually cannot be harvested (a sketch follows).
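One reason the memory cannot be harvested is that TensorFlow's default allocator reserves nearly all GPU memory at startup, regardless of what the model actually uses, so the GPU looks "full" to any external observer even at under 10% real use. A minimal sketch of the standard tf.config knob that relaxes this; it is shown only to illustrate the problem, not as part of Kube-Knots itself:

```python
# Sketch: TensorFlow grabs (almost) all GPU memory up front by default, so a
# scheduler sees the GPU as fully occupied even for a small inference model.
# Memory-growth mode makes the reservation grow on demand instead.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    # Allocate GPU memory incrementally rather than reserving it all at startup.
    tf.config.experimental.set_memory_growth(gpu, True)
```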
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Cluster-level workload setup
• Eight Rodinia (HPC) GPU applications
• Batch and long-running tasks
• Djinn and Tonic suite's DNN inference queries
• Face recognition, keypoint detection, speech recognition
• We characterize the applications and group them into three bins (see the sketch after the figure).
• We plot the coefficient of variation (COV) of GPU utilization:
• COV <= 1: static load with little variation
• COV > 1: heavy-tailed, highly varying load
[Figure: COV of GPU utilization for App-Mix-1, App-Mix-2, and App-Mix-3.]
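As a rough illustration of the binning step, a minimal sketch; the sample traces and the threshold placement on exactly 1.0 follow the slide, while everything else (function name, windowing) is an assumption for illustration:

```python
# Sketch: classify an application by the coefficient of variation (COV) of
# its sampled GPU utilization. COV = std / mean; COV <= 1 indicates a fairly
# static load, COV > 1 a heavy-tailed, highly varying one.
import numpy as np

def cov_bin(gpu_util_samples):
    samples = np.asarray(gpu_util_samples, dtype=float)
    cov = samples.std() / samples.mean()
    return ("static" if cov <= 1.0 else "heavy-tailed"), cov

# Example: a steady HPC kernel vs. a bursty inference trace.
print(cov_bin([70, 72, 69, 71, 70]))        # ('static', ~0.01)
print(cov_bin([2, 1, 95, 2, 1, 90, 2, 1]))  # ('heavy-tailed', ~1.6)
```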
Baseline GPU Agnostic Scheduler
• An ideal scheduler would strive to improve GPU utilization across all percentiles.
• With high-COV mixes, cluster utilization is not stable.
• Applications have varying resource needs throughout their execution.
• Keeping a GPU cluster busy throughout depends on the COV mix.
• A GPU-agnostic scheduler leads to QoS violations due to load imbalance.

[Figure: Baseline utilization under the GPU-agnostic scheduler for App-Mix-1, App-Mix-2, and App-Mix-3.]
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Kube-Knots Design
[Figure: Kube-Knots architecture. Knots exposes real-time GPU utilization from each node to the Kubernetes scheduler.]
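The design slide is figure-only. As a loose illustration of the kind of per-GPU telemetry a Knots-like agent could expose (the paper describes Knots surfacing real-time GPU utilization to Kubernetes; the use of pynvml here is an assumption, not Kube-Knots' actual collection path):

```python
# Sketch: gather per-GPU utilization that a node agent could export to the
# cluster scheduler. pynvml (NVIDIA Management Library bindings) is assumed
# for illustration only.
import pynvml

def snapshot_gpu_stats():
    pynvml.nvmlInit()
    stats = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % SM busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            stats.append({
                "gpu": i,
                "sm_util_pct": util.gpu,
                "mem_used_pct": 100.0 * mem.used / mem.total,
            })
    finally:
        pynvml.nvmlShutdown()
    return stats
```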
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
Correlation Based Provisioning
• The correlation between utilization metrics is considered for application placement.
• Two pods whose memory utilizations are positively correlated are not co-located on the same GPU.
• Pods are always resized to their average utilization, not their peak utilization.
• GPUs remain under-utilized due to static provisioning.
• QoS violations arise from pending pods, since most of them contend for the same resource (positive correlation). A placement sketch follows.
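A minimal sketch of the correlation check at the heart of CBP. Only the rule itself (do not co-locate pods whose memory-utilization histories are positively correlated) comes from the slide; the helper names and sample traces are assumptions:

```python
# Sketch: Correlation Based Provisioning's placement rule. A candidate pod is
# placed on a GPU only if its memory-utilization history is NOT positively
# correlated with any pod already running there.
import numpy as np

def can_colocate(candidate_history, resident_histories):
    """Reject the GPU if the candidate's memory utilization positively
    correlates with any pod already resident on it."""
    for resident in resident_histories:
        r = np.corrcoef(candidate_history, resident)[0, 1]  # Pearson r
        if r > 0:
            return False
    return True

# Example: two pods that peak together should land on different GPUs.
a = [10, 20, 80, 20, 10, 75, 15]
b = [12, 25, 85, 22, 12, 70, 18]   # peaks with `a`: positively correlated
c = [80, 70, 10, 70, 85, 12, 75]   # anti-correlated with `a`
print(can_colocate(a, [b]))  # False: same-time peaks, keep apart
print(can_colocate(a, [c]))  # True: peaks interleave, safe to pack
```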
Peak Prediction Scheduler
• PP allows two positively correlating pods to be placed on the same GPU.
• PP is built on the first principle that resource peaks do not occur at the same time for all co-located apps.
• PP uses ARIMA to predict peak utilization and resize the pods accordingly.
• The autocorrelation function predicts the subsequent resource-demand trend:

r_k = \frac{\sum_{i=1}^{n-k} (y_i - \bar{y})(y_{i+k} - \bar{y})}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

• where n is the total number of events and ȳ is the moving average.
• When the r value is > 0, we use ARIMA to forecast the resource utilization. A sketch of this loop follows.
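A minimal sketch of that loop under stated assumptions: statsmodels supplies the ACF and ARIMA implementations, and the history, forecast horizon, and ARIMA order here are illustrative rather than Kube-Knots' tuned values:

```python
# Sketch of the Peak Prediction (PP) idea: if a pod's recent GPU utilization
# is positively autocorrelated (r > 0), forecast the next window with ARIMA
# and resize the pod to the predicted peak instead of its static ask.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf

def predicted_peak(util_history, horizon=6, order=(2, 1, 0)):
    """Return the forecast peak utilization over the next `horizon` samples,
    or None if the series shows no positive autocorrelation to exploit."""
    y = np.asarray(util_history, dtype=float)
    r = acf(y, nlags=1, fft=False)[1]   # lag-1 autocorrelation r
    if r <= 0:                          # no solid lead: keep the static ask
        return None
    forecast = ARIMA(y, order=order).fit().forecast(steps=horizon)
    return float(np.max(forecast))      # provision the pod for this peak

# Example: a pod oscillating between idle and bursty phases.
history = [10, 12, 11, 35, 60, 58, 20, 12, 11, 34, 62, 59, 21, 13]
peak = predicted_peak(history)
if peak is not None:
    print(f"Resize pod GPU limit to ~{peak:.0f}% ahead of the predicted peak")
```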
Outline
• Need for GPU resource harvesting
• Cluster workload setup
• Kube-Knots architecture
• Correlation Based Provisioning and Peak Prediction
• Results - Real system & Scalability study
• Conclusion
CBP+PP Utilization Improvements
• CBP+PP consolidates load effectively under high and medium loads compared to the GPU-agnostic scheduler:
• 62% improvement in average utilization
• 80% improvement at the median and 99th percentile
• Under low, sporadic load, CBP+PP effectively consolidates load onto the active GPUs.
• GPU nodes 1, 4, 8, and 10 are minimally used, for power efficiency.

[Figure: Cluster-wide GPU utilization under CBP+PP for App-Mix-1, App-Mix-2, and App-Mix-3.]
GPU Utilization Breakdown
• CBP+PP consistently improved utilization in all cases.
• By up to 80% at the median and the tail.
• Under low-load scenarios the scope for improvement is small.
• Even so, CBP+PP improved the average case.

[Figure: GPU utilization breakdown for App-Mix-1, App-Mix-2, and App-Mix-3.]
Power & QoS Improvements
• Res-Ag consumes the least power, with average savings of 33%.
• But it violates QoS for 53% of requests.
• PP consumes 10% more power than Res-Ag.
• Yet it ensures QoS for almost 100% of requests.
• CBP+PP can ensure QoS by predicting GPU resource peaks.
• Further power savings come from consolidation onto the active GPUs.
Scalability of CBP+PP in case of DL
• Deep learning training (DLT) and inference (DLI) workload mixes.
• 60% faster median JCT compared to DL-aware schedulers:
• 30% better than Gandiva
• 11% better than Tiresias
• QoS guarantees for DLI in the presence of DLT.
• Reduced QoS violations thanks to GPU-utilization-aware placement.
Conclusion
• There is a clear need for resource harvesting in GPU datacenters.
• Knots exposes real-time GPU utilization to Kubernetes.
• The CBP+PP scheduler improved GPU utilization by up to 80% for both average and tail-case utilization.
• QoS-aware workload consolidation led to 33% energy savings.
• Trace-driven scalability experiments show that Kube-Knots performs 36% better in terms of JCT compared to DLT schedulers.
• Kube-Knots also reduced overall QoS violations by up to 53%.
http://www.cse.psu.edu/hpcl/index.html
September 25th, IEEE CLUSTER’19
“Workload Setup: Docker TensorFlow / HPC experiments used in the evaluation of Kube-Knots,”
https://hub.docker.com/r/prashanth5192/gpu
Backup-1: Cluster Status COV
• COV of load across different GPUs.
• Now in the 0 to 0.2 range, effectively reduced from 0.1 to 0.7.
• PP performs load balancing even under high-load scenarios.
• PP also harvests and consolidates under low load by keeping idle GPUs in the low-power P-state P12.
Difference Table
• Uniform (Kubernetes default scheduler): GPUs cannot be shared; low PPW (performance per watt) and no QoS guarantees.
• Resource Agnostic Sharing (First-Fit Decreasing bin-packing): high PPW; poor QoS and high queueing delays.
• Correlation Based Provisioning (utilization-metric-based bin-packing): high PPW; assured QoS but high queueing delays due to affinity constraints.
• Peak Prediction (predicts the resource peaks of co-scheduled apps via the autocorrelation factor): high PPW and assured QoS guarantees.