Heracles: Improving Resource Efficiency at Scale
David Lo†, Liqun Cheng*, Rama Govindaraju*, Parthasarathy Ranganathan*, Christos Kozyrakis†
† Stanford University  * Google Inc.
© 2012 Google Inc. All rights reserved. Google and the Google Logo are registered trademarks of Google Inc.
The case for oversubscription
[Figures: diurnal load variation; datacenter total cost of ownership breakdown]
Heracles: Improving Resource Efficiency at Scale (ISCA-42 June 16, 2015) 2
TCO breakdown: Servers 61%, Energy 16%, Cooling 14%, Networking 6%, Other 3%
[J. Hamilton, http://mvdirona.com]
Idleness in latency-critical workloads: a bigger opportunity than the power savings targeted by PEGASUS [ISCA'14]
Oversubscription summary
Motivation: fill in idle cycles with useful work
How: co-locate Latency Critical (LC) and Best Effort (BE) jobs
Plenty of analytics jobs available, such as deep learning training
Challenges of oversubscription
Allocation of shared resources between LC and BE
Interference on shared resources
DRAM
LLC
Cores
Network
Power
Difficult to guarantee quality of service (QoS)
How bad can interference get?
Quick experiment with a latency critical job and a batch job
The latency critical job: Google websearch
The batch job: deep learning classifier
The setup:
Run batch job at very low priority to fill in idle CPU cycles
Hope that the Linux scheduler is sufficient for QoS
How bad can interference get?
[Figure: websearch latency vs. load with the batch job co-scheduled; latency exceeds the SLO line at every load]
Cannot co-locate the workloads at any load!
Interference is different based on resource
Impact of interference on websearch's latency (as % of SLO), by antagonist resource and websearch load:

Resource      10%     20%     30%     40%     50%     60%     70%     80%     90%
LLC           >300%   >300%   >300%   >300%   >300%   >300%   >300%   264%    123%
DRAM          >300%   >300%   >300%   >300%   >300%   >300%   >300%   270%    122%
HyperThread   110%    107%    114%    115%    105%    117%    120%    136%    >300%
CPU power     124%    107%    116%    109%    115%    105%    101%    100%    100%
Network       36%     36%     37%     37%     39%     42%     48%     55%     64%

(Color scale: up to 100% of SLO = OK; above 100% = not OK)
No oversubscription possible
Interference is different based on resource
[Same table as the previous slide]
Need to manage MULTIPLE resources with a DYNAMIC controller
Oversubscription appears to be too hard
Google: ~30% avg. utilization [Barroso'09]; Twitter: ~20% avg. utilization [Delimitrou'14]
Even with cluster managers and lots of available jobs
Caused by fear of interference
Heracles: low latency and high utilization
Insights:
Use iso-latency to tolerate some interference
Fine-grained isolation on all shared resources to mitigate the rest
Implementation:
Dynamic controller to manage shared resource allocations
Evaluated on Google workloads: high utilization without QoS violations
What is iso-latency?
[Figure: websearch overall query latency vs. % of maximum cluster load, with the SLO latency line marked]
Can hide interference in this slack!
Fine-grained resource isolation mechanisms
CPU (HyperThread/L1/L2)
Use Linux cpuset cgroups to partition cores between LC and BE jobs
Single core granularity (Haswell has up to 18 cores)
~1ms response time
Example partitioning setup: the cores are split into an LC cpuset and a BE cpuset
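The core split can be sketched as a tiny helper that computes the two cpuset ranges. The function name and the static low/high split are illustrative; a real deployment would write the resulting strings into the `cpuset.cpus` files of the two cgroups:

```python
# Sketch: compute cpuset "cpus" strings for LC and BE partitions.
# Hypothetical helper; real use would write these strings to
# /sys/fs/cgroup/cpuset/<group>/cpuset.cpus.

def cpuset_strings(total_cores: int, be_cores: int) -> tuple[str, str]:
    """Give BE the highest-numbered cores, LC the rest."""
    assert 0 < be_cores < total_cores
    lc = f"0-{total_cores - be_cores - 1}"
    be = f"{total_cores - be_cores}-{total_cores - 1}"
    return lc, be

lc, be = cpuset_strings(18, 4)   # 18-core Haswell, 4 cores for BE
print(lc, be)                    # → 0-13 14-17
```

Adjusting the split at runtime is then just rewriting the two strings, which is what gives the ~1ms response time.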
Fine-grained resource isolation mechanisms
LLC
Hardware cache partitioning in latest Haswell Xeon
Partitioning by cache way (20 ways in Haswell)
<1ms adjustment latency
Example partitioning setup: the cache ways are split into an LC partition and a BE partition
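Way partitioning can be illustrated with contiguous capacity bitmasks of the kind cache allocation hardware expects. `cat_masks` is a hypothetical helper; real masks would be programmed through the resctrl filesystem or MSRs:

```python
# Sketch: build contiguous, non-overlapping capacity bitmasks for
# LLC way partitioning. Assumes 20 ways, as on the Haswell Xeon
# mentioned in the talk.

def cat_masks(total_ways: int, lc_ways: int) -> tuple[int, int]:
    """LC gets the low lc_ways ways, BE the remaining high ways."""
    assert 0 < lc_ways < total_ways
    lc_mask = (1 << lc_ways) - 1                 # e.g. 0b...0011111111111111
    be_mask = ((1 << total_ways) - 1) ^ lc_mask  # the complementary high ways
    return lc_mask, be_mask

lc, be = cat_masks(20, 14)
print(hex(lc), hex(be))  # → 0x3fff 0xfc000
```

Growing or shrinking the BE partition is one mask rewrite, which is why the adjustment latency is under a millisecond.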
Fine-grained resource isolation mechanisms
Network
Transmit rate limiting in Linux kernel with hierarchical token bucket
Fine-grained rate limits, down to 1 Mbps
~1ms response time
[Diagram: BE and LC packets go to separate queues; a packet scheduler drains both to the NIC and rate-limits only the BE flows]
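The token-bucket idea behind HTB can be sketched as follows; the class and the 1 Mbps / 1500-byte numbers are illustrative, not the kernel's implementation:

```python
# Sketch of the token-bucket idea applied to BE traffic: a bucket
# rate-limits BE bytes while LC packets would bypass it entirely.

class TokenBucket:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8   # refill rate in bytes per second
        self.tokens = burst_bytes  # start with a full burst allowance
        self.burst = burst_bytes

    def tick(self, dt: float):
        """Refill tokens for dt seconds, capped at the burst size."""
        self.tokens = min(self.burst, self.tokens + self.rate * dt)

    def try_send(self, nbytes: int) -> bool:
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False  # BE packet stays queued

bucket = TokenBucket(rate_bps=1_000_000, burst_bytes=1500)  # 1 Mbps floor
assert bucket.try_send(1500)       # burst allowance covers one MTU packet
assert not bucket.try_send(1500)   # bucket drained, packet must wait
bucket.tick(0.012)                 # 12 ms at 1 Mbps refills 1500 bytes
assert bucket.try_send(1500)
```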
Fine-grained resource isolation mechanisms
CPU power
Per-core DVFS to ensure minimum Turbo frequency for LC workload
Can change clock frequency in increments of 100MHz
<1 ms response of hardware
Example: LC cores pinned at 3.0 GHz while BE cores are stepped down to 2.0 GHz.
Shift power from BE to LC cores to maintain the guaranteed LC frequency.
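The power-shifting policy can be sketched as a toy rebalance step. `rebalance`, the 1.2 GHz floor, and the power numbers are all assumptions for illustration; only the 100 MHz step size and the guaranteed-LC-frequency rule come from the slide:

```python
# Sketch: when package power exceeds the budget, step BE core
# frequency down in 100 MHz increments while LC cores keep their
# guaranteed frequency.

def rebalance(freqs: dict, lc_cores: set, power_w: float,
              budget_w: float, lc_freq_ghz: float = 3.0) -> dict:
    if power_w <= budget_w:
        return freqs                                  # nothing to do
    for core in freqs:
        if core in lc_cores:
            freqs[core] = lc_freq_ghz                 # LC guarantee holds
        else:
            freqs[core] = max(1.2, round(freqs[core] - 0.1, 1))  # -100 MHz
    return freqs

freqs = {c: 3.0 for c in range(6)}
freqs = rebalance(freqs, lc_cores={0, 1, 2, 3}, power_w=95, budget_w=90)
# BE cores 4 and 5 drop by 100 MHz; LC cores 0-3 stay at 3.0 GHz
```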
Fine-grained resource isolation mechanisms
DRAM BW
Not available in hardware; must approximate with other mechanisms
LLC partitioning influences the amount of traffic that is served by DRAM
Use the number of cores to control DRAM BW
Intuition: each core can only issue so many requests/sec
[Diagram: LLC partitioning and core partitioning together bound BE's memory traffic]
DRAM BW ≈ NumCores × BWPerCore
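The core-count proxy for DRAM bandwidth turns into a small sizing calculation. `max_be_cores`, the 10% guardband default, and all bandwidth numbers here are illustrative assumptions, not measured values:

```python
# Sketch: with no hardware knob for memory bandwidth, cap BE's DRAM
# traffic indirectly via its core count, using DRAM BW ≈ NumCores ×
# BWPerCore.

def max_be_cores(bw_per_core_gbs: float,
                 total_bw_cap_gbs: float,
                 lc_bw_gbs: float,
                 guardband: float = 0.10) -> int:
    """How many BE cores fit under the DRAM BW cap, leaving a guardband."""
    budget = total_bw_cap_gbs * (1 - guardband) - lc_bw_gbs
    return max(0, int(budget // bw_per_core_gbs))

# 50 GB/s cap, LC currently using 20 GB/s, each BE core driving ~3 GB/s:
print(max_be_cores(3.0, 50.0, 20.0))  # → 8
```

The guardband mirrors the slack the controller keeps on every resource to absorb bursts.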
But how should the knobs be set?
This looks like an optimization problem
Objective: maximize resources given to BE job
Constraints: preserve SLO of latency critical application
Challenge: 5-dimensional formulation!
Control insight #1: independence
Observation: latency violations occur when a shared
resource is extremely loaded
High demand for resource causes significant contention
LC workload is unable to obtain its required allocation
Insight: assume independent interference under 2 conditions
LC workload is not starved for any resource
Each resource has enough slack (~10%) to absorb bursts
Control insight #2: convexity
Performance as a function of resources is convex for the benchmarked workloads
Gradient descent is therefore guaranteed to find the optimum
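A minimal demonstration of the convexity argument, using an illustrative convex curve rather than a measured workload profile:

```python
# Sketch: for a convex cost curve, fixed-step gradient descent
# converges to the global optimum, so a hill-climbing controller
# cannot get stuck at a local one.

def f(x: float) -> float:
    return (x - 7.0) ** 2 + 1.0   # convex, minimum at x = 7

def grad(x: float) -> float:
    return 2.0 * (x - 7.0)

x = 0.0
for _ in range(200):
    x -= 0.1 * grad(x)            # gradient step with fixed rate 0.1

print(round(x, 3))  # → 7.0
```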
Heracles: high level controller overview
Goal: meet SLO, keep BE from saturating shared resource
Runs on each machine
[Diagram: the controller takes latency readings from the LC workload, decides whether BE can grow, and drives three subcontrollers with internal feedback loops: CPU + memory (LLC, CPU cores, DRAM BW), CPU power (DVFS), and network (HTB net. BW)]
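One top-level control interval can be sketched as below, assuming a hypothetical 10 ms SLO and simplified thresholds; the real controller's policies and the subcontroller interactions are richer than this:

```python
# Sketch: poll LC latency, compare against the SLO, and either let
# the BE job grow, throttle it, or disable it outright.

SLO_MS = 10.0  # hypothetical SLO for illustration

def top_level_step(latency_ms: float, load: float) -> str:
    """Return the action for one control interval (simplified)."""
    if latency_ms > SLO_MS:
        return "disable BE"        # SLO violation: stop BE entirely
    if latency_ms > 0.85 * SLO_MS or load > 0.85:
        return "throttle BE"       # near the limit: shrink BE
    return "grow BE"               # slack available: let subcontrollers grow BE

trace = [(4.0, 0.3), (8.9, 0.6), (12.0, 0.7), (5.0, 0.4)]
print([top_level_step(lat, u) for lat, u in trace])
# → ['grow BE', 'throttle BE', 'disable BE', 'grow BE']
```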
Heracles: high level controller overview
[Diagram repeated, annotated with the per-resource allocations: cores, LLC ways, core frequency, and network BW are each split between LC and BE; LC keeps its guarantees (e.g., max core frequency) while BE grows into the remaining slack]
Example subcontroller: Core+Memory
Isolates: Cores, LLC, DRAM
Physical mechanisms: Partitioning of cores, LLC, and DRAM
Goal: maximize cores running BE job by minimizing DRAM BW
Guardband in DRAM BW to ensure LC job is not being starved
Iterative phases:
1. Reduce total DRAM BW through LLC partitioning
2. Grow allocation of BE cores
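The two phases can be simulated with a toy bandwidth model; `bw()` and every constant here are assumptions, chosen only to make the loop's behavior visible:

```python
# Sketch of the Core+Memory subcontroller's two iterative phases:
# (1) shrink total DRAM BW by giving BE more LLC ways until the
# gradient is negligible, (2) grow BE cores while total BW stays
# under the danger zone.

DANGER_GBS = 45.0   # illustrative danger-zone threshold

def bw(be_cores: int, be_ways: int) -> float:
    # Toy model: each BE core adds traffic; each BE LLC way removes some.
    return 20.0 + 4.0 * be_cores - 1.5 * be_ways

be_cores, be_ways = 2, 0

# Phase 1: reduce total BW via LLC partitioning until benefit is negligible
while be_ways < 6 and bw(be_cores, be_ways) - bw(be_cores, be_ways + 1) > 1.0:
    be_ways += 1

# Phase 2: grow BE cores while the next step stays below the danger zone
while bw(be_cores + 1, be_ways) < DANGER_GBS:
    be_cores += 1

print(be_cores, be_ways, bw(be_cores, be_ways))
```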
Example subcontroller: Core+Memory
[Animated plots of LC DRAM BW, BE DRAM BW, and total DRAM BW over time, stepping through the control loop:]
1. Start
2. Reduce total BW via LLC partitioning
3. Stop when ∇ ≈ 0 (negligible benefit)
4. Grow BE cores
5. Total BW enters the danger zone: reduce BW again
6. Grow BE cores until the BW cap is hit
Evaluation of Heracles
Evaluation of Google production workloads on real hardware
Latency Critical workloads (production)
websearch
Leaf node, document retrieval/scoring
99%-ile latency SLO of tens of milliseconds
ml_cluster
Machine learning for text clustering
95%-ile latency SLO of tens of milliseconds
memkeyval
In-memory key-value store
99%-ile latency SLO of hundreds of microseconds
Best Effort jobs
Synthetic:
stream-LLC: LLC antagonist
stream-DRAM: DRAM BW antagonist
cpu_pwr: CPU power antagonist
Production:
brain: deep learning (LLC, DRAM, CPU, CPU power)
streetview: image stitching (DRAM BW)
Run Heracles on real hardware, measure latency and utilization
Latency validation: do no harm
[Figure: measured tail latency stays at or below the SLO line across loads]
Iso-latency: recovering slack and turning it into work
Putting it together: resource efficiency
Effective Machine Utilization = (LC load) + (% BE throughput)
[Chart: EMU vs. load on the LC app; capacity above the LC load is free batch processing capability]
Putting it together: resource efficiency
Effective Machine Utilization = (LC load) + (% BE throughput)
EMU above 100% is due to better bin-packing
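A worked example of the EMU definition above; the 60%/55% figures are made up for illustration:

```python
# Effective Machine Utilization as defined on the slide:
# EMU = LC load + BE throughput, each as a fraction of the
# machine's peak for that job running alone.

def emu(lc_load: float, be_throughput: float) -> float:
    return lc_load + be_throughput

# LC at 60% of peak, BE achieving 55% of its standalone throughput:
print(f"{emu(0.60, 0.55):.0%}")  # → 115%
```

EMU can exceed 100% because the two jobs bottleneck on different resources, so together they pack the machine better than either does alone.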
Bonus: energy efficiency too!
Power increase is far less than resource utilization increase!
300% more work for
60% more power
Cluster results
Used a load trace from off-peak hours on a websearch cluster
Conclusion
Increasing utilization is key to improving datacenter efficiency
Fine-grained knobs to control many sources of interference
Need coordinated policy to find optimal settings
Heracles significantly increases utilization
Achieves average of 90% utilization for Google workloads
Potential increase of >300% in cost efficiency