Heracles: Improving Resource Efficiency at Scale
David Lo†, Liqun Cheng*, Rama Govindaraju*, Parthasarathy Ranganathan*, Christos Kozyrakis†
† Stanford University  * Google Inc.
© 2012 Google Inc. All rights reserved. Google and the Google Logo are registered trademarks of Google Inc.
The case for oversubscription
[Figures: diurnal load variation; datacenter total cost of ownership breakdown]
Heracles: Improving Resource Efficiency at Scale (ISCA-42 June 16, 2015) 2
TCO breakdown: Servers 61%, Energy 16%, Cooling 14%, Networking 6%, Other 3%
[J. Hamilton, http://mvdirona.com]
Idleness in latency-critical workloads: a bigger opportunity than the power savings targeted by PEGASUS [ISCA'14]
Oversubscription summary
Motivation: fill in idle cycles with useful work
How: co-locate Latency Critical (LC) and Best Effort (BE) jobs
Plenty of analytics jobs available, such as deep learning training
Challenges of oversubscription
Allocation of shared resources between LC and BE
Interference on shared resources
DRAM
LLC
Cores
Network
Power
Difficult to guarantee quality of service (QoS)
How bad can interference get?
Quick experiment with a latency critical job and a batch job
The latency critical job: Google websearch
The batch job: deep learning classifier
The setup:
Run batch job at very low priority to fill in idle CPU cycles
Hope that the Linux scheduler is sufficient for QoS
How bad can interference get?
[Figure: websearch latency vs. load with the batch job co-scheduled; latency exceeds the SLO line at every load]
Cannot co-locate the workloads at any load!
Interference is different based on resource
Impact of interference on websearch's latency (as % of SLO), by antagonist resource and websearch load:

Resource      10%     20%     30%     40%     50%     60%     70%     80%     90%
LLC           >300%   >300%   >300%   >300%   >300%   >300%   >300%   264%    123%
DRAM          >300%   >300%   >300%   >300%   >300%   >300%   >300%   270%    122%
HyperThread   110%    107%    114%    115%    105%    117%    120%    136%    >300%
CPU power     124%    107%    116%    109%    115%    105%    101%    100%    100%
Network       36%     36%     37%     37%     39%     42%     48%     55%     64%

(Color scale: up to 100% of SLO = OK; above 100% = not OK)
No oversubscription possible
Interference is different based on resource
[Same table as the previous slide]
Need to manage MULTIPLE resources with a DYNAMIC controller
Oversubscription appears to be too hard
Google: ~30% avg. utilization [Barroso'09]; Twitter: ~20% avg. utilization [Delimitrou'14]
Even with cluster managers and lots of available jobs
Caused by fear of interference
Heracles: low latency and high utilization
Insights:
Use iso-latency to tolerate some interference
Fine-grained isolation on all shared resources to mitigate the rest
Implementation:
Dynamic controller to manage shared resource allocations
Evaluated on Google workloads: high utilization without QoS violations
What is iso-latency?
[Figure: websearch overall query latency vs. % of maximum cluster load, with the SLO latency line marked]
Can hide interference in this slack!
Fine-grained resource isolation mechanisms
CPU (HyperThread/L1/L2)
Use Linux cpuset cgroups to partition cores between LC and BE jobs
Single core granularity (Haswell has up to 18 cores)
~1ms response time
Example partitioning setup: the cores are split into an LC cpuset and a BE cpuset
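The core split can be sketched as a tiny helper that computes the two cpuset ranges. The function name and the static low/high split are illustrative; a real deployment would write the resulting strings into the `cpuset.cpus` files of the two cgroups:

```python
# Sketch: compute cpuset "cpus" strings for LC and BE partitions.
# Hypothetical helper; real use would write these strings to
# /sys/fs/cgroup/cpuset/<group>/cpuset.cpus.

def cpuset_strings(total_cores: int, be_cores: int) -> tuple[str, str]:
    """Give BE the highest-numbered cores, LC the rest."""
    assert 0 < be_cores < total_cores
    lc = f"0-{total_cores - be_cores - 1}"
    be = f"{total_cores - be_cores}-{total_cores - 1}"
    return lc, be

lc, be = cpuset_strings(18, 4)   # 18-core Haswell, 4 cores for BE
print(lc, be)                    # → 0-13 14-17
```

Adjusting the split at runtime is then just rewriting the two strings, which is what gives the ~1ms response time.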
Fine-grained resource isolation mechanisms
LLC
Hardware cache partitioning in latest Haswell Xeon
Partitioning by cache way (20 ways in Haswell)
<1ms adjustment latency
Example partitioning setup: the cache ways are split into an LC partition and a BE partition
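Way partitioning can be illustrated with contiguous capacity bitmasks of the kind cache allocation hardware expects. `cat_masks` is a hypothetical helper; real masks would be programmed through the resctrl filesystem or MSRs:

```python
# Sketch: build contiguous, non-overlapping capacity bitmasks for
# LLC way partitioning. Assumes 20 ways, as on the Haswell Xeon
# mentioned in the talk.

def cat_masks(total_ways: int, lc_ways: int) -> tuple[int, int]:
    """LC gets the low lc_ways ways, BE the remaining high ways."""
    assert 0 < lc_ways < total_ways
    lc_mask = (1 << lc_ways) - 1                 # e.g. 0b...0011111111111111
    be_mask = ((1 << total_ways) - 1) ^ lc_mask  # the complementary high ways
    return lc_mask, be_mask

lc, be = cat_masks(20, 14)
print(hex(lc), hex(be))  # → 0x3fff 0xfc000
```

Growing or shrinking the BE partition is one mask rewrite, which is why the adjustment latency is under a millisecond.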
Fine-grained resource isolation mechanisms
Network
Transmit rate limiting in Linux kernel with hierarchical token bucket
Fine-grained rate limits, down to 1 Mbps
~1ms response time
[Diagram: BE and LC packets go to separate queues; a packet scheduler drains both to the NIC and rate-limits only the BE flows]
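The token-bucket idea behind HTB can be sketched as follows; the class and the 1 Mbps / 1500-byte numbers are illustrative, not the kernel's implementation:

```python
# Sketch of the token-bucket idea applied to BE traffic: a bucket
# rate-limits BE bytes while LC packets would bypass it entirely.

class TokenBucket:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8   # refill rate in bytes per second
        self.tokens = burst_bytes  # start with a full burst allowance
        self.burst = burst_bytes

    def tick(self, dt: float):
        """Refill tokens for dt seconds, capped at the burst size."""
        self.tokens = min(self.burst, self.tokens + self.rate * dt)

    def try_send(self, nbytes: int) -> bool:
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False  # BE packet stays queued

bucket = TokenBucket(rate_bps=1_000_000, burst_bytes=1500)  # 1 Mbps floor
assert bucket.try_send(1500)       # burst allowance covers one MTU packet
assert not bucket.try_send(1500)   # bucket drained, packet must wait
bucket.tick(0.012)                 # 12 ms at 1 Mbps refills 1500 bytes
assert bucket.try_send(1500)
```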
Fine-grained resource isolation mechanisms
CPU power
Per-core DVFS to ensure minimum Turbo frequency for LC workload
Can change clock frequency in increments of 100MHz
<1 ms response of hardware
Example: LC cores pinned at 3.0 GHz while BE cores are stepped down to 2.0 GHz.
Shift power from BE to LC cores to maintain the guaranteed LC frequency.
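The power-shifting policy can be sketched as a toy rebalance step. `rebalance`, the 1.2 GHz floor, and the power numbers are all assumptions for illustration; only the 100 MHz step size and the guaranteed-LC-frequency rule come from the slide:

```python
# Sketch: when package power exceeds the budget, step BE core
# frequency down in 100 MHz increments while LC cores keep their
# guaranteed frequency.

def rebalance(freqs: dict, lc_cores: set, power_w: float,
              budget_w: float, lc_freq_ghz: float = 3.0) -> dict:
    if power_w <= budget_w:
        return freqs                                  # nothing to do
    for core in freqs:
        if core in lc_cores:
            freqs[core] = lc_freq_ghz                 # LC guarantee holds
        else:
            freqs[core] = max(1.2, round(freqs[core] - 0.1, 1))  # -100 MHz
    return freqs

freqs = {c: 3.0 for c in range(6)}
freqs = rebalance(freqs, lc_cores={0, 1, 2, 3}, power_w=95, budget_w=90)
# BE cores 4 and 5 drop by 100 MHz; LC cores 0-3 stay at 3.0 GHz
```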
Fine-grained resource isolation mechanisms
DRAM BW
Not available in hardware; must approximate with other mechanisms
LLC partitioning influences the amount of traffic that is served by DRAM
Use the number of cores to control DRAM BW
Intuition: each core can only issue so many requests/sec
[Diagram: LLC partitioning and core partitioning together bound BE's memory traffic]
DRAM BW ≈ NumCores × BWPerCore
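The core-count proxy for DRAM bandwidth turns into a small sizing calculation. `max_be_cores`, the 10% guardband default, and all bandwidth numbers here are illustrative assumptions, not measured values:

```python
# Sketch: with no hardware knob for memory bandwidth, cap BE's DRAM
# traffic indirectly via its core count, using DRAM BW ≈ NumCores ×
# BWPerCore.

def max_be_cores(bw_per_core_gbs: float,
                 total_bw_cap_gbs: float,
                 lc_bw_gbs: float,
                 guardband: float = 0.10) -> int:
    """How many BE cores fit under the DRAM BW cap, leaving a guardband."""
    budget = total_bw_cap_gbs * (1 - guardband) - lc_bw_gbs
    return max(0, int(budget // bw_per_core_gbs))

# 50 GB/s cap, LC currently using 20 GB/s, each BE core driving ~3 GB/s:
print(max_be_cores(3.0, 50.0, 20.0))  # → 8
```

The guardband mirrors the slack the controller keeps on every resource to absorb bursts.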
But how should the knobs be set?
This looks like an optimization problem
Objective: maximize resources given to BE job
Constraints: preserve SLO of latency critical application
Challenge: 5-dimensional formulation!
Control insight #1: independence
Observation: latency violations occur when a shared
resource is extremely loaded
High demand for resource causes significant contention
LC workload is unable to obtain its required allocation
Insight: assume independent interference under 2 conditions
LC workload is not starved for any resource
Each resource has enough slack (~10%) to absorb bursts
Control insight #2: convexity
Performance as a function of resources is convex for the benchmarked workloads
Gradient descent is therefore guaranteed to find the optimum
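A minimal demonstration of the convexity argument, using an illustrative convex curve rather than a measured workload profile:

```python
# Sketch: for a convex cost curve, fixed-step gradient descent
# converges to the global optimum, so a hill-climbing controller
# cannot get stuck at a local one.

def f(x: float) -> float:
    return (x - 7.0) ** 2 + 1.0   # convex, minimum at x = 7

def grad(x: float) -> float:
    return 2.0 * (x - 7.0)

x = 0.0
for _ in range(200):
    x -= 0.1 * grad(x)            # gradient step with fixed rate 0.1

print(round(x, 3))  # → 7.0
```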
Heracles: high level controller overview
Goal: meet SLO, keep BE from saturating shared resource
Runs on each machine
[Diagram: the controller takes latency readings from the LC workload, decides whether BE can grow, and drives three subcontrollers with internal feedback loops: CPU + memory (LLC, CPU cores, DRAM BW), CPU power (DVFS), and network (HTB net. BW)]
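One top-level control interval can be sketched as below, assuming a hypothetical 10 ms SLO and simplified thresholds; the real controller's policies and the subcontroller interactions are richer than this:

```python
# Sketch: poll LC latency, compare against the SLO, and either let
# the BE job grow, throttle it, or disable it outright.

SLO_MS = 10.0  # hypothetical SLO for illustration

def top_level_step(latency_ms: float, load: float) -> str:
    """Return the action for one control interval (simplified)."""
    if latency_ms > SLO_MS:
        return "disable BE"        # SLO violation: stop BE entirely
    if latency_ms > 0.85 * SLO_MS or load > 0.85:
        return "throttle BE"       # near the limit: shrink BE
    return "grow BE"               # slack available: let subcontrollers grow BE

trace = [(4.0, 0.3), (8.9, 0.6), (12.0, 0.7), (5.0, 0.4)]
print([top_level_step(lat, u) for lat, u in trace])
# → ['grow BE', 'throttle BE', 'disable BE', 'grow BE']
```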
Heracles: high level controller overview
[Diagram repeated, annotated with the per-resource allocations: cores, LLC ways, core frequency, and network BW are each split between LC and BE; LC keeps its guarantees (e.g., max core frequency) while BE grows into the remaining slack]
Example subcontroller: Core+Memory
Isolates: Cores, LLC, DRAM
Physical mechanisms: Partitioning of cores, LLC, and DRAM
Goal: maximize cores running BE job by minimizing DRAM BW
Guardband in DRAM BW to ensure LC job is not being starved
Iterative phases:
1. Reduce total DRAM BW through LLC partitioning
2. Grow allocation of BE cores
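The two phases can be simulated with a toy bandwidth model; `bw()` and every constant here are assumptions, chosen only to make the loop's behavior visible:

```python
# Sketch of the Core+Memory subcontroller's two iterative phases:
# (1) shrink total DRAM BW by giving BE more LLC ways until the
# gradient is negligible, (2) grow BE cores while total BW stays
# under the danger zone.

DANGER_GBS = 45.0   # illustrative danger-zone threshold

def bw(be_cores: int, be_ways: int) -> float:
    # Toy model: each BE core adds traffic; each BE LLC way removes some.
    return 20.0 + 4.0 * be_cores - 1.5 * be_ways

be_cores, be_ways = 2, 0

# Phase 1: reduce total BW via LLC partitioning until benefit is negligible
while be_ways < 6 and bw(be_cores, be_ways) - bw(be_cores, be_ways + 1) > 1.0:
    be_ways += 1

# Phase 2: grow BE cores while the next step stays below the danger zone
while bw(be_cores + 1, be_ways) < DANGER_GBS:
    be_cores += 1

print(be_cores, be_ways, bw(be_cores, be_ways))
```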
Example subcontroller: Core+Memory
[Animated plots of LC DRAM BW, BE DRAM BW, and total DRAM BW over time, stepping through the control loop:]
1. Start
2. Reduce total BW via LLC partitioning
3. Stop when ∇ ≈ 0 (negligible benefit)
4. Grow BE cores
5. Total BW enters the danger zone: reduce BW again
6. Grow BE cores until the BW cap is hit
Evaluation of Heracles
Evaluation of Google production workloads on real hardware
Latency Critical workloads (production)
websearch
Leaf node, document retrieval/scoring
99%-ile latency SLO of tens of milliseconds
ml_cluster
Machine learning for text clustering
95%-ile latency SLO of tens of milliseconds
memkeyval
In-memory key-value store
99%-ile latency SLO of hundreds of microseconds
Best Effort jobs
Synthetic:
stream-LLC: LLC antagonist
stream-DRAM: DRAM BW antagonist
cpu_pwr: CPU power antagonist
Production:
brain: deep learning (LLC, DRAM, CPU, CPU power)
streetview: image stitching (DRAM BW)
Run Heracles on real hardware, measure latency and utilization
Latency validation: do no harm
[Figure: measured tail latency stays at or below the SLO line across loads]
Iso-latency: recovering slack and turning it into work
Putting it together: resource efficiency
Effective Machine Utilization = (LC load) + (% BE throughput)
[Chart: EMU vs. load on the LC app; capacity above the LC load is free batch processing capability]
Putting it together: resource efficiency
Effective Machine Utilization = (LC load) + (% BE throughput)
EMU above 100% is due to better bin-packing
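A worked example of the EMU definition above; the 60%/55% figures are made up for illustration:

```python
# Effective Machine Utilization as defined on the slide:
# EMU = LC load + BE throughput, each as a fraction of the
# machine's peak for that job running alone.

def emu(lc_load: float, be_throughput: float) -> float:
    return lc_load + be_throughput

# LC at 60% of peak, BE achieving 55% of its standalone throughput:
print(f"{emu(0.60, 0.55):.0%}")  # → 115%
```

EMU can exceed 100% because the two jobs bottleneck on different resources, so together they pack the machine better than either does alone.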
Bonus: energy efficiency too!
Power increase is far less than resource utilization increase!
300% more work for
60% more power
Cluster results
Used a load trace from off-peak hours on a websearch cluster
Conclusion
Increasing utilization is key to improving datacenter efficiency
Fine-grained knobs to control many sources of interference
Need coordinated policy to find optimal settings
Heracles significantly increases utilization
Achieves average of 90% utilization for Google workloads
Potential increase of >300% in cost efficiency