

Achieving Power-Efficiency in Clusters without Distributed File System Complexity

Hrishikesh Amur, Karsten Schwan

Georgia Tech

Green Computing Research Initiative at GT

Circuit level: DVFS, power states, clock gating (ECE)

Chip and Package: power multiplexing, spatiotemporal migration (SCS, ECE)

Board: VirtualPower, scheduling/scaling/operating system… (SCS, ME, ECE)

Rack: mechanical design, thermal and airflow analysis, VPTokens, OS and management (ME, SCS)

Power distribution and delivery (ECE)


Datacenter and beyond: design, IT management, HVAC control… (ME, SCS, OIT…)

Focus of our work:

Data-intensive applications that use distributed storage

Focus

[Chart: per-system power consumed by CPU, memory, PCI slots, motherboard, disks, and fan]

Per-system Power Breakdown

Power off entire nodes

Approach to Power-Efficiency of Cluster

Turning Off Nodes Breaks Conventional DFS
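Why it breaks: with a conventional random replica layout (HDFS-style), powering off even part of the cluster can leave some blocks with no live replica. A small illustrative simulation in Python (my own sketch, not from the talk; node and block counts are arbitrary):

    import random

    NODES, REPLICAS, BLOCKS = 100, 3, 100_000
    random.seed(0)

    # Each block's replicas go to distinct, randomly chosen nodes (HDFS-style).
    placement = [random.sample(range(NODES), REPLICAS) for _ in range(BLOCKS)]

    for frac_off in (0.25, 0.50, 0.75):
        off = set(range(int(NODES * frac_off)))           # nodes powered down
        lost = sum(all(n in off for n in reps) for reps in placement)
        print(f"{int(frac_off * 100)}% of nodes off -> "
              f"{100 * lost / BLOCKS:.1f}% of blocks with no live replica")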

One replica of all data placed on a small set of nodes

Primary replica maintains availability, allowing nodes storing other replicas to be turned off [Sierra, Rabbit]

Modifications to Data Layout Policy
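A minimal sketch of such a layout policy (hypothetical code in the spirit of Sierra/Rabbit, not their actual implementation): the first replica of every block lands on a small, always-on primary set, so the remaining nodes can be powered off without losing availability.

    import random

    def place_block(block_id, nodes, num_primary=10, replicas=3):
        """Return the nodes that hold each replica of `block_id`."""
        primary_set = nodes[:num_primary]          # small, always-on subset
        secondary_set = nodes[num_primary:]        # eligible to be powered off
        first = primary_set[hash(block_id) % num_primary]     # primary replica
        others = random.sample(secondary_set, replicas - 1)   # remaining replicas
        return [first] + others

    # Example: with 100 nodes, every block keeps one copy on nodes 0-9,
    # so nodes 10-99 can be turned off without losing availability.
    print(place_block("block-42", list(range(100))))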

Where is new data to be written when part of the cluster is turned off?

Handling New Data

New Data: Temporary Offloading

Temporary off-loading to ‘on’ nodes is a solution

Cost of additional copying of large amounts of data

Consumes network bandwidth

Increased complexity!!

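An illustrative sketch of temporary offloading (hypothetical, not the authors' implementation): new writes go to powered-on nodes and are tracked, then re-copied to their intended layout later, which is exactly the extra copying, bandwidth use, and bookkeeping noted above.

    offloaded = {}  # block_id -> temporary node holding the data

    def write_block(block_id, on_nodes):
        """Write a new block while part of the cluster is off."""
        node = on_nodes[hash(block_id) % len(on_nodes)]
        offloaded[block_id] = node          # extra state the DFS must now track
        return node

    def reintegrate(intended_nodes_for, copy):
        """Re-copy every offloaded block to its intended layout once the
        powered-down nodes return: the extra data movement the slide warns about."""
        for block_id, tmp_node in list(offloaded.items()):
            for dst in intended_nodes_for(block_id):
                copy(block_id, src=tmp_node, dst=dst)
            del offloaded[block_id]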

Failure of primary nodes causes a large number of nodes to be started up to restore availability

To solve this, additional groups holding secondary, tertiary, etc. copies have to be maintained.

Again, increased complexity!!

Handling Primary Failures

Making a DFS power-proportional increases its complexity significantly

Provide fine-grained control over which components to turn off

Our Solution

Switch between two extreme power modes: max_perf and io_server

How do we save power?

Fine-grained control allows all disks to be kept on, maintaining access to stored data

How does this keep the DFS simple?

Prototype Node Architecture

[Diagram: Asterix node and Obelix node sharing disks through a SATA switch; a VMM runs on the node]

max_perf Mode

[Diagram: SATA switch, Asterix node, Obelix node, and the VM in max_perf mode]

io_server Mode

[Diagram: SATA switch, Asterix node, Obelix node, and the VM in io_server mode]
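A rough sketch of the two mode transitions (my own illustrative Python, not the prototype's management code; all class and method names are placeholders, and the VM shown in the mode diagrams is omitted here):

    class Node:
        def __init__(self, name):
            self.name, self.powered_on = name, True
        def power_on(self):  self.powered_on = True
        def power_off(self): self.powered_on = False

    class SataSwitch:
        def attach_disks(self, node):
            self.disk_owner = node            # the disks themselves always stay on

    def enter_io_server_mode(obelix, asterix, switch):
        switch.attach_disks(asterix)          # stored data remains accessible
        asterix.power_on()                    # low-power node keeps serving I/O
        obelix.power_off()                    # bulk of the node's power is saved

    def enter_max_perf_mode(obelix, asterix, switch):
        obelix.power_on()                     # full compute performance restored
        switch.attach_disks(obelix)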

[Figure: Throughput/Watt (MB/s/W) vs. number of servers in max_perf mode, for Obelix and Asterix-II]

Increased Performance/Power

[Figure: read throughput (MB/s) on Obelix and Asterix for Linux, domU, dom0, and domU*]

Virtualization Overhead: Reads

[Figure: write throughput (MB/s) on Obelix and Asterix for Linux, domU, dom0, and domU*]

Virtualization Overhead: Writes

Turning entire nodes off complicates the DFS

It is useful to be able to turn individual components off, or to build more power-proportional platforms/components

Prototype uses separate machines and shared disks

Summary

Load Management Policies

Static
◦ e.g., DFS, DMS, monitoring/management tasks…

Dynamic
◦ e.g., based on runtime monitoring and management/scheduling…
◦ helpful to do power metering on a per-process/VM basis

X86+Atom+IB…

VM-level Power Metering: Our Approach

Built power profiles for various platform resources
◦ CPU, memory, cache, I/O…

Utilize low-level hardware counters to track resource utilization on a per-VM basis
◦ xenoprofile, IPMI, Xen tools…
◦ track sets of VMs separately

Maintain low/acceptable overheads while maintaining desired accuracy
◦ limit the amount of necessary information and the number of monitored events: use instructions retired/s and LLC misses/s only
◦ establish accuracy bounds

Apply monitored information to a power model to determine VM power utilization at runtime
◦ in contrast to static, purely profile-based approaches
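A minimal sketch of such a runtime model (an assumed form, not the authors' model: a linear combination of the two counters named above, with placeholder coefficients standing in for values derived from the offline power profiles):

    # Placeholder coefficients -- in practice these come from platform profiling.
    IDLE_W = 0.0       # share of idle/static power attributed to the VM
    CPU_W  = 4.0e-9    # watts per (instruction retired / s)
    MEM_W  = 2.0e-6    # watts per (LLC miss / s)

    def vm_power(instr_per_s, llc_miss_per_s):
        """Runtime per-VM power estimate from the two monitored counters."""
        return IDLE_W + CPU_W * instr_per_s + MEM_W * llc_miss_per_s

    # e.g., a VM retiring 2e9 instructions/s with 5e6 LLC misses/s:
    print(f"{vm_power(2e9, 5e6):.1f} W")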
