Presentation by Cyrille Brisson, Vice President of Marketing EMEA, Eaton’s Electrical Sector, at DCD Europe, 18th November 2015

PUE: when a perfect metric becomes the enemy of good datacenter design

Cyrille Brisson – Janne Paananen

Eaton Electrical

Why PUE is a great, widely adopted hygiene KPI

• PUE measures energy expended on non-value-adding tasks
– Removing power defects that need not exist
– Removing heat, which is a by-product of computing

• PUE clearly helps to improve efficiency
– Short-term, by providing a baseline for improvement
– Long-term, by pointing at obvious waste, even though it was not created as an industry benchmark

There are other, less good reasons why PUE is so popular…

• It is the only reasonably well defined and accepted energy efficiency measurement in the datacenter industry
• It fits both the organization of traditional enterprise datacenters and the co-lo space by separating Facilities and IT efficiency
• It provides a simple number everyone believes they understand, and that has sometimes turned into a marketing gimmick

… but overall it provides a useful benchmark for newly built DCs

• Improvement has accelerated now that PUE is widely adopted, and new datacenters should aim for a PUE <1.2 – because they can!
– Good hygiene (containment, 5S…) and modern cooling technologies have made a <1.1 partial PUE achievable in most western climates
– A modern critical power chain can achieve a <1.1 partial PUE at any load by leveraging DSPs, 3-level inverters and smart reticulation (see the arithmetic sketch below)
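To see how the two partial figures relate to the overall <1.2 target, here is a quick sketch. It assumes partial PUE is defined as (IT energy + subsystem losses) / IT energy, so the overheads of disjoint subsystems simply add; the input values are illustrative, not from the presentation.

```python
# Illustrative sketch: how cooling and power-chain partial PUEs combine
# into an overall PUE, assuming pPUE = (IT energy + subsystem losses) / IT energy
# for subsystems with non-overlapping boundaries.

def overall_pue(partial_pues):
    """Each partial PUE contributes its overhead (pPUE - 1) to the total."""
    return 1.0 + sum(p - 1.0 for p in partial_pues)

# A <1.1 cooling pPUE plus a <1.1 power-chain pPUE stays under the 1.2 target.
print(overall_pue([1.08, 1.09]))  # -> 1.17
```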

DSPs help increase efficiency without trade-offs on resiliency…

• Real-time distributed monitoring of the critical bus in the UPS enables synchronized behavior for time-critical situations – no single point of failure

… by enabling instant adaptation of the power chain to the load level

• Decentralized processing allows for instant reactions

The coming ISO/IEC 30134 standard will correct some potential issues, such as…

• Confirming that PUE has to be calculated as an average value over 12 months, not a spot theoretical value (see the sketch after this list)
• Standardizing calculation to help spread best practices in design and improvement
• Establishing clear criteria for using it as a benchmark for new datacenters
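A minimal sketch of the 12-month averaging; the monthly readings and variable names are illustrative, not from the presentation:

```python
# Minimal sketch (illustrative data): annual PUE is the ratio of total
# facility energy to total IT energy accumulated over 12 months,
# not a spot value taken at one favorable moment.

monthly_facility_kwh = [420, 410, 400, 380, 390, 430, 460, 470, 440, 400, 410, 420]
monthly_it_kwh       = [350, 345, 340, 335, 340, 350, 355, 360, 355, 345, 345, 350]

annual_pue = sum(monthly_facility_kwh) / sum(monthly_it_kwh)
print(f"12-month average PUE: {annual_pue:.2f}")  # ~1.21
```

Note that summing the energies first weights each month by its consumption; averaging twelve monthly PUE ratios instead would over-weight lightly loaded months.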

… and there are potential further ideas for improvement

• PUE should be calculated not only around the year, but at different load levels (20, 40, 60, 80%) to anticipate the full impact of more proportional IT hardware and larger arrays under the same hypervisor / cloud stack
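One way to realize that idea, as a sketch with hypothetical hourly samples: bucket the measurements by IT load level and report one PUE per bucket, which makes part-load facility efficiency visible.

```python
# Sketch (hypothetical samples): report PUE per IT load level instead of a
# single annual figure, exposing how overheads dominate at low load.

# (it_load_fraction, facility_kw, it_kw) for a handful of hourly samples
samples = [
    (0.2,  95,  70), (0.2,  96,  71),
    (0.4, 160, 130), (0.4, 158, 129),
    (0.6, 220, 190), (0.8, 285, 255),
]

bins = {}
for load, fac_kw, it_kw in samples:
    fac_sum, it_sum = bins.get(load, (0.0, 0.0))
    bins[load] = (fac_sum + fac_kw, it_sum + it_kw)

for load in sorted(bins):
    fac_sum, it_sum = bins[load]
    print(f"PUE at {load:.0%} load: {fac_sum / it_sum:.2f}")
# -> 1.35 at 20%, 1.23 at 40%, 1.16 at 60%, 1.12 at 80%
```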

But… it is possible to improve a datacenter’s overall efficiency and degrade PUE at the same time!

• Example #1: a DC equipped with monolithic UPS with 2-level converters degraded its PUE by:
– Consolidating loads and decommissioning servers
– Upgrading to more energy-proportional servers
– (The reduced IT load spread the infrastructure’s fixed losses over a smaller denominator, even though total energy use fell)

• Example #2: a DC degraded its PUE by enabling heat recovery by the district heating network:
– Using a pump to drive the heat exchanger between their water circuit and the district heating’s
– (The pump’s energy counts as facility overhead, raising PUE, even though the recovered heat cut net energy use; see the worked numbers below)
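Hypothetical numbers (not from the presentation) showing both effects:

```python
# Hypothetical numbers showing how total energy can fall, or net community
# energy use improve, while PUE gets worse.

def pue(it_kwh, overhead_kwh):
    return (it_kwh + overhead_kwh) / it_kwh

# Example 1: consolidation halves IT load, but fixed UPS/cooling losses barely move.
print(pue(1000, 200))  # before: 1.20
print(pue( 500, 170))  # after:  1.34 -- less total energy, worse PUE

# Example 2: a 20 kWh heat-recovery pump is added to facility overhead.
# PUE rises from 1.20 to 1.22, yet the recovered heat displaces fuel in the
# district heating network -- a net saving that PUE cannot see.
print(pue(1000, 220))
```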

Chasing a super-low PUE can be positively harmful to overall IT efficiency

• Two recent trends can cause more harm than good if applied with insufficient due diligence:
– Moving too much resilience from infrastructure to IT may cause the increase in IT redundancy needed to maintain the SLAs of critical applications to outstrip facility gains
– The latest increases in recommended operating temperatures clearly threaten IT efficiency

Applications have varying uptime and consistency requirements

• At the most critical end of the spectrum, some applications (e.g. financial / payment records) run on ACID databases with no RPO or RTO allowance
• Even some less-critical applications have SLAs that will force IT redundancy to go up, as the probability of failure of leaner infrastructure increases (see the sketch below)
• The cost of keeping multiple versions of applications and data synchronized increases fast with the number of instances (linked to probability of failure), distance & latency
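A minimal sketch of that redundancy pressure, assuming independent site failures; the availabilities and SLA target are illustrative:

```python
# Sketch (assumes independent site failures): the leaner each site's
# infrastructure, the more replicas IT must run to keep the same SLA.
import math

def replicas_needed(site_availability, sla):
    """Smallest n such that 1 - (1 - a)^n >= sla."""
    return math.ceil(math.log(1 - sla) / math.log(1 - site_availability))

for a in (0.9999, 0.999, 0.99):            # per-site availability
    print(a, replicas_needed(a, 0.99999))  # 99.999% application SLA
# -> 2, 2, 3 replicas: leaner sites mean more copies to keep in sync
```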

Eliminating power protection suits only certain types of applications

• If you run a small number of applications not using ACID databases over a large number of globally distributed datacenters, you can probably eliminate backup power layers and tolerate faults
• If you run stateful customer loads and rely on traditional databases, “saving” on infrastructure could cost you a lot in duplicated HW and bandwidth

High temperature reduces IT efficiency in 3 ways

• Processor power leakage
• Fan power
• Vibration from fans and InRow coolers kills hard-drive efficiency by increasing the number of cycles required to complete transactions

CPU leakage current increases exponentially with the CPU junction temperature…
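A toy model of that relationship: subthreshold leakage is commonly modeled as exponential in junction temperature; the reference power and e-folding constant below are illustrative, not measured values.

```python
# Toy model (illustrative constants): leakage power grows roughly
# exponentially with junction temperature.
import math

def leakage_power_w(t_junction_c, p_ref_w=10.0, t_ref_c=60.0, k_c=20.0):
    """p_ref_w at t_ref_c, growing by a factor of e every k_c degrees C."""
    return p_ref_w * math.exp((t_junction_c - t_ref_c) / k_c)

for t in (60, 70, 80, 90):
    print(f"{t} degC junction -> ~{leakage_power_w(t):.1f} W leakage")
# -> 10.0, 16.5, 27.2, 44.8 W: each 10 degC adds a large fixed-cost penalty
```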

… and running fans flat out to cool the CPU may not help your power consumption much
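The fan affinity laws, under which fan power scales roughly with the cube of speed, show why; the rated power below is illustrative.

```python
# Fan affinity laws: power scales roughly with the cube of fan speed,
# so pushing fans from 60% to 100% speed roughly quintuples their draw.
def fan_power_w(speed_fraction, rated_power_w=15.0):
    return rated_power_w * speed_fraction ** 3

for s in (0.6, 0.8, 1.0):
    print(f"{s:.0%} speed -> {fan_power_w(s):.1f} W per fan")
# -> 3.2 W, 7.7 W, 15.0 W
```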

As fan and InRow vibration levels increase, I/O throughput drops and the time taken to complete workloads goes up

[Charts: OLTP workload throughput vs. vibration; time taken to update a 10 TB database vs. vibration]

Conclusion

• A datacenter is a system delivering IT services at a certain environmental cost, and must be considered as a system
– Best practices show what is possible: <1.2
– Experimenting beyond best practices induces trade-offs that must be carefully researched

• There is no such thing as a free lunch.