Data Centre Tier Classifications Developed from The Uptime Institute & TIA-942 Single Line Diagrams

Chloride - Data centre tier classifications


Page 1: Chloride - Data centre tier classifications

Data Centre Tier Classifications

Developed from The Uptime Institute & TIA-942 Single Line Diagrams

Page 2: Chloride - Data centre tier classifications

Benchmarking Data Centre ‘Quality’

• There has long been a need to ‘measure’ the ‘quality’ of a critical facility

• The ‘quality’ is usually expressed as the Availability of the IT functionality of the facility, in terms of ‘number-of-nines’

- e.g. Three Nines = 99.9% Availability

- Note that several engineered and human systems have to contribute to the whole facility and its IT functionality, including the IT hardware and software itself

• At this top level it can be applied over a short timeframe, e.g. 99.9% over 1 year = one failure event lasting 8.76 hours

- It should never be assumed to cover multiple failure events

For further treatment of Availability and MTBF/MTTR see the Appendix A slides
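As a rough check of the 8.76-hour figure above, the downtime implied by an availability percentage can be computed directly. This is a minimal sketch: the function name `downtime_hours` and the 8,760-hour (365-day) year are assumptions made for illustration, not part of the original slides.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def downtime_hours(availability_percent: float, years: float = 1.0) -> float:
    """Total downtime (hours) implied by an availability over a period."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * HOURS_PER_YEAR * years


if __name__ == "__main__":
    # Three nines over one year
    print(f"{downtime_hours(99.9):.2f} h")
```

The result is a total only; as the slide warns, it says nothing about how many separate failure events make up that total.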

Page 3: Chloride - Data centre tier classifications

IT functionality – Three-Nines?

• Your mission critical hardware can only deliver its maximum potential if the whole facility ‘works’

- IT Hardware & Connectivity

- Power & Cooling

- Fire detection, alarm and suppression

- Maintenance and emergency intervention

- Security, internal and external, physical and software attack

- Human Error, EPO, Systems Training & Facility Management

- External disasters – earthquake, hurricane, flood, fire .. air-crash

- 20+ systems are critical to the success of the IT functionality

(99.99%)^20 = 99.8% = 17.5h downtime per year

(99.999%)^20 = 99.98% = 1.75h downtime per year
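The two figures above come from multiplying together the availabilities of 20 systems that must all work at once. The sketch below reproduces that arithmetic under the assumption that the systems are independent and equally available; the function names are illustrative only.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def series_availability(per_system: float, n_systems: int) -> float:
    """Availability of n independent systems that must all be up at once."""
    return per_system ** n_systems


def annual_downtime_hours(availability: float) -> float:
    """Hours per year during which the combined facility is down."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for each in (0.9999, 0.99999):
        site = series_availability(each, 20)
        hours = annual_downtime_hours(site)
        print(f"{each:.3%} each -> {site:.2%} site, {hours:.2f} h/year")
```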

Page 4: Chloride - Data centre tier classifications

The Uptime Institute

• The Uptime Institute [1] has, for more than 10 years, sponsored research and practical studies into data centre design, operation and the resulting resilience, and has developed a Tier Classification to describe and differentiate facilities from an availability standpoint

• A White Paper [2] from the Institute (whose authors include the originator of dual power supplies in IT equipment and of the Tier system itself) is the basis of this review of the facility and operational concepts

• The Uptime Institute is a commercial organisation and the guidelines it created are not in the form of a technical standard. However, many of the principles and details have been incorporated in TIA-942 (see next slide)

[1] The Uptime Institute, Building 100, 2904 Rodeo Park Drive East, Santa Fe, NM 87505, USA

[2] Title: Industry Standard Tier Classifications Define Site Infrastructure Performance, Turner, Seader & Brill, © 2001-2005 The Uptime Institute, Inc

Page 5: Chloride - Data centre tier classifications

American ANSI/TIA Standard

• ANSI/TIA-942-2005

- Telecommunications Infrastructure Standard for Data Centers

• Telecommunications Industry Association, Standards and Technology Dept, 2500 Wilson Boulevard, Arlington, VA 22201, USA

- www.tiaonline.org/standards/search_n_order.cfm

• Follows the same Tier I-IV format and draws heavily on The Uptime Institute publications but extends the detail, especially in connectivity

• An entirely USA-centric ANSI specification, so it can only be used as a guide in EN territories

• Specifically for telecom-related data centre environments and <2,700 W/m²

Page 6: Chloride - Data centre tier classifications

Tier Classification – Tier I to IV

• The classification system takes into account that at least 20 major M&E systems contribute to the overall IT availability (such as fire alarms, EPO etc) and that Tier IV represents 99.995% site availability (over a five year period) with the critical systems loaded to a maximum of 90%

• Each and every system has to meet this table:

(99.9999%)^20 = 99.998% = 1h downtime every 5 years

20 discrete systems, each of Six-Nines Availability!
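Working backwards from a site target shows why each system needs so many nines. The sketch below takes the 20th root of the site availability, assuming 20 independent and equally available systems in series; `required_per_system` is a hypothetical helper name, not something defined by The Uptime Institute or TIA-942.

```python
def required_per_system(site_availability: float, n_systems: int = 20) -> float:
    """Per-system availability needed so n systems in series meet the site target."""
    return site_availability ** (1.0 / n_systems)


if __name__ == "__main__":
    for target in (0.99995, 0.99998):
        each = required_per_system(target)
        print(f"site {target:.3%} -> each system {each:.5%}")
```

Under these assumptions a site target of 99.998% needs roughly six nines from every one of the 20 systems, which matches the slide's closing remark.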

Page 7: Chloride - Data centre tier classifications

Tier IV – the ultimate in resilience?

• Fault Tolerant – defined as ‘a site that can sustain at least one unplanned worst-case infrastructure failure with no critical load impact’

• Concurrently Maintainable – defined as ‘a site that is able to perform planned maintenance activity without shutting down the critical load’. Note that it is acceptable for the fault tolerance level to be reduced during maintenance or after the first fault

• Tier IV Classification only applies to dual power supply loads where complete functionality is obtained with either power supply fed and where the two inputs, in normal operation, share the power demand, as defined by The Uptime Institute’s own specification [1]

• A technical and philosophical argument continues about Static Transfer Switches for single-cord loads in Tier IV designs

- Is that Tier III.5 or IV.5?

[1] Title: Fault Tolerant Power Compliance Specifications, v2.0, see www.uptimeinstitute.org

Page 8: Chloride - Data centre tier classifications

Electrical Single Line Diagrams

• There is no compulsion on the designer to follow strictly the designs derived from the Tier Classifications. In many cases compromises will have to be made

- The benchmarking function of the Tier system then provides a useful yardstick to measure against

• In the rest of this presentation we refer only to the Electrical systems, just one of the 20+ engineered systems that are required to gain a Classification rating

• A particular facility’s Tier rating is the lowest of all its system Tier Classifications

- Tier IV power + Tier III ‘all other’ + Tier II cooling = Tier II Facility
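The "lowest system wins" rule above is simply a minimum over the per-system ratings. A minimal sketch, with hypothetical system names taken from the slide's own example:

```python
def facility_tier(system_tiers: dict[str, int]) -> int:
    """The facility rating is the lowest Tier among its systems."""
    if not system_tiers:
        raise ValueError("at least one system rating is required")
    return min(system_tiers.values())


if __name__ == "__main__":
    # Ratings from the slide's example; the keys are illustrative labels.
    ratings = {"power": 4, "all other": 3, "cooling": 2}
    print(f"Tier {facility_tier(ratings)} facility")
```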

Page 9: Chloride - Data centre tier classifications

Tier I – most critical power systems

A basic single-bus critical power system suitable for single-corded IT loads

There is no specific redundancy called for, although it can be argued that the standby generator set is redundant for the grid supply

Although only ‘N’ is specified, the designer should avoid multiple components in power-parallel configuration as it drastically reduces the potential Availability, i.e. N=1 is best

Maintenance generally involves supplying the load with non-UPS power and an annual load shut-down

Availability of Power at load typically 99.95%*

*Over 5 years operation

Page 10: Chloride - Data centre tier classifications

Tier II – increasing levels of redundancy

A single-bus power system suitable for both single and dual-corded loads

Redundancy is called for in the standby generator installation to reduce the chance of ‘failure-to-start’, but not the mains supply

N+1 is specified for the UPS so a high degree of maintenance can be concurrent

Load bank connections are mandatory

Dual-corded loads (expected minority) should be fed by separate A+B PDU’s whilst only the single-corded loads should be fed via STS’s (performing a maintenance function rather than Availability enhancement)

Note the option of a ‘B’ UPS, practical when dual-cord loads are few

Availability at load typically 99.98%*

*Over 5 years operation

Page 11: Chloride - Data centre tier classifications

Tier III – more redundancy + segregation

A dual-bus power system suitable for both single and dual-corded loads

Redundancy is called for in the mains supply and the standby generator sets. These must be compartmentalised to reduce the risk of common-mode failure, fire etc

N+1 is specified for the UPS so a high degree of maintenance can be concurrent

Dual-corded loads should be fed by separate A+B PDU’s whilst only the single-corded loads should be fed via STS’s (performing a maintenance function rather than Availability enhancement)

Note the ability of a rapid upgrade to a ‘B’ UPS and Tier IV (but don’t forget the other systems)

An important ‘extra’ here is the Load Bus Synchronisation. When the STS’s can have UPS power on one input and the generator supply on the other it is essential (for the load) to keep the two supplies within 30° of phase of each other

Availability of Power at the load typically 99.99%
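The 30° requirement for Load Bus Synchronisation can be pictured as a simple phase-window check made before a transfer between the two sources. The sketch below is purely illustrative: the function names are hypothetical, it is not the control logic of any real STS, and a real device would also check voltage and frequency.

```python
def phase_difference_deg(phase_a_deg: float, phase_b_deg: float) -> float:
    """Smallest absolute phase angle between two sources, in degrees (0 to 180)."""
    diff = abs(phase_a_deg - phase_b_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def transfer_permitted(phase_a_deg: float, phase_b_deg: float,
                       limit_deg: float = 30.0) -> bool:
    """True when the two supplies are within the phase window."""
    return phase_difference_deg(phase_a_deg, phase_b_deg) <= limit_deg


if __name__ == "__main__":
    print(transfer_permitted(350.0, 10.0))  # 20 degrees apart across the wrap-around
    print(transfer_permitted(0.0, 45.0))    # 45 degrees apart
```

Taking the difference modulo 360° matters: two sources at 350° and 10° are only 20° apart, not 340°.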

Page 12: Chloride - Data centre tier classifications

Tier IV – the Uptime purist’s configuration

For a further dialogue on this circuit-breaker see Appendix B

Page 13: Chloride - Data centre tier classifications

Tier IV – The Uptime Institute original

• Complete physical segregation of the two power supplies from the grid to the dual-corded load – a true Dual-Bus system

- 2x(N+1) in every system, maximum 90% load

- Concurrent maintenance possible without load shut down and without losing N+1 redundancy

- Needs two grid sub-stations (they will be on the same MV ring) and diverse cable routes into the site

- Two mechanical load power switchboards in dual-bus

- Note! Many engineers question having N+1 on both A & B buses

• ONLY dual-corded loads

- No STS’s, no common point of failure except the load

- Simple to operate, hence reliable

• With care in design, installation, operation and maintenance, 99.999% power Availability is possible

Page 14: Chloride - Data centre tier classifications

Tier IV + STS’s + LBS = Tier how many?

Page 15: Chloride - Data centre tier classifications

Tier III.5 or IV.5? That is the question!

• Not all loads are dual-corded

- Load transparent switching via STS’s is a great maintenance tool

• Classic Tier IV but with STS’s for single-corded loads

- Absolutely needs the addition of Load Bus Synchronisation

• Insist on three PDU’s in the data-room

- ‘A’ fed from UPS-A for dual-cord loads

- ‘B’ fed from UPS-B for (the other feed of) dual-cord loads

- ‘A/B’ with STS fed from UPS-A & B for single-cord loads

- Feeding dual-corded loads via STS’s reduces Availability to that of the STS itself and negates the principle of dual-bus
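The last point can be illustrated numerically. A dual-corded load fed directly from both buses only fails if both buses fail, whereas the same load fed through one STS can never be better than the STS. The sketch below assumes independent buses, and the availability figures in it are invented for illustration; they are not taken from the slides or from any product data.

```python
def dual_bus(a_bus: float, b_bus: float) -> float:
    """Dual-corded load: down only when both buses are down (independence assumed)."""
    return 1.0 - (1.0 - a_bus) * (1.0 - b_bus)


def via_sts(a_bus: float, b_bus: float, sts: float) -> float:
    """The same two buses, but with a single STS in series with the load."""
    return sts * dual_bus(a_bus, b_bus)


if __name__ == "__main__":
    bus = 0.9999   # illustrative figure only
    sts = 0.99999  # illustrative figure only
    print(f"dual-corded, direct:  {dual_bus(bus, bus):.8f}")
    print(f"dual-corded, via STS: {via_sts(bus, bus, sts):.8f}")
```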

Page 16: Chloride - Data centre tier classifications

Tier IV + STS’s + detail from TIA-942

Page 17: Chloride - Data centre tier classifications

Unique to TIA-942 - in the detail

• Tier IV has to have impedance-based battery monitoring systems

• TIA-942 says that when a system (A or B) is shut down for routine maintenance then the maintenance bypass should be energised by a UPS supply

- Not to rely on the dual-corded loads to operate with one feed dead?

- TIA-942, Page 123, RH column ‘UPS Maintenance Bypass Arrangement’

• A third UPS (C) system? Uneconomic, space hungry, utilisation 0.05% and a poor return on investment

- Chloride solution (red-line on diagram)

• Cross-feed the output of each UPS system to the maintenance bypass of the alternate system

• Manual control, padlocked and interlocked isolators, break-before-make, no hot-transfer, no point of common coupling in an auto-mode, sync-check blocking relays across breakers = safe

Page 18: Chloride - Data centre tier classifications

Secure Power Always

Ian F Bitterlin
International Sales Director

Contact details
Tel: +44 (0) 7717 467 579
E mail: [email protected]
Web: www.chloridepower.com

Page 19: Chloride - Data centre tier classifications

Power Availability

Your mission critical hardware can only deliver its maximum potential if it has continuous voltage available at the input terminals and the rest of the facility ‘works’.

Power Availability = 99.9..?..9%

Cooling Availability = 99.9..?..9%

Security & Fire = 99.9..?..9%

Human Error, Training & Management = 99.9..?..9%

Natural & Unnatural Disaster = 99.9..?..9%

Appendix A – Understanding ‘Nines’

Page 20: Chloride - Data centre tier classifications

Availability: How good is 99.9%?

UK numbers

• 44 minutes of unsafe drinking water per month

• 3 crash-landings per week at Heathrow

• 3,000 letters lost by The Post Office, every hour

• 2,000 surgical mistakes in the NHS, every week

• 9,000 incorrect banking debits per hour

• 32,000 missed heartbeats, per person, per year

- Not all in one go, please.

Page 21: Chloride - Data centre tier classifications

“Availability”

The most abused piece of “data” in the industry

• The term, and concept of, “Availability” is often misused to describe the “Quality” of a data centre facility power supply

• It is generally assumed (but rarely achieved) that this also applies to the associated cooling system

• Expressed in terms of the percentage uptime, better known as a ‘number-of-nines’

- 99.999% = “Five-Nines”

• The calculation is based upon:

- MTBF = the Mean Time Between Failure (hours)

- MDT = the Mean Down Time (hours)

Page 22: Chloride - Data centre tier classifications

Simple Arithmetic

• Availability = MTBF ÷ (MTBF + MDT) × 100%

• Example

- MTBF = 25,000 hours

- MDT = 1 hour

- Availability = (25,000 ÷ 25,001) × 100 = 99.996%

- Note that a 100% “reliable” system that needs annual maintenance cannot have A=100%!
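The Availability formula on this slide, MTBF divided by the sum of MTBF and MDT, translates directly into code. A minimal sketch with an illustrative function name, using the slide's own example values:

```python
def availability_percent(mtbf_hours: float, mdt_hours: float) -> float:
    """Availability (%) = MTBF / (MTBF + MDT) x 100."""
    if mtbf_hours <= 0 or mdt_hours < 0:
        raise ValueError("MTBF must be positive and MDT non-negative")
    return mtbf_hours / (mtbf_hours + mdt_hours) * 100.0


if __name__ == "__main__":
    # Slide example: MTBF = 25,000 h, MDT = 1 h
    print(f"{availability_percent(25_000, 1):.3f}%")
```

Any non-zero down time, including planned maintenance, keeps the result below 100%, which is the point of the slide's final note.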

Page 23: Chloride - Data centre tier classifications

Availability Nines: A measure of “quality”?

MTBF | 10 years | 1 month | 1 day
MDT | 1 hour | 30 seconds | 1 second
Availability | 99.99885% | 99.99885% | 99.99884%

Four-Nines = OK? But do you really want a failure every day?

In reality it’s worse. Assuming the system recovery time is 6 hours:

MDT | 6+1 hours | 6h+30s | 6h+1s
Availability | 99.992% | 99.17% | 74.99%

Page 24: Chloride - Data centre tier classifications

20ms power events in 12 months?

How many computer crashes will you accept?

Availability | “Nines” | MDT | 20ms failures
99.0% | 2 | 87.6 hrs | 15,768,000
99.9% | 3 | 8.76 hrs | 1,576,800
99.99% | 4 | 53 min | 157,680
99.999% | 5 | 5.3 min | 15,768
99.9999% | 6 | 31.5 sec | 1,577
99.99999% | 7 | 3.15 sec | 158
99.999999% | 8 | 315 ms | 15
99.9999999% | 9 | 31.5ms | 2

The “Nines” cannot be applied to power over a single year!

Better to use MTBF/MDT for ‘one’ failure event
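The event counts in the table follow from dividing the annual downtime allowance into 20 ms slices. The sketch below shows that arithmetic, assuming a 365-day year and rounding to the nearest whole event; the names are illustrative and it is not claimed to reproduce every rounded figure in the table.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, assuming a 365-day year
EVENT_SECONDS = 0.020               # one 20 ms power event


def annual_downtime_seconds(nines: int) -> float:
    """Downtime per year for an availability with the given number of nines (2 -> 99%)."""
    return SECONDS_PER_YEAR / 10 ** nines


def events_per_year(nines: int) -> int:
    """Number of 20 ms events that fit in that downtime, to the nearest whole event."""
    return round(annual_downtime_seconds(nines) / EVENT_SECONDS)


if __name__ == "__main__":
    for nines in (2, 5, 6):
        print(nines, f"{annual_downtime_seconds(nines):,.3f} s", f"{events_per_year(nines):,}")
```

Even at six nines the allowance still admits well over a thousand 20 ms interruptions a year, which is why the slide recommends quoting MTBF and MDT for a single event instead.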

Page 25: Chloride - Data centre tier classifications

There are three common abuses

• 99.9% - sounds good but needs to be looked at carefully

• 99.999..9% (even 100%) with no obvious technical foundation or caveats about time-frame, events or maintenance for the claim

• Hardly anyone ever mentions whether there is increased risk during essential maintenance – or whether maintenance can be carried out at all without a load shutdown

- The single most important factor in achieving high MTBF figures in the field is correct, expert and timely maintenance

Page 26: Chloride - Data centre tier classifications

“High-Nines” Availability?

Beware systems that cannot be maintained!

• A UPS system, cooling system and distribution scheme requires intrusive maintenance every year for four hours - therefore the maximum Availability of “Single-Bus” space is 99.95%

• Can you afford the risk of doing M&E at the same time?

- If not, it could be two shutdowns and <99.9%

• Only Dual-Bus systems can be applied and achieve “high nines” with Maintenance

• Most critical systems installed today have a “weaker” cooling design than electrical

- Even though the mechanical system has a much wider fault tolerance the designer should ensure a balance is achieved between the electrical and the mechanical designs, including the “maintenance without shutdown” scenario
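The 99.95% ceiling for Single-Bus space is simply the fraction of the year left once the planned shutdown is taken out. A minimal sketch, assuming an 8,760-hour year and no unplanned failures at all; the function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def max_availability_percent(shutdown_hours_per_year: float) -> float:
    """Best possible availability when planned shutdowns are the only downtime."""
    return (1.0 - shutdown_hours_per_year / HOURS_PER_YEAR) * 100.0


if __name__ == "__main__":
    # Four hours of intrusive maintenance per year, as on the slide
    print(f"{max_availability_percent(4):.2f}%")
```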

Page 27: Chloride - Data centre tier classifications

Appendix B – Load isolation breaker and N+?

• To be able to run the load via the bypass and test the UPS system as a parallel group is a very attractive and useful operational/maintenance feature

- The load isolation breaker enables that function

• Generally that means that between the PDU and the output bus of the UPS system there are at least two MCCB’s or ACB’s in series

- Typical MTBF published at 250,000h (28.5y) with maintenance

• This negates the advantage of applying any reliability enhancement strategy using N+(more than 1)

Page 28: Chloride - Data centre tier classifications

Distribution limits the UPS Availability

[Diagram labels: Utility/Generator Feed, Input Switchboard, Maintenance Bypass, Output Switchboard, Critical Load Bus]

• N+X UPS does not improve things

• Bus-voltage Availability depends upon these two switches

• Single-bus maximum MTBF = 125,000h (14 years)

• e.g. With an 8h MDT that is 99.99%

Typically 250,000h MTBF each

Two in series = 125,000h MTBF
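The 125,000h figure follows from adding the failure rates of the two breakers in series, and the 99.99% from the usual MTBF and MDT formula. The sketch below assumes constant, independent failure rates, so that the rate is 1/MTBF; the function names are illustrative.

```python
def series_mtbf(*mtbf_hours: float) -> float:
    """MTBF of components in series: failure rates (1/MTBF) add up."""
    return 1.0 / sum(1.0 / m for m in mtbf_hours)


def availability_percent(mtbf_hours: float, mdt_hours: float) -> float:
    """Availability (%) = MTBF / (MTBF + MDT) x 100."""
    return mtbf_hours / (mtbf_hours + mdt_hours) * 100.0


if __name__ == "__main__":
    bus_mtbf = series_mtbf(250_000, 250_000)
    print(f"{bus_mtbf:,.0f} h ({bus_mtbf / 8760:.1f} years)")
    print(f"{availability_percent(bus_mtbf, 8):.2f}%")
```

However much redundancy is added in the UPS modules upstream, the bus availability computed this way is capped by the two switches.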

Page 29: Chloride - Data centre tier classifications

N+1 redundant UPS architecture: N?

600kVA Load

1+1 | 2+1 | 3+1
100% Redundancy | 50% Redundancy | 25% Redundancy
2x 600kVA modules | 3x 300kVA modules | 4x 200kVA modules
R = 10* | R = 9 | R = 8
Day One only | Day One to Two | Day One to Three
Highest UPS CapEx | Scope for load shrink | High scope for load shrink
High risk of partial load | Medium risk of partial load | Low risk of partial load
High load step | Medium load step | Low load step
1200kVA of batteries | 900kVA of batteries | 800kVA of batteries
- | 25% space saving | 33% space saving

Lower battery CapEx etc

*Based on Reliability (R) of a single module = 1
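The module sizes and battery totals in the comparison can be derived from the load and the choice of N. The sketch below assumes that each module is rated at load ÷ N and that the battery capacity tracks the installed UPS capacity, which is how the slide's "kVA of batteries" row reads; the function and key names are hypothetical.

```python
def n_plus_1(load_kva: float, n: int) -> dict:
    """Size an N+1 UPS group for a given load."""
    module_kva = load_kva / n               # each module carries 1/N of the load
    modules = n + 1                         # one redundant module
    installed_kva = module_kva * modules
    baseline_kva = 2 * load_kva             # installed capacity of the 1+1 case
    return {
        "modules": modules,
        "module_kva": module_kva,
        "installed_kva": installed_kva,
        "saving_vs_1_plus_1": 1.0 - installed_kva / baseline_kva,
    }


if __name__ == "__main__":
    for n in (1, 2, 3):
        cfg = n_plus_1(600, n)
        print(f"{n}+1: {cfg['modules']}x {cfg['module_kva']:.0f}kVA, "
              f"{cfg['installed_kva']:.0f}kVA installed, "
              f"{cfg['saving_vs_1_plus_1']:.0%} saving")
```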