Data Centre Tier Classifications
Developed from The Uptime Institute & TIA-942 Single Line Diagrams
Benchmarking Data Centre ‘Quality’
• There has long been a need to ‘measure’ the ‘quality’ of a critical facility
• The ‘quality’ is usually expressed as Availability of the IT functionality of the facility in terms of ‘number-of-nines’
- e.g. Three Nines = 99.9% Availability
- Note that several engineered and human systems have to contribute to the whole facility and its IT functionality, including the IT hardware and software itself
• At this top level it can be applied over a short timeframe, e.g. 99.9% over 1 year = one failure event lasting 8.76 hours
- It should never be assumed to cover multiple failure events
For further treatment of Availability and MTBF/MTTR see the Appendix A slides
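The ‘number-of-nines’ arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `downtime_hours` and the 8,760-hour year are assumptions, not part of the original slides.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year, leap days ignored


def downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Total downtime implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_hours


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_hours(pct):.2f} h per year")
```

Note that the result is a total: it says nothing about whether the downtime arrives as one event or many.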
IT functionality – Three-Nines?
• Your mission critical hardware can only deliver its maximum potential if the whole facility ‘works’
- IT Hardware & Connectivity
- Power & Cooling
- Fire detection, alarm and suppression
- Maintenance and emergency intervention
- Security, internal and external, physical and software attack
- Human Error, EPO, Systems Training & Facility Management
- External disasters – earthquake, hurricane, flood, fire .. air-crash
- 20+ systems are critical to the success of the IT functionality
99.99%^20 = 99.8% = 17.5h downtime per year
99.999%^20 = 99.98% = 1.75h downtime per year
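The two figures above follow from multiplying the availabilities of systems that must all work at once. A minimal sketch, assuming the 20 systems fail independently and share the same availability (the function name `series_availability` is illustrative):

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def series_availability(per_system, n_systems):
    """Availability of n independent systems that must all be working."""
    return per_system ** n_systems


for a in (0.9999, 0.99999):
    site = series_availability(a, 20)
    hours = (1 - site) * HOURS_PER_YEAR
    print(f"{a:.3%} each -> {site:.2%} overall, {hours:.2f} h downtime per year")
```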
The Uptime Institute
• The Uptime Institute [1] has, for more than 10 years, sponsored research and practical studies into data centre design, operation and resultant resilience and developed a Tier Classification to describe and differentiate facilities from an availability standpoint
• A White Paper [2] from the Institute (authors of which include the originator of dual power supplies in IT equipment and the Tier system itself) is the basis of this review of the facility and operational concepts
• The Uptime Institute is a commercial organisation and the guidelines it created are not in the form of a technical standard. However, many of the principles and details have been incorporated in TIA-942 (see next slide)
[1] The Uptime Institute, Building 100, 2904 Rodeo Park Drive East, Santa Fe, NM 87505, USA
[2] Title: Industry Standard Tier Classifications Define Site Infrastructure Performance, Turner, Seader & Brill, © 2001-2005 The Uptime Institute, Inc
American ANSI/TIA Standard
• ANSI/TIA-942-2005
- Telecommunications Infrastructure Standard for Data Centers
• Telecommunications Industry Association, Standards and Technology Dept, 2500 Wilson Boulevard, Arlington, VA 22201, USA
- www.tiaonline.org/standards/search_n_order.cfm
• Follows the same Tier I-IV format and draws heavily on The Uptime Institute publications but extends the detail, especially in connectivity
• Entirely a USA-centric ANSI specification, so can only be used as a guide in EN territories
• Specifically for telecom related data-centre environments and <2700 W/m²
Tier Classification – Tier I to IV
• The classification system takes into account that at least 20 major M&E systems (such as fire alarms, EPO etc) contribute to the overall IT availability, and that Tier IV represents 99.995% site availability (over a five-year period) with the critical systems loaded to a maximum of 90%
• Each and every system has to meet this table:
99.9999%^20 = 99.998% = 1h downtime every 5 years
20 discrete systems, each of Six-Nines Availability!
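The Six-Nines figure can be checked in both directions: forwards from 20 systems to the site, and backwards from the Tier IV site target to the availability each system would need. A rough sketch, assuming 20 independent, identical systems in series (the helper `required_per_system` is hypothetical):

```python
HOURS_5Y = 5 * 8760  # assumption: five 365-day years
N_SYSTEMS = 20


def required_per_system(site_target, n_systems=N_SYSTEMS):
    """Per-system availability needed so that n systems in series meet the target."""
    return site_target ** (1 / n_systems)


site = 0.999999 ** N_SYSTEMS
print(f"20 x Six-Nines -> {site:.4%}, {(1 - site) * HOURS_5Y:.2f} h down in 5 years")
print(f"Needed per system for a 99.995% site: {required_per_system(0.99995):.5%}")
```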
Tier IV – the ultimate in resilience?
• Fault Tolerant – defined as ‘a site that can sustain at least one unplanned worst-case infrastructure failure with no critical load impact’
• Concurrently Maintainable – defined as ‘a site that is able to perform planned maintenance activity without shutting down the critical load’. Note that it is acceptable that the fault tolerance level will be reduced during maintenance or after the first fault
• Tier IV Classification only applies to dual power supply loads where complete functionality is obtained with either power supply fed and where the two inputs, in normal operation, share the power demand, as defined by The Uptime Institute’s own specification [1]
• A technical and philosophical argument reigns about Static Transfer Switches for single-cord loads in Tier IV designs
- Is that Tier III.5 or IV.5?
[1] Title: Fault Tolerant Power Compliance Specifications, v2.0, see www.uptimeinstitute.org
Electrical Single Line Diagrams
• There is no compulsion on the designer to strictly follow the designs derived from the Tier Classifications. In many cases compromises will have to be made
- The benchmarking function of the Tier system then provides a useful yardstick to measure against
• In the rest of this presentation we only refer to the Electrical systems, just one of the 20+ engineered systems that are required to gain a Classification rating
• A particular facility’s Tier rating is the lowest of all its system Tier Classifications
- Tier IV power + Tier III ‘all other’ + Tier II cooling = Tier II Facility
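The ‘lowest of all’ rule is simply a minimum over the per-system ratings. A small sketch using the example above (the dictionary keys and the function name `facility_tier` are illustrative):

```python
TIER_NAMES = {1: "I", 2: "II", 3: "III", 4: "IV"}


def facility_tier(system_tiers):
    """A facility is rated at the lowest Tier achieved by any of its systems."""
    return min(system_tiers.values())


systems = {"power": 4, "all other": 3, "cooling": 2}
print("Tier", TIER_NAMES[facility_tier(systems)])  # prints "Tier II"
```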
Tier I – most critical power systems
A basic single-bus critical power system suitable for single-corded IT loads
There is no specific redundancy called for, although it can be argued that the standby generator set is redundant for the grid supply
Although only ‘N’ is specified, the designer should avoid multiple components in power-parallel configuration as it drastically reduces the potential Availability, i.e. N=1 is best
Maintenance generally involves supplying the load with non-UPS power and an annual load shut-down
Availability of Power at load typically 99.95%*
*Over 5 years operation
Tier II – increasing levels of redundancy
A single-bus power system suitable for both single and dual-corded loads
Redundancy is called for in the standby generator installation to reduce the chance of ‘failure-to-start’, but not the mains supply
N+1 is specified for the UPS so a high degree of maintenance can be concurrent
Load bank connections are mandatory
Dual-corded loads (expected minority) should be fed by separate A+B PDU’s whilst only the single-corded loads should be fed via STS’s (performing a maintenance function rather than Availability enhancement)
Note the option of a ‘B’ UPS, practical when dual-cord loads are few
Availability at load typically 99.98%*
*Over 5 years operation
Tier III – more redundancy + segregation
A dual-bus power system suitable for both single and dual-corded loads
Redundancy is called for in the mains supply and the standby generator sets. These must be compartmentalised for lower common mode failure, fire etc
N+1 is specified for the UPS so a high degree of maintenance can be concurrent
Dual-corded loads should be fed by separate A+B PDU’s whilst only the single-corded loads should be fed via STS’s (performing a maintenance function rather than Availability enhancement)
Note the scope for a rapid upgrade to a ‘B’ UPS and Tier IV (but don’t forget the other systems)
An important ‘extra’ here is the Load Bus Synchronisation. When the STS’s can have UPS power on one input and the generator supply on the other, it is essential (for the load) to keep the two supplies within 30° of each other
Availability of Power at the load typically 99.99%
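The 30° Load Bus Synchronisation window mentioned above amounts to a wrap-around angle comparison. The sketch below only illustrates that arithmetic; real LBS and STS controllers do this in their own firmware, and the function names here are hypothetical.

```python
def phase_difference_deg(phase_a, phase_b):
    """Smallest absolute angle between two phase angles, in degrees (0 to 180)."""
    diff = abs(phase_a - phase_b) % 360
    return min(diff, 360 - diff)


def within_sync_window(phase_a, phase_b, limit_deg=30.0):
    """True if the two supplies are close enough in phase for a transfer."""
    return phase_difference_deg(phase_a, phase_b) <= limit_deg


print(within_sync_window(350, 10))  # 20 degrees apart across the 0/360 wrap
print(within_sync_window(0, 45))    # 45 degrees apart
```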
Tier IV – the Uptime purist’s configuration
For a further dialogue on this circuit-breaker see Appendix B
Tier IV – The Uptime Institute original
• Complete physical segregation of the two power supplies from the grid to the dual-corded load – a true Dual-Bus system
- 2x(N+1) in every system, maximum 90% load
- Concurrent maintenance possible without load shut down and without losing N+1 redundancy
- Needs two grid sub-stations (they will be on the same MV ring) and diverse cable routes into the site
- Two mechanical load power switchboards in dual-bus
- Note! Many engineers question having N+1 on both A & B buses
• ONLY dual-corded loads
- No STS’s, no common point of failure except the load
- Simple to operate, hence reliable
• With care in design, installation, operation and maintenance, 99.999% power Availability possible
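To compare the Tiers on one scale, the power Availability figures quoted on the Tier I to IV slides can be turned into hours of downtime. This sketch assumes the same five-year window for all four figures (the slides state it explicitly only for Tier I and Tier II); the names are illustrative.

```python
HOURS_5Y = 5 * 8760  # assumption: five 365-day years

# Power Availability at the load, as quoted on the Tier slides (percent)
TIER_POWER_AVAILABILITY = {"I": 99.95, "II": 99.98, "III": 99.99, "IV": 99.999}


def downtime_5y(availability_pct):
    """Hours without power at the load over five years."""
    return (1 - availability_pct / 100) * HOURS_5Y


for tier, pct in TIER_POWER_AVAILABILITY.items():
    print(f"Tier {tier}: {pct}% -> {downtime_5y(pct):.2f} h in 5 years")
```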
Tier IV + STS’s + LBS = Tier how many?
Tier III.5 or IV.5? That is the question!
• Not all loads are dual-corded
- Load transparent switching via STS’s is a great maintenance tool
• Classic Tier IV but with STS’s for single-corded loads
- Absolutely needs the addition of Load Bus Synchronisation
• Insist on three PDU’s in the data-room
- ‘A’ fed from UPS-A for dual-cord loads
- ‘B’ fed from UPS-B for (the other feed of) dual-cord loads
- ‘A/B’ with STS fed from UPS-A & B for single-cord loads
- Feeding dual-corded loads via STS’s reduces Availability to that of the STS itself and negates the principle of dual-bus
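The three-PDU rule above can be written as a simple lookup from cord count to feed. This is a sketch of the rule as stated on the slide, not of any real asset-management tool; the function name and return labels are assumptions.

```python
def pdu_feeds(cord_count):
    """Return the PDUs a load should be plugged into under the three-PDU rule."""
    if cord_count == 1:
        return ["A/B (via STS)"]  # single-cord loads get the STS-fed PDU
    if cord_count == 2:
        return ["A", "B"]         # dual-cord loads never go through an STS
    raise ValueError("only single- and dual-corded loads are covered")


print(pdu_feeds(1))
print(pdu_feeds(2))
```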
Tier IV + STS’s + detail from TIA-942
Unique to TIA-942 - in the detail
• Tier IV has to have impedance based battery monitoring systems
• TIA-942 says that when a system (A or B) is shut down for routine maintenance then the maintenance bypass should be energised by a UPS supply
- Not to rely on the dual-corded loads to operate with one feed dead?
- TIA-942, Page 123, RH column ‘UPS Maintenance Bypass Arrangement’
• A third UPS (C) system? Uneconomic, space hungry, utilisation 0.05% and a poor return on investment
- Chloride solution (red-line on diagram)
  • Cross-feed the output of each UPS system to the maintenance bypass of the alternate system
  • Manual control, padlocked and interlocked isolators, break-before-make, no hot-transfer, no point of common coupling in an auto-mode, sync-check blocking relays across breakers = safe
Secure Power Always
Ian F Bitterlin
International Sales Director
Contact details
Tel: +44 (0) 7717 467 579
E mail: [email protected]
Web: www.chloridepower.com
Power Availability
Your mission critical hardware can only deliver its maximum potential if it has continuous voltage available at the input terminals and the rest of the facility ‘works’.
Power Availability = 99.9..?..9%
Cooling Availability = 99.9..?..9%
Security & Fire = 99.9..?..9%
Human Error, Training & Management = 99.9..?..9%
Natural & Unnatural Disaster = 99.9..?..9%
Appendix A – Understanding ‘Nines’
Availability: How good is 99.9%?
UK numbers
• 44 minutes of unsafe drinking water per month
• 3 crash-landings per week at Heathrow
• 3,000 letters lost by The Post Office, every hour
• 2,000 surgical mistakes in the NHS, every week
• 9,000 incorrect banking debits per hour
• 32,000 missed heartbeats, per person, per year
- Not all in one go, please.
“Availability”
The most abused piece of “data” in the industry
• The term, and concept of, “Availability” is often misused to describe the “Quality” of a data centre facility power supply
• It is generally assumed (but rarely achieved) that this also applies to the associated cooling system
• Expressed in terms of the percentage uptime, better known as a ‘number-of-nines’
- 99.999% = “Five-Nines”
• The calculation is based upon:
- MTBF = the Mean Time Between Failure (hours)
- MDT = the Mean Down Time (hours)
Simple Arithmetic
• Availability = MTBF ÷ (MTBF + MDT) x 100%
• Example
- MTBF = 25,000 hours
- MDT = 1 hour
- Availability = (25,000 ÷ 25,001) x 100 = 99.996%
- Note that a 100% “reliable” system that needs annual maintenance cannot have A=100%!
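The formula above translates directly into code. A minimal sketch (the function name `availability_pct` is an assumption), reproducing the worked example and the point that any downtime at all, planned or not, keeps Availability below 100%:

```python
def availability_pct(mtbf_hours, mdt_hours):
    """Availability = MTBF / (MTBF + MDT), expressed as a percentage."""
    return mtbf_hours / (mtbf_hours + mdt_hours) * 100


print(f"{availability_pct(25_000, 1):.3f}%")  # 99.996%

# Illustrative only: a system that never 'fails' but is shut down 4 h a year
print(f"{availability_pct(8760 - 4, 4):.3f}%")
```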
Availability Nines: A measure of “quality”?
MTBF            10 years     1 month      1 day
MDT             1 hour       30 seconds   1 second
Availability    99.99885%    99.99885%    99.99884%
Four-Nines = OK? But do you really want a failure every day?
In reality it’s worse. Assuming the system recovery time is 6 hours:
MDT             6+1 hours    6h+30s       6h+1s
Availability    99.992%      99.17%       74.99%
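The effect of the 6-hour recovery time can be reproduced with the MTBF ÷ (MTBF + MDT) formula. This sketch assumes a 30-day month and works in seconds; all names are illustrative. Under that formula the one-day case comes out close to 80%; the 74.99% shown above appears to correspond to counting the downtime inside the MTBF window, i.e. 1 − MDT ÷ MTBF, which the second function shows for comparison.

```python
H = 3600           # seconds per hour
RECOVERY = 6 * H   # system recovery time added to every failure

# (MTBF, MDT) in seconds; assumption: 30-day month, 365-day year
CASES = {
    "10 years": (10 * 8760 * H, 1 * H),
    "1 month": (30 * 24 * H, 30),
    "1 day": (24 * H, 1),
}


def availability_pct(mtbf_s, mdt_s):
    """Uptime counted separately from downtime: MTBF / (MTBF + MDT)."""
    return mtbf_s / (mtbf_s + mdt_s) * 100


def availability_within_window_pct(mtbf_s, mdt_s):
    """Alternative convention: downtime counted inside the MTBF window."""
    return (1 - mdt_s / mtbf_s) * 100


for label, (mtbf, mdt) in CASES.items():
    total = mdt + RECOVERY
    print(
        f"{label:9s} bare {availability_pct(mtbf, mdt):.5f}%  "
        f"with recovery {availability_pct(mtbf, total):.3f}%  "
        f"(window convention {availability_within_window_pct(mtbf, total):.3f}%)"
    )
```

Whichever convention is used, the conclusion of the slide stands: identical ‘nines’ can hide very different operational experiences.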
20ms power events in 12 months?
How many computer crashes will you accept?
Availability    “Nines”   MDT         20ms failures
99.0%           2         87.6 hrs    15,768,000
99.9%           3         8.76 hrs    1,576,800
99.99%          4         53 min      157,680
99.999%         5         5.3 min     15,768
99.9999%        6         31.5 sec    1,577
99.99999%       7         3.15 sec    158
99.999999%      8         315 ms      15
99.9999999%     9         31.5 ms     2
The “Nines” cannot be applied to power over a single year!
Better to use MTBF/MDT for ‘one’ failure event
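The table of 20ms events can be regenerated from the number of nines alone. A short sketch, assuming a 365-day year and that all the permitted downtime is spent in 20ms events (names are illustrative; the slide’s figures are rounded):

```python
SECONDS_PER_YEAR = 8760 * 3600  # assumption: 365-day year
EVENT_S = 0.020                 # one 20 ms power event


def yearly_downtime_s(nines):
    """Seconds of downtime per year for an Availability with this many nines."""
    return SECONDS_PER_YEAR * 10 ** -nines


for nines in range(2, 10):
    down = yearly_downtime_s(nines)
    print(f"{nines} nines: {down:.4g} s per year = {down / EVENT_S:,.1f} events of 20 ms")
```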
There are three common abuses
• 99.9% - sounds good but needs to be looked at carefully
• 99.999..9% (even 100%) with no obvious technical foundation or caveats about time-frame, events or maintenance for the claim
• Hardly anyone ever mentions if there is increased risk during essential maintenance – or if maintenance can be carried out at all without a load shutdown
- The single most important factor in achieving high MTBF figures in the field is correct, expert and timely maintenance
“High-Nines” Availability?
Beware systems that cannot be maintained!
• A UPS system, cooling system and distribution scheme require intrusive maintenance every year for four hours, therefore the maximum Availability of “Single-Bus” space is 99.95%
• Can you afford the risk of doing M&E at the same time?
- If not, it could be two shutdowns and <99.9%
• Only Dual-Bus systems can be applied and achieve “high nines” with Maintenance
• Most critical systems installed today have a “weaker” cooling design than electrical
- Even though the mechanical system has a much wider fault tolerance the designer should ensure a balance is achieved between the electrical and the mechanical designs, including the “maintenance without shutdown” scenario
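The 99.95% ceiling for single-bus space follows from the annual shutdown alone, before any failure is counted. A minimal sketch, assuming a 365-day year (the function name is hypothetical):

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def availability_ceiling_pct(shutdown_hours_per_year):
    """Best possible Availability when the load must be shut down for maintenance."""
    return (1 - shutdown_hours_per_year / HOURS_PER_YEAR) * 100


print(f"{availability_ceiling_pct(4):.3f}%")  # one four-hour shutdown a year
```

Any additional shutdown hours are simply added to the argument, which is why separate M&E outages push the ceiling down further.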
Appendix B – Load isolation breaker and N+?
• To be able to run the load via the bypass and test the UPS system as a parallel group is a very attractive and useful operational/maintenance feature
- The load isolation breaker enables that function
• Generally that means that between the PDU and the output bus of the UPS system there are at least two MCCB’s or ACB’s in series
- Typical MTBF published at 250,000h (28.5y) with maintenance
• This negates the advantage of applying any reliability enhancement strategy using N+(more than 1)
Distribution limits the UPS Availability
[Diagram labels: Utility/Generator Feed, Input Switchboard, Maintenance Bypass, Output Switchboard, Critical Load Bus]
• N+X UPS does not improve things
• Bus-voltage Availability depends upon these two switches
• Single-bus maximum MTBF = 125,000h (14 years)
• e.g. With an 8h MDT that is = 99.99%
Typically 250,000h MTBF each
Two in series = 125,000h MTBF
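The halving of MTBF for two breakers in series comes from adding their failure rates. A sketch under the usual simplifying assumptions of independent components with constant failure rates (function names are illustrative):

```python
def series_mtbf(*mtbfs_hours):
    """MTBF of components in series: the failure rates (1 / MTBF) add up."""
    return 1 / sum(1 / m for m in mtbfs_hours)


def availability_pct(mtbf_hours, mdt_hours):
    """Availability = MTBF / (MTBF + MDT), expressed as a percentage."""
    return mtbf_hours / (mtbf_hours + mdt_hours) * 100


bus_mtbf = series_mtbf(250_000, 250_000)
print(f"Bus MTBF: {bus_mtbf:,.0f} h ({bus_mtbf / 8760:.1f} years)")
print(f"Availability with an 8 h MDT: {availability_pct(bus_mtbf, 8):.4f}%")
```

However reliable the UPS group upstream, the bus voltage cannot be more available than these two switches allow.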
N+1 redundant UPS architecture: N?
600kVA Load

1+1
- 100% Redundancy
- 2x 600kVA modules
- R = 10*
- Day One only
- Highest UPS CapEx
- High risk of partial load
- High load step
- 1200kVA of batteries

2+1
- 50% Redundancy
- 3x 300kVA modules
- R = 9
- Day One to Two
- Scope for load shrink
- Medium risk of partial load
- Medium load step
- 900kVA of batteries
- 25% space saving

3+1
- 25% Redundancy
- 4x 200kVA modules
- R = 8
- Day One to Three
- High scope for load shrink
- Low risk of partial load
- Low load step
- 800kVA of batteries
- 33% space saving

Lower battery CapEx etc
*Based on Reliability (R) of a single module = 1
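The module sizes and battery capacities in the comparison above follow from splitting the 600kVA load across N modules and adding one spare. A sketch, assuming the battery capacity tracks the installed module capacity and that the space saving is measured against the 1+1 case (the function name `n_plus_1` is illustrative):

```python
def n_plus_1(load_kva, n):
    """Return (module size, total installed capacity) for an N+1 UPS group."""
    module = load_kva / n
    installed = module * (n + 1)
    return module, installed


LOAD_KVA = 600
_, baseline = n_plus_1(LOAD_KVA, 1)  # 1+1 is the reference case

for n in (1, 2, 3):
    module, installed = n_plus_1(LOAD_KVA, n)
    saving = (1 - installed / baseline) * 100
    print(f"{n}+1: {n + 1} x {module:.0f}kVA = {installed:.0f}kVA installed, "
          f"{saving:.0f}% less than 1+1")
```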