© 2014 Uptime Institute Is your data center on the verge of a crisis? Julian Kudritzki Chief Operating Officer Uptime Institute

Is your data center on the verge of a crisis?


DESCRIPTION

What are the symptoms of a poorly managed data center facility? How close are you to an operating failure or catastrophic downtime event? Learn how to spot the warning signs and start improving your facility management program immediately to minimize the risk of downtime, reduce costs, and upgrade your operations.


Page 1: Is your data center on the verge of a crisis?


Is your data center on the verge of a crisis?

Julian Kudritzki Chief Operating Officer

Uptime Institute

Page 2: Is your data center on the verge of a crisis?

What Defines a Crisis?

2

Page 3: Is your data center on the verge of a crisis?

Tour of Operational Computer Room

3

Page 4: Is your data center on the verge of a crisis?

Looking for Clues

4

Page 5: Is your data center on the verge of a crisis?

Tour of ‘Live’ Critical Spaces

5

Page 6: Is your data center on the verge of a crisis?

Daily Practices Compromise Uptime, Safety, and Security

6

Page 7: Is your data center on the verge of a crisis?

•  Overtime hours exceeding 10%
•  Voice mail boxes full
•  Emails not responded to
•  Email inbox size limit exceeded
•  Meetings missed or routinely cancelled
•  No time for training
•  Shortage of qualified staff
•  Personnel performing work outside their competency
•  Everything is an emergency
•  Personnel turnover

What Else Is Going On?

7

Page 8: Is your data center on the verge of a crisis?

•  Break fix budget exceeded
•  Maintenance budget exceeded
•  Energy cost estimate exceeded or unknown
•  Last minute deployment requirements
•  No organization chart
•  No responsibilities matrix
•  No records of maintenance activities
•  No written policies & procedures
•  No preventive maintenance schedule
•  Back of the server looks like a spaghetti pot exploded

The Issues Add Up

8

Page 9: Is your data center on the verge of a crisis?

•  Cabling is not labeled or, worse, incorrectly labeled
•  Equipment is not uniquely labeled
•  Loads are consistently out of balance
•  Capacities are not managed or tracked
•  Deferred maintenance exceeds 10%
•  Housekeeping: if it looks like a mess, it is a mess

Maybe you don’t have a crisis, but how do you know how well your data center operation compares to the rest of the industry?

The Issues Add Up

9
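Two of the warning signs above carry a number: overtime hours exceeding 10% and deferred maintenance exceeding 10%. A minimal sketch of how a site might check both is shown here. The slides do not say what the percentages are measured against, so the denominators (regular hours worked, PM tasks scheduled) and every name in the code are assumptions made for illustration.

```python
from dataclasses import dataclass

# The 10% figure comes from the slides; applying it as a strict
# "greater than" comparison is an assumption.
WARNING_THRESHOLD = 0.10


@dataclass
class SiteMetrics:
    """Hypothetical monthly figures for one site."""
    regular_hours: float      # regular staff hours worked
    overtime_hours: float     # overtime hours worked
    pm_tasks_scheduled: int   # preventive maintenance tasks due
    pm_tasks_deferred: int    # tasks pushed past their due date


def ratio(part: float, whole: float) -> float:
    """Return part/whole, or 0.0 when there is nothing to measure against."""
    return part / whole if whole else 0.0


def warning_signs(m: SiteMetrics) -> list[str]:
    """List which of the two numeric warning signs the site trips."""
    signs = []
    if ratio(m.overtime_hours, m.regular_hours) > WARNING_THRESHOLD:
        signs.append("overtime")
    if ratio(m.pm_tasks_deferred, m.pm_tasks_scheduled) > WARNING_THRESHOLD:
        signs.append("deferred_maintenance")
    return signs
```

For example, `warning_signs(SiteMetrics(1600, 200, 120, 6))` flags only overtime: 200 overtime hours against 1,600 regular hours is 12.5%, while 6 deferred tasks out of 120 is 5%.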

Page 10: Is your data center on the verge of a crisis?

Are you confident in your Facilities team’s capability to manage a technologically advanced and highly efficient design to your 24 x 7 uptime requirements?

•  Can you easily replace any member of that team?
•  Are you protected against poor operations practices migrating from older sites to higher criticality data centers?
•  Do you have sites that operate in isolation, ignoring global corporate standards?
•  Do you even have corporate global standards?
•  If you outsource any aspect of your data center operations, how do you avoid losing responsibility and accountability?
•  Do you manage an outsourcing contract... or direct an expert team?

Ask the Tough Questions

10

Page 11: Is your data center on the verge of a crisis?

•  Initial review
•  Gap analysis against industry best practices
   §  Staffing and Organization
   §  Maintenance
   §  Training
   §  Planning, Coordination & Management
   §  Operating Conditions
•  Roadmap to operational excellence
•  Plan changes
•  Implement changes
•  Monitor & refine
•  Annual review

Path to Data Center Operations Success

11

Page 12: Is your data center on the verge of a crisis?

Key Elements of Facilities Management

Staffing and Organization
•  Staffing
•  Qualifications
•  Organization

Maintenance
•  Preventive Maintenance (PM) Program
•  Housekeeping Policies
•  Maintenance Management System (MMS)
•  Vendor Support
•  Deferred Maintenance Program
•  Predictive Maintenance
•  Life-Cycle Planning
•  Failure Analysis Program

12

Page 13: Is your data center on the verge of a crisis?

Key Elements of Facilities Management

Training
•  Data Center Staff
•  Vendors

Planning, Coordination, and Management
•  Site Policies
•  Financial Management
•  Reference Library
•  Computer Room Management

Operating Conditions
•  Load Management
•  Operating Set Points
•  Alternating Use of Infrastructure Equipment

13

Page 14: Is your data center on the verge of a crisis?

Over the years, Uptime Institute has observed that management issues pose the largest risk to the uptime of physical infrastructure:

•  Inadequate staffing
•  Ineffective or nonexistent maintenance and training programs
•  Lacking processes and procedures
•  Resulting in the majority of outages being caused by ‘human error’

No standard existed to help Owners/Operators determine:

•  Common language/vocabulary of data center operations
•  Focus of data center management
•  Resource allocation
•  Resource requirements

Genesis of Industry Best Practices

14

Page 15: Is your data center on the verge of a crisis?

Data Center Owners / Operators / End Users
•  Increased availability and cost savings
•  Multi-site consistency
•  Benchmark for continuous monitoring and refinement

Colocation / Managed Services Sites
•  All of the above plus…
•  Customer assurance of consistency
•  Competitive differentiator (attain & retain certification)

Industry Benchmark
•  No need to rely on opinions and anecdotes

Value of Industry Best Practices

15

Page 16: Is your data center on the verge of a crisis?

Uptime Institute has been conducting Operational Sustainability Reviews for approximately 3 years, drawing on decades of site operations knowledge and experience:

•  Operational Sustainability Certifications: Tier + Gold, Silver, or Bronze
•  Management & Operations (M&O) Stamps of Approval

See http://uptimeinstitute.com/publications for Tier Standard: Operational Sustainability

Best Practices Reviews

16

Page 17: Is your data center on the verge of a crisis?

Staffing
•  Inadequate staffing
•  Excessive overtime (over 10%)
•  No escalation process

Qualification
•  No list of required qualifications
•  No experience with data center specific equipment

Organization
•  Roles and Responsibilities not documented
•  Data center organization not integrated

Staffing and Organization Significant Findings

17

Page 18: Is your data center on the verge of a crisis?

Preventive Maintenance (PM)
•  No list of required PM activities
•  PM activities not fully scripted
•  No quality control process

Housekeeping
•  Combustibles in the data center
•  No documented housekeeping policy

Maintenance Management System (MMS)
•  No list of equipment
•  Missing critical data: warranty info, maintenance history, performance data, etc.

Maintenance Significant Findings

18
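The MMS findings above name three kinds of critical data that reviews often find missing: warranty info, maintenance history, and performance data. As a rough illustration, a per-asset record with a completeness check might look like the sketch below. The record layout, field names, and asset ID are hypothetical and are not taken from any particular MMS product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EquipmentRecord:
    """Hypothetical MMS entry for one piece of infrastructure equipment."""
    asset_id: str
    warranty_expires: Optional[str] = None                  # ISO date, e.g. "2016-01-31"
    maintenance_history: list = field(default_factory=list)  # past work orders
    performance_data: list = field(default_factory=list)     # readings, test results


def missing_critical_data(rec: EquipmentRecord) -> list[str]:
    """Name the critical data categories that are empty for this asset."""
    missing = []
    if rec.warranty_expires is None:
        missing.append("warranty info")
    if not rec.maintenance_history:
        missing.append("maintenance history")
    if not rec.performance_data:
        missing.append("performance data")
    return missing
```

Running the check across the full equipment list would also expose the first finding, since it cannot be run at all without a list of equipment.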

Page 19: Is your data center on the verge of a crisis?

Vendor Support
•  Contracts missing response times, call-in process, detailed SOW, or technician qualifications

Deferred Maintenance
•  Unable to produce deferred maintenance report from MMS

Predictive Maintenance
•  No predictive maintenance program
•  Not comparing current results with previous results

Maintenance Significant Findings

19

Page 20: Is your data center on the verge of a crisis?

Life-Cycle Planning
•  No life-cycle plan
•  Not using MMS data to develop plan

Failure Analysis
•  No record of outages or near misses

Maintenance Significant Findings

20

Page 21: Is your data center on the verge of a crisis?

Data Center Staff
•  Undocumented On-the-Job Training (OJT) programs
•  No formal qualification program
•  No list of training required by position
•  No formal training program with lesson plans, etc.

Vendors
•  No briefing for escorted vendors

Training Significant Findings

21

Page 22: Is your data center on the verge of a crisis?

Load Management
•  Alarm settings not documented
•  Alarms not set on PDUs to ensure maximum loads are not exceeded

Operating Set Points
•  Cooling set points are not documented or part of the Change Management Process
•  Changing of set points is not controlled

Operating Conditions Significant Findings

22
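The load management findings above concern PDU alarms that are missing or undocumented, and an earlier slide lists loads that are consistently out of balance. A minimal sketch of both checks is shown here. The 80% alarm fraction is an illustrative default rather than a figure from the slides, and the imbalance formula (largest deviation from the mean phase current, divided by the mean) is one common definition chosen for the example.

```python
def phase_imbalance(phase_amps: list[float]) -> float:
    """Largest deviation from the mean phase current, as a fraction of the mean."""
    mean = sum(phase_amps) / len(phase_amps)
    if mean == 0:
        return 0.0
    return max(abs(a - mean) for a in phase_amps) / mean


def phases_in_alarm(phase_amps: list[float],
                    breaker_rating_amps: float,
                    alarm_fraction: float = 0.8) -> list[int]:
    """Return the indexes of phases whose current exceeds the alarm limit.

    alarm_fraction is an assumed setting; each site should document its own.
    """
    limit = breaker_rating_amps * alarm_fraction
    return [i for i, amps in enumerate(phase_amps) if amps > limit]
```

Whatever limits a site chooses, writing them down as named settings like these is what addresses the "alarm settings not documented" finding.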

Page 23: Is your data center on the verge of a crisis?

Site Policies
•  Missing Site Policies
•  Especially Site Configuration Policy

Reference Library
•  No process for keeping documents up-to-date

Capacity Management
•  No process for forecasting future space, power, and cooling requirements
•  No active tracking of cooling capacity
•  Ineffective management of Cold Aisles/Hot Aisles
•  Electrical power monitoring (balancing phases)

Planning, Coordination, and Management Significant Findings

23
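The capacity management findings above include having no process for forecasting future space, power, and cooling requirements. Even a very simple projection gives planners a starting point. The sketch below assumes monthly readings and straight-line growth, which is a simplification made for the example; the function name and units are hypothetical.

```python
from typing import Optional


def months_until_full(monthly_load_kw: list[float],
                      capacity_kw: float) -> Optional[float]:
    """Estimate months until load reaches capacity, assuming linear growth.

    Growth is the average month-over-month change between the first and
    last readings. Returns None when there are too few readings or the
    load is flat or shrinking, since no exhaustion date can be projected.
    """
    if len(monthly_load_kw) < 2:
        return None
    growth = (monthly_load_kw[-1] - monthly_load_kw[0]) / (len(monthly_load_kw) - 1)
    if growth <= 0:
        return None
    remaining = capacity_kw - monthly_load_kw[-1]
    return max(remaining, 0.0) / growth
```

The same calculation applies to cooling load or rack positions if those are tracked in place of kW.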

Page 24: Is your data center on the verge of a crisis?

Facilities
•  Operate and maintain the critical facility infrastructure
•  Support the installation of IT equipment (space, power, & cooling)

IT Management
•  Operate and maintain IT hardware, software, applications, and network connectivity
•  Manage the installation/de-installation of IT equipment

Security
•  Access Control
•  Physical Security

Typical Data Center Disciplines

24

Page 25: Is your data center on the verge of a crisis?

Functionally Separate Organization
•  Corporate Real Estate (Facilities)
•  IT
•  Security

Communication between organizations was typically poor
•  Data center activities conducted without coordination
•  Poor future space, power, and cooling planning

No individual responsible for all aspects of operating a data center

Past Organizational Structures

25

Page 26: Is your data center on the verge of a crisis?

Factors driving changes to organizational structure
•  Rapid changes in technology and speed at which capacity must be brought online
•  Increased costs associated with IT and Facilities
•  Business objectives of continuous computing availability

Legacy organizations could not accommodate quickly evolving business requirements
•  Slow to respond
•  Not integrated

Evolving Organizational Structure

26

Page 27: Is your data center on the verge of a crisis?

The value of industry best practices is in the process of continuous improvement

•  Discovery leads to learning
•  Learning leads to change
•  Change leads to improvement
•  Regular reviews lead to discovery
•  Crises can be avoided

Summary

27

Page 28: Is your data center on the verge of a crisis?

For more information contact: Julian Kudritzki

[email protected] 206.706.4143

Questions?
