Data Centre Compute and Overhead Costs - Delivering End-to-end KPIs

Michael Rudgyard (CTO)

Concurrent Thinking Ltd

• Background in High Performance Computing & Scale-out Computing
  – Gives us a unique perspective on DCIM

• Founded Concurrent Thinking in 2010
  – Focussed on tools for operational efficiency in the Data Centre
  – Exploit an existing & mature product that was originally developed for HPC
  – Investment from Carbon Trust Investments
  – Launched new products at DatacenterDynamics, Nov 2011

Our Background

Bridging the Divides – Facilities, IT & Management

It’s all about virtualization

It’s all about cooling

What constitutes an efficient data centre ?

It’s all about procurement

It’s about staff efficiency

• Continuous monitoring & active management of IT & Facilities systems
  – Building management systems
  – Environmental systems (temperature, humidity, air-conditioning, …)
  – Power (at the distribution board, rack PDU and server PSU level, …)
  – IT equipment (including server health)
  – Operating systems & Virtual Machines
  – Application Performance

• We leverage standards-based protocols
  – OPC, Modbus, 1-Wire, SNMP, IPMI, Intel Node Manager, WMI

• …and offer monitoring agents and extensible means to monitor non-standard M&E equipment

What we do… Data Centre Infrastructure Management

• Aims
  – To truly understand where operational savings can be made
  – To understand how factors vary over time, with load, etc.
  – To give ample warning of potential (often critical) issues
  – To report factual information to management
  – To drive continuous iterative improvement over time

• Real energy and productivity savings require a ‘joined-up’ approach
  – Managing buildings, data-centre facilities and IT in a unified manner
  – … opening the door to the possibility of orchestration of the data-centre

Why Data Centre Infrastructure Management ?

• We provide a tool that:
  – Tracks power to the server/network (and OS/VM/application) level
  – Allows for reporting by department, customer or end-user
  – Offers a simple interface to present data for different purposes
  – Has integrated IT asset management
  – Generates business intelligence on end-to-end service delivery
  – Is both user-extensible and built to scale (visually & architecturally)

Our Approach

[Figure: Compute Utilisation Effectiveness, Storage Utilisation Effectiveness and Network Utilisation Effectiveness, each shown on a scale from 0 to 1]

• We don’t push particular metrics (e.g. PUE, ITUE, ITEE, FVER, …)

• DCIM is a tool that should enable customers to define their own KPIs

What are the important data centre metrics ?

• Potential performance metrics:
  – CPU utilisation (× CPU benchmark) per watt
  – IOPS per watt
  – Bytes per watt

• To produce these metrics we monitor:
  – OS metrics via SNMP (Linux/MS) or WMI (MS)
  – Server power usage (via a managed PDU or IPMI)
  – (CPU benchmark figure)
  – Power overhead for cooling, power distribution etc. (with a share apportioned to this server)
  – Power cost (at different times)
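As an illustration only, the monitored quantities above could be combined into a "benchmark-weighted CPU work per watt" figure roughly as follows. This is a minimal sketch, not the product's implementation: the names `ServerSample`, `work_per_watt` and `apportion_overhead` are hypothetical, and apportioning overhead in proportion to each server's share of IT power is an assumption (other apportionment rules are possible).

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    """One monitoring sample for a single server (all field names are hypothetical)."""
    cpu_utilisation: float   # fraction in [0, 1], e.g. read via SNMP or WMI
    cpu_benchmark: float     # benchmark score of the CPU at full load (arbitrary units)
    server_power_w: float    # watts, measured at a managed PDU or via IPMI
    overhead_power_w: float  # watts of cooling/distribution overhead apportioned to this server


def apportion_overhead(total_overhead_w, server_power_w, total_it_power_w):
    """Assumed rule: overhead is shared in proportion to the server's share of IT power."""
    if total_it_power_w <= 0:
        raise ValueError("total IT power must be positive")
    return total_overhead_w * server_power_w / total_it_power_w


def work_per_watt(sample, include_overhead=True):
    """CPU utilisation x CPU benchmark, divided by the power attributed to the server."""
    power = sample.server_power_w
    if include_overhead:
        power += sample.overhead_power_w
    if power <= 0:
        raise ValueError("attributed power must be positive")
    return sample.cpu_utilisation * sample.cpu_benchmark / power


if __name__ == "__main__":
    overhead = apportion_overhead(total_overhead_w=1000.0, server_power_w=200.0,
                                  total_it_power_w=2000.0)
    sample = ServerSample(cpu_utilisation=0.5, cpu_benchmark=1000.0,
                          server_power_w=200.0, overhead_power_w=overhead)
    print(work_per_watt(sample))
```

Computing the figure with and without the overhead term makes it possible to separate the efficiency of the server itself from the efficiency of the facility around it.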

Example 1 – OS performance monitoring

• For a typical MS Exchange service, the most useful metrics might be:
  – Power usage per email (OPEX only)
  – Cost per email (OPEX or OPEX + CAPEX)
  – CO2 per email

• WMI now provides the necessary application performance metrics
  – The number of email transactions
  – Server power usage (as above)
  – Power overhead for cooling and power distribution etc. (as above)
  – Power cost (as above)
  – Asset depreciation model
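The per-email figures could be derived from these inputs along the following lines. This is a hedged sketch under stated assumptions: the function names are hypothetical, "power per email" is interpreted as energy (watt-hours) per email over a reporting interval, the CO2 figure uses a grid carbon-intensity factor supplied by the caller, and the depreciation model is assumed to be straight-line.

```python
def straight_line_depreciation(purchase_price, lifetime_hours, interval_hours):
    """Assumed depreciation model: the asset loses value evenly over its lifetime."""
    if lifetime_hours <= 0:
        raise ValueError("lifetime must be positive")
    return purchase_price * interval_hours / lifetime_hours


def per_transaction_kpis(transactions, server_kwh, overhead_kwh,
                         price_per_kwh, co2_kg_per_kwh, depreciation=0.0):
    """Return energy, cost and CO2 per transaction for one reporting interval.

    transactions   -- e.g. number of emails handled in the interval
    server_kwh     -- energy drawn by the server in the interval
    overhead_kwh   -- cooling/distribution energy apportioned to the server
    depreciation   -- CAPEX written off during the interval (0 for OPEX only)
    """
    if transactions <= 0:
        raise ValueError("need at least one transaction")
    total_kwh = server_kwh + overhead_kwh
    opex = total_kwh * price_per_kwh
    return {
        "energy_wh": total_kwh * 1000.0 / transactions,
        "opex_cost": opex / transactions,
        "opex_plus_capex_cost": (opex + depreciation) / transactions,
        "co2_g": total_kwh * co2_kg_per_kwh * 1000.0 / transactions,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    capex = straight_line_depreciation(purchase_price=3000.0,
                                       lifetime_hours=3 * 365 * 24,
                                       interval_hours=24)
    print(per_transaction_kpis(10_000, 4.0, 1.0, 0.2, 0.5, capex))
```

The same function applies unchanged to database queries or web requests; only the source of the transaction count differs.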

Example 2 – Microsoft Exchange

• For a web service, the most useful metrics might be:
  – Power per database query
  – Cost per database query
  – CO2 per database query

• SNMP now provides the application performance information

Example 3 – Linux MySQL Server

• For a web service, the most useful metrics might be:
  – Power per HTML query
  – Cost per HTML query
  – CO2 per HTML query

• Unfortunately, SNMP support for Apache is poor
  – Best option was to install the Apache ‘status module’
  – Read the number of web transactions from the status module web page
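Reading the transaction count from the status module could look something like the sketch below. It assumes the machine-readable variant of the Apache `mod_status` page (requested with `?auto`), which reports a cumulative `Total Accesses` counter when extended status is enabled. The URL and function names are hypothetical, and this is not necessarily how the product reads the page.

```python
import urllib.request


def fetch_status(url="http://localhost/server-status?auto"):
    """Fetch the status page text. The URL is a placeholder for illustration."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_total_accesses(status_text):
    """Extract the cumulative request counter from '?auto' style output."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Total Accesses":
            return int(value.strip())
    raise ValueError("no 'Total Accesses' field found; is extended status enabled?")


def requests_between(earlier_text, later_text):
    """Number of requests served between two polls of the status page."""
    earlier = parse_total_accesses(earlier_text)
    later = parse_total_accesses(later_text)
    if later < earlier:
        # The counter restarts when Apache restarts; count only what is known.
        return later
    return later - earlier


if __name__ == "__main__":
    first = "Total Accesses: 1500\nTotal kBytes: 3200\nUptime: 600\n"
    second = "Total Accesses: 1800\nTotal kBytes: 4100\nUptime: 660\n"
    print(requests_between(first, second))
```

Dividing the energy used over the polling interval by the value returned from `requests_between` gives the per-request figure.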

Example 4 – Linux Apache Web Server

• Assume a single application per virtual machine

• Issue now is: what is the power used by a virtual machine ?

• Our solution: ‘inferred metrics’
  – Use another metric (e.g. CPU utilisation) as a proxy for power usage
  – Attribute the power used by a server to individual VMs
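One simple way to realise such an inferred metric is to split the measured server power between VMs in proportion to their CPU utilisation. The sketch below is an assumption about how this might be done, not the product's actual algorithm: the function name is hypothetical, and the optional even split of idle power between VMs is an added modelling choice.

```python
def attribute_power_to_vms(server_power_w, vm_cpu_utilisation, idle_power_w=0.0):
    """Split measured server power across VMs using CPU utilisation as a proxy.

    server_power_w     -- measured power of the physical host (watts)
    vm_cpu_utilisation -- mapping of VM name to CPU utilisation (any consistent unit)
    idle_power_w       -- assumed idle power of the host, shared evenly between VMs
    """
    if not vm_cpu_utilisation:
        return {}
    count = len(vm_cpu_utilisation)
    total_utilisation = sum(vm_cpu_utilisation.values())
    dynamic_w = max(server_power_w - idle_power_w, 0.0)
    idle_w = server_power_w - dynamic_w
    attributed = {}
    for name, utilisation in vm_cpu_utilisation.items():
        # Fall back to an even split when no VM reports any load.
        share = utilisation / total_utilisation if total_utilisation > 0 else 1.0 / count
        attributed[name] = idle_w / count + dynamic_w * share
    return attributed


if __name__ == "__main__":
    # Illustrative values only.
    print(attribute_power_to_vms(300.0, {"mail": 0.6, "web": 0.2}, idle_power_w=100.0))
```

The attributed values always sum to the measured server power, so per-VM figures remain consistent with the readings from the PDU or IPMI.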

Application performance on virtual machines

• Which servers are underused/inefficient/should be virtualised ?

• Which servers are better at delivering a particular service ?
  – Provides useful procurement information !
  – (or which application gives better performance on the same hardware ?)

• When should I retire old servers ?
  – Sweating IT assets is often a very bad idea indeed !

Using this information (1)

• Which departments are using their IT resources wisely ?
  – Define server groups and report by department

• Charge departments for individual power usage
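Grouping servers by department and charging for power might be sketched as follows. The function name, the flat tariff and the "unassigned" bucket for servers without a department are all assumptions made for illustration.

```python
from collections import defaultdict


def charge_by_department(server_energy_kwh, server_department, price_per_kwh):
    """Sum the energy cost of each server into the department that owns it.

    server_energy_kwh -- mapping of server name to energy used in the period (kWh)
    server_department -- mapping of server name to department (the server groups)
    price_per_kwh     -- flat tariff assumed for the whole period
    """
    charges = defaultdict(float)
    for server, kwh in server_energy_kwh.items():
        department = server_department.get(server, "unassigned")
        charges[department] += kwh * price_per_kwh
    return dict(charges)


if __name__ == "__main__":
    # Illustrative values only.
    energy = {"web1": 10.0, "web2": 5.0, "db1": 20.0, "old": 1.0}
    groups = {"web1": "sales", "web2": "sales", "db1": "finance"}
    print(charge_by_department(energy, groups, price_per_kwh=0.1))
```

Keeping an explicit bucket for unassigned servers makes gaps in the asset database visible in the report instead of silently dropping their cost.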

Using this information (2)

• It is straightforward to monitor many KPIs for a data centre
  – From PUE, to ITUE and “application utilisation efficiency”
  – Requires a proper monitoring & reporting tool, with inbuilt asset management
  – Requires power monitoring hardware (managed PDUs or modern servers)
  – Requires suitable configuration (relatively easy for small numbers of apps)

• It is straightforward to apportion costs by racks, servers and by department (if application servers are not shared)

• The ROI can be very significant

• Can we monitor granular information by user at the app level ?
  – Ongoing collaborations with the University of Hertfordshire and Surrey University
  – Collaboration on HPC with HPC Wales and STFC Daresbury

Conclusions and open questions
