© 2009 IBM Corporation
IBM Power Systems
Implementing CEC Concurrent Maintenance
Ron Barker
IBM Power Systems Advanced Technical Support
rfbarker@us.ibm.com
Overview
CEC Concurrent Maintenance (CCM) offers new capabilities in Reliability, Availability and Serviceability (RAS)
Concurrent add and upgrade functions enable the expansion of the processor, memory, and I/O hub subsystems without a system outage
If prerequisites have been met, repairs can be made on the system processor, memory, I/O hub, and other CEC hardware without a system outage
Accomplishing CCM requires careful advance planning and meeting all prerequisites
If desired, customers can continue to schedule maintenance during planned outages
Terminology
Concurrent Maintenance: An add, upgrade or repair made while the server is running
Some system elements may be unavailable during maintenance, but a re-IPL is NOT required to reintegrate all resources
Concurrent Add/Upgrade: Adds new or exchanges hardware components while the system is running
Node: A physical group of processor, memory, and I/O hubs in the system (595 processor book, 570 CEC drawer or module)
Node evacuation: Frees up processor and memory resources from the target node and replaces them with CPU and memory from other nodes if available. De-allocates I/O resources so the node can be electrically isolated from the system for concurrent maintenance.
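The node evacuation step above is, at heart, a capacity question: can the remaining nodes absorb the processors and memory in use on the node being taken offline? A minimal sketch of that check, with made-up node names and figures (the real decision is made by firmware and also weighs I/O and Capacity on Demand resources):

```python
# Illustrative capacity check behind node evacuation: can the remaining
# nodes absorb the CPUs and memory in use on the target node?
# All node names and resource figures here are hypothetical.

def can_evacuate(nodes, target):
    """nodes: dict name -> {'cpus', 'cpus_used', 'mem_gb', 'mem_used_gb'};
    target: name of the node to be evacuated."""
    needed_cpus = nodes[target]["cpus_used"]
    needed_mem = nodes[target]["mem_used_gb"]
    spare_cpus = sum(n["cpus"] - n["cpus_used"]
                     for name, n in nodes.items() if name != target)
    spare_mem = sum(n["mem_gb"] - n["mem_used_gb"]
                    for name, n in nodes.items() if name != target)
    return spare_cpus >= needed_cpus and spare_mem >= needed_mem

nodes = {
    "node0": {"cpus": 8, "cpus_used": 4, "mem_gb": 128, "mem_used_gb": 64},
    "node1": {"cpus": 8, "cpus_used": 6, "mem_gb": 128, "mem_used_gb": 96},
    "node2": {"cpus": 8, "cpus_used": 2, "mem_gb": 128, "mem_used_gb": 32},
}
print(can_evacuate(nodes, "node1"))  # True: node0 + node2 can absorb node1
```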
Terminology
Concurrent Cold Repair: Repairs to components electrically isolated from the running system (de-allocated or “garded”) before the current repair action was started
Repairs following a system shutdown and reboot after hardware failure
Reintegration following repair does NOT require a reboot
Concurrent Hot Repair: Repairs on components that will be electrically isolated from the running system during the repair action
Reintegration following repair does NOT require a reboot
Non-Concurrent Repair: Repairs requiring the system be powered off
GX Adapter: An I/O hub which connects I/O expansion units to the processors and memory in the system (e.g., RIO-2, 12X adapters)
Planning and Prerequisites
CCM has both hardware and firmware prerequisites
Power Systems 595 and 570 only
Hardware Management Console V7R3.4.0 MH01163_0401 (SP 1) or later
System firmware EH340_061 and EM340_061 or later
This update has deferred content requiring a re-IPL to activate enhancements
Power System 570 concurrent node add requires that the system cable be connected in advance (cannot be added concurrently)
Adding new GX adapters concurrently requires that sufficient system memory has been reserved in advance; here are the defaults that may need to be increased:
Power Systems 595: 1 additional per node, 2 maximum, if slots are available
Power Systems 570: 1 additional maximum, if an empty slot is available
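A small planning aid for the defaults above: given the model and number of nodes, how many GX adapters can be added concurrently, and how much memory to reserve for them, using the 128 MB-per-adapter figure this deck cites in the add/upgrade guidelines. The function and counts are illustrative, not an official sizing tool:

```python
# Illustrative GX adapter reservation planning, following the per-model
# defaults in the slide and the 128 MB-per-adapter figure cited later
# in the deck. Treat this as a sketch, not an official sizing rule.

GX_RESERVE_MB = 128  # memory required per GX adapter (per the deck)

def max_concurrent_gx_adds(model, nodes):
    if model == "595":
        return min(nodes, 2)   # 1 additional per node, 2 maximum
    if model == "570":
        return 1               # 1 additional maximum
    raise ValueError("CCM applies to Power 595 and 570 only")

model, nodes = "595", 3
adds = max_concurrent_gx_adds(model, nodes)
print(adds, adds * GX_RESERVE_MB)  # 2 adapters, 256 MB to reserve
```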
Planning and Prerequisites
System configurations should allow for free processors, unused system memory and redundant I/O paths
When a processor node is powered off, all of its resources need to be shifted to another node
Unlicensed Capacity on Demand processors and memory will be used by the system during node evacuation
System CPU and memory usage can be reduced through dynamic reallocation of running partitions, or by shutting down those that are unnecessary
Insufficient processor and memory capacity, or lack of redundant I/O paths, may force shutdown of some or all logical partitions on the system
Planning and Prerequisites
Preparation for concurrent maintenance begins when the system is ordered and configured
1. Customers decide how much system to buy and how to configure it to take advantage of CCM capability
2. Customers decide whether to use concurrent maintenance techniques or schedule planned outages for upgrades and repairs
Planning Guides for CEC Concurrent Maintenance
Follow these guidelines:
The system should have enough unused CPU and memory to allow a node to be taken off-line for repair
All critical I/O resources should be configured using multi-path I/O solutions allowing failover using redundant I/O paths
Redundant physical or virtual I/O paths must be configured through different nodes and GX adapters
Note the Exposure on Node 3
Partition Considerations
CCM is concurrent from the system point of view, but may not be completely transparent to logical partitions
Temporary reduction in CPU, memory and I/O capabilities could impact performance
To take full advantage of concurrent node add or memory upgrade/add, partition profiles should reflect higher maximum processor and memory values than exist before the upgrade
New resources can then be added dynamically after the add or upgrade
Note: higher partition maximum memory values will increase system memory set aside for Partition Page Tables
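The page table note above can be made concrete. On POWER systems of this era the Hardware Page Table (HPT) is commonly sized at roughly 1/64 of the partition's *maximum* memory, rounded up to a power of two; treat that ratio as an assumption for illustration, not a specification:

```python
# Illustrative estimate of memory set aside for a partition's Hardware
# Page Table (HPT). The 1/64-of-maximum-memory ratio and power-of-two
# rounding are common rules of thumb, assumed here for the example.

def hpt_estimate_mb(max_mem_mb, ratio=64):
    raw = max_mem_mb / ratio
    size = 1
    while size < raw:          # round up to the next power of two
        size *= 2
    return size

# Raising a partition's maximum from 32 GB to 64 GB doubles the reservation:
print(hpt_estimate_mb(32 * 1024))   # 512 (MB)
print(hpt_estimate_mb(64 * 1024))   # 1024 (MB)
```

This is why the slide warns that raising maximum memory values ahead of an upgrade has a cost: the hypervisor reserves the page table space whether or not the memory is ever dynamically added.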
Partition Considerations
I/O resource planning
To maintain access to data, multi-path I/O solutions must be utilized (e.g., MPIO, SDDPCM, PowerPath, HDLM)
Redundant I/O adapters must be located in different I/O expansion units that are attached to different GX adapters located in different nodes
This can be either directly attached I/O or virtual I/O provided by dual VIO servers housed in different nodes
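The redundancy rule above is checkable before maintenance: every critical disk should have paths through at least two distinct nodes (and GX adapters), so that powering off any one node still leaves a path. A hedged sketch, with a made-up path inventory (in practice you would gather this from your multi-path software):

```python
# Sketch of the redundancy check implied above: flag any disk whose
# I/O paths all traverse a single node. The path inventory below is
# invented for the example; real data would come from MPIO tooling.

def single_node_disks(paths):
    """paths: dict disk -> list of (node, gx_adapter) tuples.
    Returns disks whose paths all go through one node."""
    exposed = []
    for disk, routes in paths.items():
        if len({node for node, _ in routes}) < 2:
            exposed.append(disk)
    return exposed

paths = {
    "hdisk0": [("node0", "gx0"), ("node1", "gx2")],  # redundant across nodes
    "hdisk1": [("node1", "gx2"), ("node1", "gx3")],  # both paths via node1
}
print(single_node_disks(paths))  # ['hdisk1'] would lose access if node1 goes
```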
Partition Considerations
Check system settings for the server
If shutting down all partitions becomes necessary, make sure the system does not power off automatically during the repair action, which would prolong the repair
Leave this box unchecked
IBM i Planning Considerations
To allow for a hot node repair/memory upgrade to take place with i partitions running, the following PTFs are also required:
V5R4: MF45678
V6R1: MF45581
If the PTFs are not activated, the IBM i partitions have to be powered off before the CCM operation can proceed.
Rules for Concurrent Maintenance Operations
Guidelines for CCM operations
Only one operation at a time from only one HMC
A second CCM operation cannot be started until the first one has completed successfully
All CCM operations except a 570 GX adapter add must be done by IBM service personnel
On both the 595 and 570, you must have at least two nodes for hot node repair or hot memory add/upgrade
You cannot evacuate a 570 node that has an active system clock
Enable service processor redundancy on a 570 before starting a hot node add, except on a single-node server
Both service processors on a 595 must be functioning
The Display Service Effect utility must be run by the system administrator before a hot repair or hot memory add/upgrade
Ensure that the system is not in energy savings mode prior to concurrent node add, memory upgrade or concurrent node repair
Guidelines for All Concurrent Maintenance Operations
With proper planning and configuration, enterprise-class Power servers are designed for concurrent add/upgrade or repair
However, changing the hardware configuration or the operational state of electronic equipment may cause unforeseen impacts to the system status or running applications
Some highly recommended precautions to consider:
Schedule concurrent upgrades or repairs during off-peak operational hours
Move business-critical applications to another server using the Live Partition Mobility feature or stop them
Back up critical application and system state information
Checkpoint databases
Guidelines for All Concurrent Maintenance Operations
Features and capabilities that don’t support CCM
Systems clustered using RIO-SAN technology (This technology is used only by i users clustering using switchable towers and virtual OptiConnect technologies)
Systems clustered using InfiniBand technology (This capability is typically used by High Performance Computing clients using an InfiniBand switch)
I/O Processors (IOPs) used by i partitions do not support CCM (Any i partitions that have IOPs assigned must either have the IOPs powered off or the partition must be powered off)
16 GB memory pages, also known as huge pages, do not support memory relocation (Partitions with 16 GB pages must be powered off to allow CCM)
Guidelines for Concurrent Add/Upgrade
For adding or upgrading
All serviceable hardware events must be repaired and closed before starting an upgrade
Firmware enforces node and GX adapter plugging order
Only the next node position or GX adapter slot based on plugging rules will be available
For 570 node add, make sure the system cable is in place before starting
If the concurrent add includes a node plus a GX adapter, install the adapter in the node first, then add the entire unit
This way, the 128 MB of memory required by the adapter will come from the new node when it is powered on
Guidelines for Concurrent Add/Upgrade
For adding or upgrading
For multiple upgrades that include new I/O expansion drawers, as well as node or GX adapter adds, the concurrent node or GX adapter add must be completed first
The I/O drawer can then be added later as a separate concurrent I/O drawer add (a sequential operation)
Guidelines for Concurrent Repair
Repair with same FRU type:
The node repair procedure doesn’t allow for any additional action beyond the repair
The same FRU type must be used to replace a failing FRU, and no additional hardware can be added or removed during the procedure
For example, if a 4GB DIMM fails, it must be replaced with a 4GB DIMM – not a 2GB or 8GB DIMM
A RIO GX adapter must be replaced with a RIO GX adapter, not an InfiniBand GX adapter
Customer Responsibilities
The customer is responsible for deciding whether to do a concurrent upgrade or repair or to schedule a maintenance window
The customer must determine whether all prerequisites have been met and the configuration will support a node evacuation, if necessary
In the case of an upgrade, the World-wide Customized Install Instructions (WCII) for the order will ship assuming a non-concurrent installation
The WCII will tell you how to obtain instructions for a concurrent upgrade
All repairs are the responsibility of IBM service personnel
Customers are responsible for adding new 570 GX adapters
Display Service Effect Utility
The Display Service Effect utility needs to be run by the customer prior to concurrent hot node repair or memory add/upgrade
The utility shows memory, CPU and I/O issues that must be addressed before a node evacuation
The utility runs automatically at the start of a hot repair or upgrade, but it can be run manually ahead of time to determine whether the repair or upgrade will be concurrent
Ideally, this utility should be run by the systems administrator before the arrival of the IBM service representative
The DSE utility is not required if no node evacuation is needed, such as during a hot GX adapter add or a hot node add
Starting the Display Service Effect Utility
Note: The Power On/Off Unit dialog box is used only to access the Display Service Effect utility
Select Display Service Effect
Note: The Power On/Off Unit dialog box is used only to access the Display Service Effect utility
Select Yes – Confirm Advanced Power Control Command
This is a misleading message: it does NOT mean you’re about to power off your system!
Display Service Effect Summary Page
Look at the details by clicking the tabs
Tips on How to View Data
When working with the informational and error messages shown on the Node Evacuation Summary Status panel, work with the Platform and Partition messages first (the first and last tabs)
The impacts indicated in these messages may lead to the shutdown of partitions on the system, for example because a partition is using I/O resources in the target node
The shutdown of a partition will free up memory and processor resources
If a partition must be shut down, use the Recheck button to re-evaluate the memory and processor resources
Platform – Informational Messages
Check both Errors and Informational Messages
Memory Impacts
Processor Impacts
Partition Impacts – I/O Related Conflicts
“White Glove” Tracking Program
During the next several months, IBM will track concurrent maintenance operations
In the US, potential concurrent CEC add MES orders and repairs will be pro-actively tracked during the feedback period
In the NE, SW IOTs and CEEMA GMT (EMEA), they will use the "Install PMH" or "Repair PMH" process to request feedback from SSRs
For all geographies, SSRs who perform CCM upgrades, adds and repairs are asked to complete the feedback form located at the following URL:
http://w3.rchland.ibm.com/~cuii/CCM/CCMfeedback_WCII.html
Summary
CCM gives customers new options for maintaining availability
Careful advance planning is required to make it work
Prerequisites include creating CPU and memory reserves to allow CCM, as well as configuring redundant I/O paths or preparing for loss of I/O routes during concurrent maintenance
Customers must run the Display Service Effect utility to determine whether a concurrent repair or memory add/upgrade can be initiated
If concurrent repairs are not possible, a regular maintenance window must be scheduled
Required Reading
Technical white paper, “IBM Power 595 and 570 Servers CEC Concurrent Maintenance Technical Overview”, available at:
ftp://ftp.software.ibm.com/common/ssi/sa/wh/n/pow03023usen/POW03023USEN.PDF
CEC Concurrent Maintenance article in IBM System Hardware Information Center available at:
http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/ared3/ared3kickoff.htm
Recommended