
Power Systems

Clustering with high-performance computing by using InfiniBand hardware

Note: Before using this information and the product it supports, read the information in Notices, the IBM Systems Safety Notices manual, G229-9054, and the IBM Environmental Notices and User Guide, Z125–5823.

This edition applies to IBM Power Systems servers that contain the POWER6 processor and to all associated models.

This edition applies to IBM AIX Version 6.1, to IBM AIX 5L Version 5.3, and to all subsequent releases until otherwise indicated in new editions.

© Copyright IBM Corporation 2009.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Safety notices . . . vii
Clustering with high-performance computing by using InfiniBand hardware . . . 1
Overview of clustering systems by using InfiniBand hardware . . . 1
High-level view of the cluster implementation process . . . 2
Cluster information resources . . . 3
Fabric communications . . . 7
IBM GX or GX+ host channel adapters . . . 9
Logical switch naming convention . . . 11
Host channel adapter statistics counter . . . 12
Vendor switches . . . 12
QLogic switches supported by IBM . . . 12
Cables . . . 12
Subnet manager . . . 13
POWER Hypervisor . . . 13
Device drivers . . . 14
IBM host stack . . . 14
Management subsystem function overview . . . 14
Management subsystem integration recommendations . . . 15
Management subsystem high-level functions . . . 15
Management subsystem overview . . . 17
Switch chassis viewer . . . 20
Switch command-line interface . . . 20
Fabric manager . . . 20
Fast Fabric Toolset . . . 21
Cluster Systems Management . . . 22
Hardware Management Console . . . 22
Service processor . . . 23
Network Time Protocol . . . 24
Fabric viewer . . . 24
E-mail notifications . . . 24
Operating system . . . 25
Management subsystem networks . . . 25
Vendor log flow to CSM event management . . . 26
Supported components in an HPC cluster . . . 27
Planning for clusters . . . 28
Getting started with cluster planning . . . 28
Cluster planning overview . . . 29
Required level of support, firmware, and devices . . . 30
Server planning . . . 32
Planning InfiniBand network cabling and configuration . . . 32
Planning for QLogic InfiniBand switch configurations . . . 32
Planning for maximum transfer units (MTUs) . . . 34
Planning for global identifier prefixes . . . 35
Configuring an IBM GX host channel adapter . . . 36
IP subnet addressing restriction . . . 37
Management subsystem planning . . . 37
Planning CSM as your systems management application . . . 39
Planning for QLogic fabric management applications . . . 40
Planning the Fabric Manager and Fabric Viewer . . . 40
Planning the Fast Fabric Toolset . . . 45
Planning for the fabric management server . . . 47
Planning event monitoring with QLogic and CSM . . . 48
Planning to run remote commands with QLogic from the CSM/MS . . . 49
Frame planning . . . 50
Planning installation flow . . . 50
Key installation points . . . 50
Installation responsibilities by organization . . . 50
Installation responsibilities of units and devices . . . 51
Order of installation . . . 52
Installation coordination work sheet . . . 57
Planning for an HPC MPI configuration . . . 58
Planning for 12X host channel adapter connections . . . 58
Tips for planning cluster hardware . . . 58
Planning check list . . . 59
Planning work sheets . . . 60
Cluster summary work sheet . . . 61
Frame and rack planning work sheet . . . 63
Server planning work sheet . . . 64
QLogic switch planning work sheets . . . 66
Planning work sheet for 24-port switches . . . 67
Planning work sheet for switches with more than 24 ports . . . 68
QLogic Fabric Management work sheets . . . 72
Cluster Systems Management planning work sheet . . . 77
Installing a high-performance computing (HPC) cluster with an InfiniBand network . . . 80
Overview of the installation tasks . . . 81
Installation responsibilities for the IBM service representatives . . . 81
Cluster expansion or partial installation . . . 81
Setting up site power, cooling, and floor . . . 82
Installing and configuring the management subsystem . . . 83
Installing and configuring the management subsystem for a cluster expansion or addition . . . 85
Installing and configuring service VLAN devices . . . 87
Installing the Hardware Management Console . . . 87
Installing the CSM Management Server . . . 89
Installing operating system installation servers . . . 90
Installing the fabric management server . . . 91
Setting up remote logging . . . 95
Setting up the CSM/MS . . . 96
Setting up a remote log for the fabric management server . . . 100
Using syslog for CSM/MS on RedHat Linux . . . 103
Setting up remote command processing . . . 103
Installing and configuring servers with management consoles . . . 106
Installing and configuring the cluster server hardware . . . 107
Installing and configuring server hardware . . . 108
Introduction to installing the operating system and configuring the cluster servers . . . 110
Installing and configuring servers when expanding or adding to an existing cluster . . . 111
Installing the operating system and configuring the cluster servers . . . 111
Installing AIX . . . 112
Installing Linux . . . 113
Installing and configuring vendor InfiniBand switches . . . 116
Installing and configuring InfiniBand switches when expanding or adding to an existing cluster . . . 116
Installing and configuring the InfiniBand switch . . . 116
Key points for installing and configuring the InfiniBand switch . . . 117
Installing and configuring InfiniBand switches . . . 118
Attaching cables to the InfiniBand network . . . 120
Cabling the InfiniBand network for expansion . . . 121
Cabling the InfiniBand network . . . 121
Verifying the InfiniBand network topology and operation . . . 122
Installing or replacing an InfiniBand GX host channel adapter . . . 124
Deferring replacement of a failing host channel adapter . . . 126
Verifying the installed InfiniBand network fabric in AIX or Linux . . . 127
Verifying the GX HCA connectivity by using AIX . . . 127
Verifying the GX HCA to InfiniBand fabric connectivity by using Linux . . . 127
Fabric verification . . . 127
Fabric verification responsibilities . . . 128
Reference documentation for the fabric verification procedures . . . 128
Fabric verification tasks . . . 128
Verifying the fabric operation . . . 128
Runtime errors . . . 129
Managing the cluster fabric . . . 129
Cluster fabric management flow . . . 129
Cluster fabric management components and their use . . . 131
Cluster Systems Management . . . 131
QLogic Fast Fabric Toolset . . . 132
Cluster fabric management tasks . . . 133
Monitoring the fabric for problems . . . 134
Monitoring fabric logs from CSM/MS . . . 134
Health checking . . . 135
Setting up periodic fabric health checking . . . 136
Output files for health check . . . 138
Interpreting .diff files . . . 141
Querying status . . . 143
Remotely accessing QLogic management tools and commands from CSM/MS . . . 143
Remotely accessing QLogic switches from CSM/MS . . . 144
Updating code . . . 145
Updating Fabric Manager code . . . 146
Updating switch chassis code . . . 146
Finding and interpreting configuration changes . . . 147
Tips: Using iba_report . . . 147
Servicing clusters . . . 149
Getting started with servicing clusters . . . 149
Responsibilities for servicing clusters . . . 150
Fault reporting mechanisms . . . 150
Fault diagnosis approach . . . 151
Types of fabric events . . . 151
Isolating link problems . . . 153
Scenarios: Restarting or powering on . . . 154
Network Time Protocol . . . 154
Symptoms of problems . . . 154
Finding the appropriate service procedure . . . 157
Capturing data for fabric diagnosis . . . 159
Collect subnet manager and switch chassis data . . . 160
Capturing switch CLI output . . . 161
Capturing problem data for Fabric Manager and Fast Fabric software . . . 161
Mapping fabric devices . . . 162
Mapping of IBM HCA GUIDs to physical HCAs . . . 162
Finding devices based on a known logical switch . . . 165
Finding devices based on a known logical HCA . . . 166
Finding devices based on a known physical switch port . . . 168
Finding devices based on a known ib interface (ibx/ehcax) . . . 170
IBM GX HCA physical port mapping based on device number . . . 172
Interpreting switch log formats from vendors . . . 172
Log severities . . . 172
Switch chassis management log format . . . 173
Subnet manager log format . . . 174
Diagnosing problems with a cluster . . . 175
Diagnosing link errors . . . 175
Diagnosing and repairing switch component problems . . . 178
Diagnosing and repairing IBM system problems . . . 178
Diagnosing configuration changes . . . 178
Checking for hardware problems affecting the fabric . . . 179
Checking for fabric configuration and functional problems . . . 179
Checking InfiniBand configuration in AIX . . . 180
Verifying that HCAs are visible to the logical partitions . . . 180
Verifying that all HCAs are available to the logical partitions . . . 181
Verifying that the IP maximum transfer unit (MTU) is configured correctly . . . 181
Verifying that the network interfaces are recognized as up and available . . . 181
Checking system configuration in the AIX operating system . . . 182
Verifying the availability of processor resources . . . 182
Verifying the availability of memory resources . . . 182
Checking InfiniBand configuration in Linux . . . 182
Verifying that HCAs are visible to the logical partitions . . . 182
Verifying that all HCAs are available to the logical partitions . . . 183
Verifying that the IP maximum transfer unit (MTU) is configured correctly . . . 184
Verifying that the network interfaces are recognized as up and available . . . 184
Checking system configuration with Linux . . . 185
Verifying the availability of processor resources . . . 185
Verifying the availability of memory resources . . . 185
Checking multicast groups . . . 185
Diagnosing swapped HCA ports . . . 186
Diagnosing swapped switch ports . . . 187
Diagnosing performance problems . . . 187
Diagnosing and recovering ping problems . . . 188
Diagnosing application crashes . . . 188
Diagnosing management subsystem problems . . . 189
Determining problems with event management or remote syslogging . . . 189
Reconfiguring the CSM event management . . . 196
Recovering from problems with clusters . . . 199
Recovering ibx interfaces . . . 199
Recovering a single ibx interface in AIX . . . 199
Recovering all the ibx interfaces in a logical partition in the AIX operating system . . . 199
Recovering an ibx interface tcp_sendspace and tcp_recvspace . . . 200
Recovering ml0 in AIX . . . 200
Recovering InfiniBand Connection Manager (ICM) in AIX . . . 200
Recovering ehcax interfaces . . . 201
Recovering a single ibx interface in Linux . . . 201
Recovering all of the ibx interfaces in a logical partition in the Linux operating system . . . 201
Recovering to 4 KB maximum transfer units in the AIX operating system . . . 201
Configuring the subnet manager for 4 KB MTU . . . 201
Setting the host channel adapters (HCAs) to 4 KB MTU . . . 202
Verifying the 4 KB MTU configuration . . . 203
Recovering to 4 KB MTUs in the Linux operating system . . . 204
Configuring the subnet manager for 4 KB MTU . . . 204
Setting up the host channel adapters (HCAs) to 4 KB MTU . . . 205
Verifying the 4 KB MTU configuration . . . 205
Reestablishing a health check baseline . . . 207
Verifying link FRU replacements . . . 207
Verifying repairs and configuration changes . . . 207
Restarting the cluster . . . 208
Restarting or powering off an IBM system . . . 209
Counting devices . . . 210
Counting switches . . . 210
Counting logical switches . . . 211
Counting host channel adapters . . . 211
Counting end ports . . . 212
Counting ports . . . 212
Counting subnet managers . . . 212
Example: Counting devices . . . 212
Handling emergency power off situations . . . 213
Monitoring and checking for fabric problems . . . 214
Appendix. Notices . . . 215
Trademarks . . . 216
Electronic emission notices . . . 216
Class A Notices . . . 216
Terms and conditions . . . 220

Safety notices

Safety notices may be printed throughout this guide:

• DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
• CAUTION notices call attention to a situation that is potentially hazardous to people because of some existing condition.
• Attention notices call attention to the possibility of damage to a program, device, system, or data.

World Trade safety information

Several countries require the safety information contained in product publications to be presented in their national languages. If this requirement applies to your country, a safety information booklet is included in the publications package shipped with the product. The booklet contains the safety information in your national language with references to the U.S. English source. Before using a U.S. English publication to install, operate, or service this product, you must first become familiar with the related safety information in the booklet. You should also refer to the booklet any time you do not clearly understand any safety information in the U.S. English publications.

German safety information

The product is not suitable for use at visual display workstations as defined in § 2 of the Bildschirmarbeitsverordnung (the German ordinance on work with visual display units).

Laser safety information

IBM® servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.

Laser compliance

All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1 laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser product. Consult the label on each part for laser certification numbers and approval information.

CAUTION: This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:

• Do not remove the covers. Removing the covers of the laser product could result in exposure to hazardous laser radiation. There are no serviceable parts inside the device.
• Use of the controls or adjustments or performance of procedures other than those specified herein might result in hazardous radiation exposure.

(C026)

CAUTION: Data processing environments can contain equipment transmitting on system links with laser modules that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical fiber cable or open receptacle. (C027)

CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)

CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following information: laser radiation when open. Do not stare into the beam, do not view directly with optical instruments, and avoid direct exposure to the beam. (C030)

Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE

The following comments apply to the IBM servers that have been designated as conforming to NEBS (Network Equipment-Building System) GR-1089-CORE:

The equipment is suitable for installation in the following:

• Network telecommunications facilities
• Locations where the NEC (National Electrical Code) applies

The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect these interfaces metallically to OSP wiring.

Note: All Ethernet cables must be shielded and grounded at both ends.

The ac-powered system does not require the use of an external surge protection device (SPD).

The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.

Clustering with high-performance computing by using InfiniBand hardware

You can use this information to guide you through the process of planning, installing, managing, and servicing high-performance computing (HPC) clusters that use InfiniBand hardware.

This information serves as a navigation aid through the processes and publications that are required to install hardware units, firmware, operating systems, software and applications that comprise an HPC cluster environment. This information also provides configuration settings and the installation order for the cluster environment. Typical management and service procedures are also provided.

This information is not intended to replace the existing IBM or vendor-supplied publications for the various hardware units, firmware, operating systems, software or applications. These publications are referenced throughout this information.

Hardware Information Web site at http://publib.boulder.ibm.com/infocenter/systems/scope/hw/topic/iphdx/power_systems.htm .

Overview of clustering systems by using InfiniBand hardware

This information provides planning and installation information to help guide you through the process of installing a cluster fabric that incorporates InfiniBand switches.

IBM server hardware supports clustering through InfiniBand host channel adapters (HCAs) and switches. Information about how to manage and service a cluster by using InfiniBand hardware is included in this information.

The following figure shows servers that are connected in a cluster configuration with InfiniBand switch networks (fabric). The servers in these networks can be connected through switches that use IBM GX HCAs. In System p® Blade servers, the HCAs are based on PCI Express® (PCIe).

Notes:

1. Switch refers to the InfiniBand technology switch unless otherwise noted.
2. Not all configurations support the following network configuration. See your IBM sales information for supported configurations.

Figure 1. InfiniBand network with four switches and four servers connected

High-level view of the cluster implementation process

The following table provides a high-level view of the cluster implementation process and the information required to effectively plan, install, manage, and service your HPC clusters that use InfiniBand hardware.

Table 1. High-level view of the cluster implementation process and associated information

| Content | Description |
|---|---|
| “Overview of clustering systems by using InfiniBand hardware” on page 1 | Provides references to information resources, an overview of cluster components, and the supported component levels. |
| “Cluster information resources” on page 3 | Provides a list of the various information resources for the key components of the cluster fabric and where they can be obtained. These information resources are used extensively during your cluster implementation, so it is important to collect the required documents early in the process. |
| “Fabric communications” on page 7 | Provides a description of the fabric data flow. |
| “Management subsystem function overview” on page 14 | Provides a description of the management subsystem. |
| “Supported components in an HPC cluster” on page 27 | Provides a list of the supported components and pertinent features, and the minimum shipment levels for software and firmware. |
| “Planning for clusters” on page 28 | Provides information on planning for the cluster and the fabric. |
| “Cluster planning overview” on page 29 | Provides navigation through the planning process. |
| “Required level of support, firmware, and devices” on page 30 | Provides the minimum ship level for firmware and devices and provides a web site to obtain the latest information. |
| “Server planning” on page 32, “Planning InfiniBand network cabling and configuration” on page 32, and “Management subsystem planning” on page 37 | Provides the planning requirements for the main subsystems. |
| “Planning installation flow” on page 50 | Provides guidance in how the various tasks relate to each other and who is responsible for the various planning tasks for the cluster. This information also illustrates how certain tasks are prerequisites to other tasks. This will assist you in coordinating the activities of the installation team. |
| “Planning work sheets” on page 60 | Provides planning work sheets that are used to plan the important aspects of the cluster fabric. If you are using your own work sheets, they should cover the items provided in these work sheets. |
| Other planning | |
| “Installing a high-performance computing (HPC) cluster with an InfiniBand network” on page 80 | Provides procedures for installing the cluster. |
| “Managing the cluster fabric” on page 129 | Provides the best practices and tasks for managing the fabric. |
| “Servicing clusters” on page 149 | Provides high-level service tasks. This is intended to be a launch point for servicing the cluster fabric components. |
| Planning installation work sheets | Provides blank copies of the planning work sheets for easy printing. |

Cluster information resources

Information resources from IBM and QLogic™ can help you plan for clustering on your InfiniBand network.

The following subtopics list the documentation for your cluster environment and where the documentation can be obtained. They also describe how the information is used relative to the clustering tasks: planning, installing, managing, and servicing.

General cluster information resources

The following table lists general cluster resources and the tasks for which you would use the documentation.

Table 2. General cluster resources

| Item | Document | Planning | Installing | Managing and servicing |
| --- | --- | --- | --- | --- |
| High-performance clustering | This document | x | x | x |
| IBM clusters with the InfiniBand switch Web site | Readme file for IBM clusters with the InfiniBand Switch. See the IBM clusters with the InfiniBand switch Web site. | | | |
| QLogic | QLogic Best Practices Cluster Guide is initially available from QLogic support. See the IBM Clusters with the InfiniBand Switch Web site for any updates to availability on a QLogic Web site. | x | x | x |
| InfiniBand Architecture | InfiniBand architecture documents and standard specifications are available from the InfiniBand Trade Association. | | | |
| HPC Central wiki and HPC Central forum | The HPC Central wiki enables collaboration between customers and IBM teams. The HPC Central wiki also links to the HPC Central forum where customers can post questions and comments. | x | x | x |

Note: QLogic uses the product name Silverstorm in its documentation.

Cluster hardware information resources

The following table lists cluster hardware resources and the tasks for which you would use the documentation.

IBM Power Systems™ documentation is available in the IBM Power Systems Hardware Information Center.

The QLogic documentation is initially available from QLogic support. See the IBM Clusters with the InfiniBand Switch Web site for updates that are available on the QLogic Web site.

Any exceptions to the location of information resources for cluster hardware as previously described have been noted in the Cluster hardware information resources table.

Table 3. Cluster hardware resources

Function Document Planning Installing Managing and servicing

Site planning for all IBM systems Site preparation and physical planning x

POWER6® systems Site and hardware planning x

PCI adapters x x

8203-E4A Installing the IBM Power 520 Express (8203-E4A)

Removal and replacement procedures for the Power 520 Express (8203-E4A)

x x

8204-E8A Installing the IBM Power 550 Express (8204-E8A)

Removal and replacement procedures for the Power 550 Express (8204-E8A)

x x

9125-F2A Installation of a 9125-F2A is completed by an IBM service representative. Contact your next level of support.

IBM® Power 575 (9125-F2A) removal and replacement procedures

x x

Logical partitioning for all systems Logical partitioning x

Installation Instructions for IBM logical partitions on System i® and System p

x

BladeCenter® JS22 Express Planning, Installation, and Service Guide x x x

IBM GX HCA Custom Installation Contact your next level of support for information on the custom installation instructions for each HCA feature.

x x x

BladeCenter JS22 Express HCA Users guide for 1350 x x x

Pass-through module 1350 documentation x x x

Fabric management server IBM System x® 3550 and 3650 documentation

Management node HCA HCA vendor documentation x x x

QLogic switches [Switch model] Users Guide x x x

[Switch model] Quick Setup Guide x x

QLogic InfiniBand Cluster Planning Guide x x

QLogic InfiniBand Cluster Troubleshooting Guide x

QLogic 9000 CLI Reference Guide x x

Note: QLogic uses the product name Silverstorm in its documentation.

Cluster management software information resources

The following table lists cluster management software resources and the tasks for which you would use the documentation.

IBM Power Systems documentation is available in the IBM Power Systems Hardware Information Center.

The IBM CSM documentation is available as follows:
- For the product library, go to Cluster Systems Management (CSM).
- For online documentation, go to IBM Cluster Information Center.

The QLogic documentation is initially available from QLogic support. See the IBM Clusters with the InfiniBand Switch Web site for updates that are available on the QLogic Web site.

Table 4. Cluster management software resources

Function Document Planning Installing Managing and servicing

QLogic Subnet Manager Fabric Manager and Fabric Viewer Users Guide x x x

QLogic Fast Fabric Toolset Fast Fabric Toolset Users Guide x x x

QLogic InfiniServ Stack InfiniServ Fabric Access Software Users Guide x x x

Hardware Management Console (HMC) Installing and configuring the Hardware Management Console

x x

Managing the Hardware Management Console x

Cluster Systems Management (CSM) Cluster Systems Management: Planning and Installation Guide

x x

Cluster Systems Management: Administration Guide

x

Cluster Systems Management: Command and Technical Reference

x

Cluster software and firmware information resources

The following table lists cluster software and firmware resources and the tasks for which you would use the documentation.

Table 5. Cluster software and firmware resources

Function Document Planning Installing Managing and servicing

AIX® AIX Information Center x x x

Linux Obtain information from your Linux distribution source

x x x

IBM HPC Clusters Software GPFS™: Concepts, Planning, and Installation Guide

x x

GPFS: Administration and Programming Reference

x x

GPFS: Problem Determination Guide x

GPFS: Data Management API Guide x

Tivoli® Workload Scheduler LoadLeveler®: Installation Guide

x x

Tivoli Workload Scheduler LoadLeveler: Using and Administering

x

Tivoli Workload Scheduler LoadLeveler: Diagnosis and Messages Guide

x x

Parallel Environment: Installation x x

Parallel Environment: Messages x x

Parallel Environment: Operation and Use, Volumes 1 and 2

x

Parallel Environment: MPI Programming Guide

x

Parallel Environment: MPI Subroutine Reference

x

The IBM HPC Clusters Software Information can be found at the IBM Cluster Information Center.

Fabric communications

This information provides a description of fabric communications and the main components that are part of the application data flow.

For more specific documentation references, see “Cluster information resources” on page 3.

The following items are the main components in the fabric data flow.
- IBM GX or GX+ host channel adapter
- Vendor switches
- Cables
- Subnet manager
- POWER Hypervisor™
- IBM device drivers
- Non-IBM device drivers
- IBM host stack

The following figure shows the main components of the fabric data flow.

Figure 2. Main components in fabric data flow

The following figure shows the high-level software architecture.

The following figure shows a simple InfiniBand configuration illustrating the tasks, the software layers, the windows, and the hardware. The host channel adapter (HCA) shown is intended to be a single HCA card with four physical ports. However, the figure could also be interpreted as a collection of physical HCAs and ports; for example, two cards, each with two ports.

Figure 3. High-level software architecture

To gain a better understanding of InfiniBand fabrics, see the following documentation:
- The InfiniBand standard specification from the InfiniBand Trade Association.
- Documentation from the switch vendor.

IBM GX or GX+ host channel adapters

The IBM GX or GX+ host channel adapter (HCA) provides server connectivity to InfiniBand fabrics.

When you attach an adapter to a GX or GX+ bus, you can gain higher bandwidth to and from the adapter. You can also gain better network performance than by attaching an adapter to a PCI bus. Because of server form factors, including GX or GX+ bus design, each server that supports an IBM GX or GX+ HCA has its own HCA feature.

The GX or GX+ HCA can be shared between logical partitions. Each physical port can be used by each logical partition.

The adapter is logically structured as one logical switch connected to each physical port by using a logical host channel adapter (LHCA) for each logical partition. The following figure shows a single, physical, two-port HCA. This configuration has a single chip that can support two ports. A four-port HCA has two chips with a total of four logical switches, two in each chip.

Figure 4. Simple configuration with InfiniBand

The logical structure affects how the HCA is represented to the subnet manager. Each logical switch and LHCA represent a separate InfiniBand node to the subnet manager on each port. Each LHCA connects to all logical switches in the HCA.

Each logical switch has a port globally unique identifier (GUID) for the physical port and a port GUID for each LHCA. Each LHCA has two port GUIDs, one for each logical switch.

The number of nodes that can be presented to the subnet manager is a function of the maximum number of LHCAs that are assigned. This is a configurable number for POWER6 GX HCAs, and it is a fixed number for GX HCAs in POWER5™ processor-based servers. The POWER Hypervisor communicates with the subnet manager by using the Subnet Management Agent (SMA) function in the POWER Hypervisor.

The POWER6 GX HCA supports a single LHCA by default. In this case, the GX HCA presents each physical port to the subnet manager as a two-port logical switch. One port is connected to the LHCA and the second port is connected to the physical port. The POWER6 GX HCA can also be configured to support up to 16 LHCAs. In this case, the HCA presents each physical port to the subnet manager as a 17-port logical switch with up to 16 LHCAs. Ultimately, the number of ports for a logical switch depends on the number of logical partitions that concurrently use the GX HCA.

The POWER5 GX HCA supports up to 64 LHCAs. In this case, the GX HCA presents each physical port to the subnet manager as a 65-port logical switch. One port connects to the physical port and 64 ports connect to LHCAs. Unlike POWER6 processor-based systems, for POWER5 processor-based systems it does not matter how many LHCAs are defined and used by logical partitions. The number of nodes presented includes all potential LHCAs for the configuration; therefore, each physical port on a GX HCA in a POWER5 processor-based system presents itself as a 65-port logical switch.
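The port counts described above can be summarized in a short sketch. The following Python function is an illustration only and is not part of any IBM or QLogic tool. The function name is hypothetical, and the rule used for POWER6 (one port for each configured LHCA plus one port for the physical port) is an assumption that matches the 2-port and 17-port cases in the text.

```python
def logical_switch_ports(server: str, configured_lhcas: int = 1) -> int:
    """Return the port count of the logical switch presented for one physical port."""
    if server == "POWER5":
        # Fixed: 64 LHCA ports plus 1 physical port, however many LHCAs are used.
        return 64 + 1
    if server == "POWER6":
        if not 1 <= configured_lhcas <= 16:
            raise ValueError("POWER6 GX HCAs support 1 to 16 LHCAs")
        # Assumption: one port per configured LHCA plus 1 physical port.
        return configured_lhcas + 1
    raise ValueError(f"unknown server type: {server}")


print(logical_switch_ports("POWER6"))      # 2  (default, single LHCA)
print(logical_switch_ports("POWER6", 16))  # 17 (maximum LHCA configuration)
print(logical_switch_ports("POWER5"))      # 65
```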

The Hardware Management Console (HMC) that manages the server, in which the HCA is populated, is used to configure the virtualization capabilities of the HCA. For systems that are not managed by an HMC, configuration and virtualization are done by using the Integrated Virtualization Manager (IVM).

Each logical partition is only aware of its assigned LHCA. For each logical partition profile, a GUID is selected with an LHCA. The GUID is programmed in the adapter and cannot be changed.

Figure 5. Two-port GX or GX+ host channel adapter

Since each GUID must be different in a network, the IBM HCA gets a subsequent GUID assigned by the firmware. You can choose the offset that is used for the LHCA. This information is also stored in the logical partition profile on the HMC.

Therefore, when an HCA is replaced, each logical partition profile must be manually updated with the new HCA GUID information. If this step is not performed, the HCA is not available to the operating system.

The following table describes how the HCA resources are allocated to a logical partition. This ability to allocate HCA resources allows multiple logical partitions to share a single HCA. The degree of sharing is driven by your application requirements.

The Dedicated value is only used when you have a single, active logical partition that needs to use all the available HCA resources. You can configure multiple logical partitions to be dedicated, but only one can be active at a time.

When you have more than one logical partition sharing an HCA, you can assign a particular allocation to each one. You can never allocate more than 100% of the HCA across all active logical partitions. For example, four active logical partitions could be set to Medium and two active logical partitions could be set to High: (4 x 1/8) + (2 x 1/4) = 1.

If the requested resource allocation for a logical partition exceeds the available resource for an HCA, the logical partition is not activated. In the previous example with six active logical partitions, if one more logical partition tries to activate and uses the HCA, the logical partition is not activated because the HCA is already 100% allocated.

Table 6. Allocation of HCA resources to a logical partition

| Value | Resulting resource allocation for each adapter |
| --- | --- |
| Dedicated | All the adapter resources are dedicated to the logical partition. This rule is the default for a single logical partition, which is the supported HPC cluster configuration. If you have multiple active logical partitions, you cannot simultaneously dedicate the HCA to more than one active logical partition. |
| High | One-quarter of the maximum adapter resources are dedicated to the logical partition. |
| Medium | One-eighth of the maximum adapter resources are dedicated to the logical partition. |
| Low | One-sixteenth of the maximum adapter resources are dedicated to the logical partition. |
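The allocation rule can be checked with simple arithmetic. The following Python sketch is illustrative only: the names ALLOCATION, total_allocation, and can_activate are hypothetical, the fractions come from Table 6, and the activation check is an assumption based on the rule that the total across active logical partitions cannot exceed 100% of the HCA.

```python
from fractions import Fraction

# Share of the adapter that each value in Table 6 dedicates to a logical partition.
ALLOCATION = {
    "Dedicated": Fraction(1),
    "High": Fraction(1, 4),
    "Medium": Fraction(1, 8),
    "Low": Fraction(1, 16),
}


def total_allocation(active_values):
    """Sum the HCA share used by all active logical partitions."""
    return sum((ALLOCATION[value] for value in active_values), Fraction(0))


def can_activate(active_values, requested_value):
    """Return True if one more partition fits within 100% of the HCA."""
    return total_allocation(active_values) + ALLOCATION[requested_value] <= 1


# Example from the text: four Medium and two High partitions use the whole HCA.
active = ["Medium"] * 4 + ["High"] * 2
print(total_allocation(active))     # 1
print(can_activate(active, "Low"))  # False
```

Exact fractions are used instead of floating-point numbers so that the comparison against 100% is not affected by rounding.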

Logical switch naming convention:

The IBM GX host channel adapters (HCAs) have a logical switch naming convention based on the server type and the HCA type.

The following table shows the logical switch naming convention.

Table 7. Logical switch naming convention

| Server | HCA chip base | Logical switch name |
| --- | --- | --- |
| POWER5 | Any | IBM Logical Switch 1 or IBM Logical Switch 2 |
| System p (POWER6) | First generation | IBM G1 Logical Switch 1 or IBM G1 Logical Switch 2 |
| System p (POWER6) | Second generation | IBM G2 Logical Switch 1 or IBM G2 Logical Switch 2 |
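When fabric reports are processed by a script, the naming convention in Table 7 can be expressed as a small lookup. The following Python sketch is an assumption-laden illustration: the function name and its parameters are hypothetical, and the names it returns are only the strings listed in the table.

```python
def logical_switch_names(server: str, hca_generation: int = 0) -> list:
    """Return the two logical switch names listed in Table 7 for a server type."""
    if server == "POWER5":
        prefix = "IBM"  # any HCA chip base
    elif server == "POWER6":
        if hca_generation not in (1, 2):
            raise ValueError("POWER6 HCA chip base must be generation 1 or 2")
        prefix = f"IBM G{hca_generation}"
    else:
        raise ValueError(f"unknown server type: {server}")
    return [f"{prefix} Logical Switch {number}" for number in (1, 2)]


print(logical_switch_names("POWER5"))
print(logical_switch_names("POWER6", 2))
```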

Host channel adapter statistics counter:

The statistics counters in the IBM GX host channel adapters (HCAs) are only available with HCAs in POWER6 processor-based servers.

You can query the counters by using Performance Manager functions with the Fabric Viewer and the Fast Fabric iba_report command.

While the HCA tracks most of the prescribed counters, it does not have counters for transmit packets or receive packets.

Related reference

“Tips: Using iba_report” on page 147
The iba_report function helps you to monitor the cluster fabric resources.

Vendor switches

Vendor switches are used as the backbone of the communications fabric in an IBM high-performance computing (HPC) cluster that uses InfiniBand technology.

The switches used in IBM HPC clusters are based on the 24-port Mellanox chip.

QLogic switches supported by IBM

IBM supports QLogic switches in high-performance computing (HPC) clusters.

The following QLogic switch models are supported. For more details on the models, see the QLogic documentation and the user's guide for the switch model, which are available at http://www.qlogic.com, or contact QLogic support.

Note: QLogic uses the product name SilverStorm in their product documentation.

- QLogic 9024CU Managed 24-port DDR InfiniBand Switch
- QLogic 9040 48-port DDR InfiniBand Switch
- QLogic 9080 96-port DDR InfiniBand Switch
- QLogic 9120 144-port DDR InfiniBand Switch
- QLogic 9240 288-port DDR InfiniBand Switch

Cables

IBM supports specific cables for high-performance computing (HPC) cluster configurations.

The following table describes the cables that are supported for IBM HPC configurations.

Table 8. Cables for high-performance computing configurations

| System or use | Cable type | Connector type | Length - m (ft) | Source | Comments |
| --- | --- | --- | --- | --- | --- |
| 9125-F2A | 4X DDR, copper | QSFP - CX4 | 6 m (19.7 ft) (passive, 26 AWG); 10 m (32.8 ft) (active, 26 AWG); 14 m (46 ft) (active, 30 AWG) | Vendor | |
| 8204-E8A, 8203-E4A | 12X - 4X DDR width exchanger, copper | CX4 - CX4 | 3 m (9.8 ft); 10 m (32.8 ft) | IBM | Link operates at 4X speed. |
| JS22 | 4X DDR, copper | CX4 - CX4 | Multiple lengths | Vendor | To connect between PTM and switch. |
| Intra-rack | 4X DDR, copper | CX4 - CX4 | Multiple lengths | Vendor | For use between switches. |
| Fabric management server | 4X DDR, copper | CX4 - CX4 | Multiple lengths | Vendor | For connecting the fabric management server to subnets to support host-based subnet manager and Fast Fabric Toolset. |

Subnet manager

The subnet manager is used to configure and manage the communication fabric so that it can send data.

The subnet manager is defined by the InfiniBand standard specification. Management functions are performed inband over the same links as the data.

A host-based subnet manager (HSM) scales better than an embedded subnet manager and has been verified and approved by IBM. The host-based subnet manager can be used to run a fabric management server.

For more information about subnet managers, see the InfiniBand standard specification or vendor documentation.

Related concepts

“Management subsystem function overview” on page 14
This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function.

POWER Hypervisor

The POWER Hypervisor provides an abstraction layer between the hardware and firmware and the operating system instances for GX host channel adapter (HCA) implementations.

POWER Hypervisor provides the following functions to use POWER6 GX HCA implementations.
- UD low latency receive queues
- Large page memory sizes
- Shared receive queues (SRQ)
- Support for more than 16 KB queue pairs. The exact number of queue pairs is determined by cluster size and available system memory.

POWER Hypervisor also contains the Subnet Management Agent (SMA) to communicate with the subnet manager and present the HCA as logical switches with a given number of ports attached to the physical ports and to logical HCAs (LHCAs).

POWER Hypervisor also contains the Performance Management Agent (PMA), which is used to communicate with the performance manager that collects fabric statistics, such as link statistics, including errors and link usage statistics.

For more information about SMA and PMA function, see the InfiniBand architecture documentation.

Related concepts

“IBM GX or GX+ host channel adapters” on page 9
The IBM GX or GX+ host channel adapter (HCA) provides server connectivity to InfiniBand fabrics.

Device drivers

IBM provides device drivers for the AIX operating system. Vendor companies provide device drivers for the Linux operating system.

IBM device drivers

IBM provides device drivers, which are used in the AIX operating system.

Vendor device drivers

Vendor device drivers for the Linux operating system are available from the distributors.

Vendor device drivers are not supported on IBM Power Systems high-performance computing (HPC) clusters that use the AIX operating system. The vendor provides the device driver that is used on Fabric Management Servers.

Related concepts

“Management subsystem function overview”
This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function.

IBM host stack

The high-performance computing (HPC) software stack is supported for System p servers and IBM Power Systems servers that are running AIX or Linux and have HPC clusters.

The vendor host stack is used on fabric management servers.

Related concepts

“Management subsystem function overview”
This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function.

Management subsystem function overview

This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function.

The management subsystem is a collection of servers, consoles, applications, firmware, and networks that work together to provide the following functions.
- Installing and managing the firmware on hardware devices
- Configuring the devices and the fabric
- Monitoring for events in the cluster
- Monitoring status of the devices in the cluster
- Recovering and routing around failure scenarios in the fabric
- Diagnosing the problems in the cluster

IBM and vendor system and fabric management products and utilities can be configured to work together to manage the fabric.

Review the following information to better understand InfiniBand fabrics.
- The InfiniBand standard specification from the InfiniBand Trade Association. Read the information about managers.
- Documentation from the switch vendor. Read the Fabric Manager and Fast Fabric Toolset documentation.

Related concepts

“Cluster information resources” on page 3
Information resources from IBM and QLogic™ can help you plan for clustering on your InfiniBand network.

Management subsystem integration recommendations

Cluster Systems Management (CSM) is the IBM Systems Management tool that provides the integration function for InfiniBand fabric management.

The integration uses existing functions within CSM. The major advantages of CSM in a cluster are as follows.
- The ability to issue remote commands to many nodes and devices simultaneously.
- The ability to consolidate logs and events from many sources in a cluster by using event management.

For more information about the functions and advantages of CSM, see the CSM documentation.

QLogic provides the following switch and fabric management tools.
- Fabric Manager
- Fast Fabric Toolset
- Chassis Viewer
- Switch command-line interface
- Fabric Viewer

Managed switch models are used in IBM Power Systems servers and System p servers that have high-performance computing (HPC) clusters.

Management subsystem high-level functions

Several high-level functions address management subsystem integration.

To address management subsystem integration, functions for management are divided into the following categories.
- Monitoring
- Maintaining
- Diagnosing
- Connecting

Monitoring

You can use the following functions to monitor the state and health of the fabric:

1. A method to get asynchronous events that indicate status and configuration changes into the Cluster Systems Management (CSM) event management subsystem is provided. This function is achieved by forwarding syslog entries from vendor subnet managers and switches to the CSM Management Server (CSM/MS).
2. The remote syslog entries that arrive at the CSM/MS are directed to a file or named pipe based on the priority or severity of the log entry.
   a. Notice and higher entries are sent to a file or named pipe that is monitored by CSM event management.
   b. Information and higher entries are sent to another file for historical and detailed debugging purposes. This is an optional but recommended approach.
3. Event management uses the AIXSyslogSensor log entries for CSM that is running on the AIX operating system, and the ErrorLogSensor log entries for CSM that is running on the Linux operating system.
4. Event management places the notice and higher log entries in the common area for error logs from operating systems.
5. The QLogic Fast Fabric Toolset health checking tools can be used for regularly monitoring the fabric for errors and configuration changes that could lead to performance problems.
   a. A baseline health-check is taken upon installation and configuration change.
   b. The baseline is used to compare against the current state and to indicate any undesired differences.
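The severity-based routing in step 2 can be sketched as follows. This Python example is a minimal illustration and not the CSM or syslog implementation: the destination labels "event_management" and "history" are hypothetical stand-ins for the monitored file or named pipe and the historical file, and the numeric levels are the standard syslog severities (lower numbers are more severe).

```python
# Standard syslog severities; a lower number is a higher severity.
SEVERITY = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def destinations(severity: str) -> list:
    """Return the hypothetical destinations for a remote syslog entry."""
    level = SEVERITY[severity]
    targets = []
    if level <= SEVERITY["notice"]:
        # Notice and higher: file or named pipe monitored by event management.
        targets.append("event_management")
    if level <= SEVERITY["info"]:
        # Information and higher: optional file kept for history and debugging.
        targets.append("history")
    return targets


print(destinations("err"))    # ['event_management', 'history']
print(destinations("info"))   # ['history']
print(destinations("debug"))  # []
```

In this sketch an entry at notice severity or higher reaches both destinations, because "information and higher" includes every severity that "notice and higher" covers.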

Maintaining

You can use the dsh command in CSM to maintain the fabric. The dsh command uses existing vendor command-line tools remotely from the CSM/MS. The tools that provide this function are as follows.
1. Switch chassis command-line interface (CLI) on a managed switch. Some new dsh options and hardware device command profiles allow the dsh command to work with the proprietary switch CLI. For more information, see “Setting up remote command processing” on page 103 and “Remotely accessing QLogic management tools and commands from CSM/MS” on page 143.
2. Subnet manager running in a switch chassis or on a host.
3. Fast Fabric tools running on a fabric management server or host. This host is an IBM System x server that is running on the Linux operating system and the host stack from the vendor.

Diagnosing

You can use the following vendor tools to diagnose and check the health of the fabric:
1. The QLogic Fast Fabric Toolset running on the Fabric Management Server or Host provides the main diagnostic capability.
2. The QLogic Fast Fabric Toolset health-checking tool is important when no clear events indicate a specific problem, but you observe a degradation in performance. The indicators of a problem in the fabric can include:
   - Errors that were previously undetected
   - Configuration changes, including a missing resource
3. You can access vendor diagnostic tools by using the CSM dsh command.
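The idea of comparing a baseline against the current state can be illustrated with a simple set difference. The following Python sketch does not use or reproduce the Fast Fabric Toolset output format: the function name, the result keys, and the sample resource names are all hypothetical, and each fabric resource is assumed to be identified by a unique string.

```python
def compare_to_baseline(baseline: set, current: set) -> dict:
    """Report resources that disappeared or appeared since the baseline."""
    return {
        "missing": sorted(baseline - current),     # in the baseline, not seen now
        "unexpected": sorted(current - baseline),  # seen now, not in the baseline
    }


# Hypothetical resource names, for illustration only.
baseline = {"switch1", "switch2", "server1-hca1", "server2-hca1"}
current = {"switch1", "switch2", "server1-hca1", "server3-hca1"}

print(compare_to_baseline(baseline, current))
```

An empty result in both lists would indicate that the current state matches the baseline; any entry under "missing" corresponds to the missing-resource indicator described above.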

Connecting

For connectivity, the CSM/MS must be on the same cluster virtual local area network (VLAN) as the switches and the management servers running the Subnet Managers and Fast Fabric Tools.

Management subsystem overview

The management subsystem in the high-performance computing (HPC) cluster solution uses an InfiniBand fabric that loosely integrates the typical IBM System p or IBM Power Systems server HPC cluster components with the QLogic components.

The management subsystem can be viewed from several perspectives, including:
- Host views
- Networks
- Functional components
- Users and interfaces

Figure 6 on page 18 shows the use of a host-based subnet manager (HSM), rather than an embedded subnet manager (ESM), running on a switch. Because a host-based Subnet Manager scales better than an embedded subnet manager and has been verified and approved by IBM, the HSM can be used to run a fabric management server.

The following figure illustrates the functions of the management or service subsystem.

The servers are monitored and serviced in the same fashion as for any IBM Power Systems cluster.

The Cluster Systems Management (CSM) Management Server (CSM/MS) is the central point for managing and monitoring operations for the system administrator. The CSM functions for event management of the switch and subnet manager events, and for remote command processing to the switches and fabric management server can be used from the CSM/MS. However, the system administrator can also perform these functions by directly logging on to the switches or to the Fabric Management Servers or hosts.

The following table is a quick reference for the various management hosts or consoles in the cluster, the intended user (for example, the system administrator or switch service provider), and the networks to which the hosts or consoles are connected.

Figure 6. Management subsystem

Table 9. Management subsystem server, consoles, and workstations

| Hosts | Software hosted | Server type | Operating system | User | Connectivity |
|---|---|---|---|---|---|
| CSM/MS | Cluster Systems Management (CSM) | IBM System p; IBM System x | AIX; Linux | System administrator | Cluster virtual local area network (VLAN); Service VLAN |
| Fabric management server | Fast Fabric Tools; Host-based Fabric Manager (recommended); Fabric viewer (optional) | System x | Linux | System administrator; Switch service provider | InfiniBand; Cluster VLAN (same as switches) |
| Hardware Management Console (HMC) | HMC for managing IBM systems | System x | Proprietary | 1. IBM service representative; 2. System administrator | 1. Service VLAN; 2. Cluster VLAN or public VLAN (optional) |
| Switch | Chassis firmware; Chassis viewer; Embedded Fabric Manager (optional) | Switch chassis | Proprietary | System administrator; Switch service provider | Cluster VLAN (Chassis viewer requires public network access) |
| System administrator workstation | System administrator workstation; Fabric viewer (optional); Launch point into management servers (optional). Note: This launch point requires network access to other servers. | User preference | User preference | System administrator | Network access to management servers |
| Service notebook | Serial interface to switch. Note: The serial interface to the switch is not provided by IBM as part of the cluster. It is provided by the user or the site. | Notebook | User experience | Switch service provider; System administrator | RS-232 |
| NTP server | NTP | Site preference | Site preference | Not applicable | Cluster VLAN; Service VLAN |
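The connectivity column of Table 9 can be checked mechanically during planning. The following Python fragment is only a sketch: the dictionary is a hand-copied subset of the table, and the function name and host labels are illustrative assumptions, not part of any IBM or QLogic tool. It checks the requirement that the CSM/MS is on the same cluster VLAN as the switches and the fabric management server.

```python
# Connectivity column of Table 9, keyed by host (labels are illustrative).
CONNECTIVITY = {
    "CSM/MS": {"cluster VLAN", "service VLAN"},
    "fabric management server": {"InfiniBand", "cluster VLAN"},
    "HMC": {"service VLAN"},  # cluster or public VLAN is optional
    "switch": {"cluster VLAN"},
    "NTP server": {"cluster VLAN", "service VLAN"},
}


def hosts_missing_network(connectivity, hosts, network):
    """Return the hosts from `hosts` that are not attached to `network`."""
    return [h for h in hosts if network not in connectivity.get(h, set())]


# The CSM/MS must share the cluster VLAN with the switches and the
# fabric management server.
required = ["CSM/MS", "switch", "fabric management server"]
missing = hosts_missing_network(CONNECTIVITY, required, "cluster VLAN")
print(missing or "all required hosts are on the cluster VLAN")
```

A planner could extend the dictionary with site-specific hosts and rerun the same check for the service VLAN.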

Switch chassis viewer:

The switch chassis viewer is a tool that is used to configure a switch and query the state of the switch.

The following table provides an overview of the switch chassis viewer.

Table 10. Switch chassis viewer overview

Switch chassis viewer Details

Description The switch chassis viewer is a tool for configuring a switch and for querying the state of the switch. It is also used to access the embedded Fabric Manager. Because it only works with one switch at a time, you must use the Fast Fabric Toolset and Cluster Systems Management (CSM) to work with multiple switches or multiple Fabric Managers simultaneously.

Documentation Switch Users Guide

When to use After the configuration setup is completed, the chassis viewer is only used as part of diagnostics after the Fabric Viewer or Fast Fabric tools have been used and have isolated a problem to a chassis.

Host Switch chassis

How to access The chassis viewer is accessible through any browser on a server that is connected to the Ethernet network to which the switch is attached. The switch Internet Protocol (IP) address is the URL that starts the chassis viewer.

Switch command-line interface:

Use the switch command-line interface (CLI) for configuring switches and querying the state of a switch.

The following table provides an overview of the switch command-line interface.

Table 11. Switch CLI overview

Switch command-line interface Details

Description Use the switch CLI to configure switches, to query the state of switches, and to access the embedded Subnet Manager.

Documentation Switch Users Guide

When to use After the configuration setup has been completed, the CLI is used as part of diagnostic testing after the fabric viewer or Fast Fabric tools have been used. However, by using Cluster Systems Management/Management Server (CSM/MS) dsh or Expect commands, remote scripts can access the CLI for creating customized monitoring and management scripts.

Host Switch chassis

How to access v Use Telnet or ssh to access the switch by using its Internet Protocol (IP) address on the cluster virtual local area network (VLAN)

v The Fast Fabric Toolset

v The dsh command from the CSM/MS

v A notebook connected to the RS-232 port
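As a rough illustration of scripted CLI access over ssh, the following Python sketch builds one ssh invocation per switch without running anything. The user name, the addresses (taken from the 192.0.2.0/24 documentation range), and the command string are placeholders chosen for this example; the actual switch CLI commands are described in the Switch Users Guide.

```python
import shlex


def build_ssh_commands(switch_addresses, cli_command, user="admin"):
    """Return one ssh argument list per switch; nothing is executed here."""
    return [["ssh", f"{user}@{address}", cli_command] for address in switch_addresses]


# Hypothetical addresses and a placeholder command string.
commands = build_ssh_commands(["192.0.2.11", "192.0.2.12"], "<switch CLI command>")
for argv in commands:
    # To run a command, pass the list to subprocess.run(argv).
    print(shlex.join(argv))
```

Keeping command construction separate from execution makes it possible to review the generated commands before they reach a production switch.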

Fabric manager:

The Fabric Manager is used to complete basic operations such as fabric discovery, fabric configuration, fabric monitoring, fabric reconfiguration after failure, and reporting problems.

The following table provides an overview of the Fabric Manager.

Table 12. Fabric manager overview

Fabric manager Details

Description The Fabric Manager performs the following basic operations:

v Discovers fabric devices

v Configures the fabric

v Monitors the fabric

v Reconfigures the fabric on failure

v Reports problems

The Fabric Manager has several management interfaces that are used to manage an InfiniBand network. These interfaces include the baseboard manager, performance manager, subnet manager, and fabric executive. All but the fabric executive are described in the InfiniBand architecture. The fabric executive provides an interface between the fabric viewer and the other managers. Each of these managers is required to fully manage a single subnet. If you have a host-based Fabric Manager, up to four Fabric Managers can be on the Fabric Manager Server. Configuration parameters for each instance of the Fabric Manager must be considered. Typically, only a few of the many parameters vary from the default.

A more detailed description of fabric management is available in the InfiniBand standard specification and vendor documentation.

Documentation v QLogic Fabric Manager Users Guide

v InfiniBand standard specification

When to use Fabric management must be enabled to manage the network and send data. Use the switch chassis viewer, command-line interface (CLI), or fabric viewer to interact with the Fabric Manager.

Host v Host-based Fabric Manager is on the fabric management server.

v Embedded Fabric Manager is on the switch.

How to access You can access the Fabric Manager functions from Cluster Systems Management (CSM) by issuing remote commands by using the dsh command to the fabric management server or by using the switch on which the embedded Fabric Manager is running. You can access many instances simultaneously by using the dsh command.

For host-based Fabric Managers, log on to the fabric management server.

For embedded Fabric Managers, use the switch chassis viewer, switch CLI, Fast Fabric Toolset, or fabric viewer to interact with the fabric manager.
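The limit of four Fabric Managers per fabric management server gives a simple lower bound for sizing. The following Python sketch assumes one host-based Fabric Manager instance per subnet and ignores backup or standby instances; both are assumptions made for the example, and the function name is hypothetical.

```python
import math

MAX_MANAGERS_PER_SERVER = 4  # "up to four Fabric Managers" per server


def min_fabric_management_servers(subnet_count):
    """Lower bound on servers, assuming one Fabric Manager instance per subnet."""
    if subnet_count < 0:
        raise ValueError("subnet_count must not be negative")
    return math.ceil(subnet_count / MAX_MANAGERS_PER_SERVER)


for subnets in (1, 4, 5, 8):
    print(subnets, "subnets ->", min_fabric_management_servers(subnets), "server(s)")
```

A site that wants standby Fabric Managers would plan for more servers than this bound suggests.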

Fast Fabric Toolset:

The QLogic Fast Fabric Toolset is a set of scripts that are used to manage switches and to obtain information about the switch status.

The following table provides an overview of the Fast Fabric Toolset.

Table 13. Fast Fabric Toolset overview

Fast Fabric Toolset Details

Description Fast Fabric tools are a set of scripts that provide access to switches and the various managers to connect with many switches and managers simultaneously to obtain useful status or information. Additionally, health-checking tools help you to identify fabric error states and also unforeseen changes from baseline configuration. Health checking tools are run from a central server called the fabric management server.

These tools can also help manage nodes running the QLogic host stack. The set of functions that do this are not used with an IBM System p or IBM Power Systems high-performance computing (HPC) cluster, because Cluster Systems Management (CSM) is used for systems that are managed in these clusters.

Documentation Fast Fabric Toolset Users Guide

When to use These tools can be used during installation to search for problems. These tools can also be used for health checking when you have degraded performance.

Host Fabric management server

How to access v You can use Telnet or ssh to access the fabric management server.

v If you set up the server that is running the Fast Fabric tools as a managed device, you can send dsh commands to it from CSM.
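The idea of detecting unforeseen changes from a baseline configuration can be sketched in a few lines of Python. This is not how the Fast Fabric health-checking tools are implemented; the inventory format (port name mapped to link state) and the function name are assumptions made purely for illustration.

```python
def diff_from_baseline(baseline, current):
    """Return {key: (baseline_value, current_value)} for every difference."""
    changes = {}
    for key in baseline.keys() | current.keys():
        old, new = baseline.get(key), current.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical link inventory: port name -> link state.
baseline = {"switch1:p1": "active", "switch1:p2": "active"}
current = {"switch1:p1": "active", "switch1:p2": "down", "switch1:p3": "active"}
changes = diff_from_baseline(baseline, current)
for key in sorted(changes):
    print(key, changes[key])
```

A value of None on either side marks an item that exists only in the baseline or only in the current snapshot.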

Cluster Systems Management:

Cluster Systems Management (CSM) is a system administrator tool for monitoring and managing the cluster.

The following table provides an overview of CSM.

Table 14. Cluster Systems Management overview

Description The system administrator uses Cluster Systems Management to monitor and manage the cluster.

Documentation CSM Planning and Install Guide, CSM Administration Guide.

When to use Use CSM to monitor remote logs from the switches and fabric management servers and to remotely run commands on the switches and fabric management servers.

After you have configured the switch, configured the IP addresses of the fabric management server, configured the remote syslog function, and created the switch as a device, CSM can be used to monitor switch events and use dsh to send commands to the command-line interface on the switch.

Host CSM Management Server

How to access Use the CLI or the graphical user interface (GUI) on the CSM Management Server.

Hardware Management Console:

You can use the Hardware Management Console (HMC) to manage a group of servers.

The following table provides an overview of the HMC.

Table 15. HMC overview

HMC Details

Description Each HMC is assigned to the management of a group of servers. If there is more than one HMC in a cluster, this is accomplished by using the Cluster-Ready Hardware Server on the Cluster Systems Management/Management Server (CSM/MS).

Documentation HMC Users Guide

When to use Use the HMC to perform many functions, including:

v To set up and manage logical partitions, including host channel adapter (HCA) virtualization. For details, see the Logical partitioning topic.

v To access serviceable events for HCA and servers.

v To control the server hardware.

Host HMC

How to access Use the HMC console located near the system. There is generally a single keyboard and monitor with a console switch to access multiple HMCs in a rack (if there is a need for multiple HMCs).

You can also access the HMC through a supported Web browser on a remote server that can connect to the HMC.

Managing serviceable events on the HMC:

Problems on your managed system are reported to the HMC as serviceable events. You can view the problem, manage problem data, call home the event to your service provider, or repair the problem. On a regular basis, you should review and close any open serviceable events on the HMC.

Perform the following steps to manage serviceable events on the HMC.

1. Open the Manage Serviceable Events task from the Service Management work pane.

2. From the Manage Serviceable Events window, provide event criteria, error criteria, and FRU criteria. Alternatively, you can select All.

3. Click OK when you have specified the criteria you want for the serviceable events you want to view. A table appears with the serviceable events that match your criteria.

Note: Use the online Help if you need additional information about managing events.

Service processor:

The service processor is used to facilitate connectivity.

The following table provides an overview of the service processor.

Table 16. Service processor overview

Service processor Details

Description Cluster Systems Management (CSM) and the managing Hardware Management Console (HMC) must be able to communicate with the service processor over the service virtual local area network (VLAN). For 9125-F2A servers, this connectivity is facilitated through an internal hardware VLAN within the frame, which connects to the service VLAN.

Documentation IBM System Users Guide

When to use The service processor is in the background most of the time and the HMC and CSM provide the information. The service processor is sometimes accessed under the direction of product engineering.

Host IBM system

How to access The service processor is primarily used by service personnel. Direct access is rarely required, and is done under the direction of product engineering by using the Advanced System Management Interface (ASMI). Otherwise, CSM and the HMC are used to communicate with the service processor.

Network Time Protocol:

The Network Time Protocol (NTP) synchronizes the clocks in the management servers and switches.

The following table provides an overview of the NTP.

Table 17. Network Time Protocol overview

Network Time Protocol Details

Description The NTP keeps the time-of-day clocks of the switches and management servers synchronized. Synchronized clocks are important for correlating events in time.

Documentation NTP Users Guide

When to use The NTP is set up during installation.

Host The NTP Server

How to access The administrator accesses the NTP by logging on to the system on which the NTP server is running. This is done for configuration and maintenance. Normally, this is a background application.
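To see why synchronized clocks matter for event correlation, consider merging logs from two hosts into one timeline. The following Python sketch uses invented log entries and a simulated five-second clock error; the helper names are hypothetical. With synchronized clocks the switch event correctly appears first, and with the skewed clock the order is reversed.

```python
import heapq
from datetime import datetime, timedelta


def merge_logs(*logs):
    """Merge per-host logs, each already sorted by time, into one timeline."""
    return list(heapq.merge(*logs, key=lambda entry: entry[0]))


def shift(log, seconds):
    """Simulate a host whose clock is off by `seconds`."""
    return [(ts + timedelta(seconds=seconds), msg) for ts, msg in log]


t0 = datetime(2009, 1, 1, 12, 0, 0)
switch_log = [(t0, "switch: link down")]
server_log = [(t0 + timedelta(seconds=2), "server: HCA port error")]

in_sync = [msg for _, msg in merge_logs(switch_log, server_log)]
skewed = [msg for _, msg in merge_logs(shift(switch_log, 5), server_log)]
print(in_sync)
print(skewed)
```

In the skewed timeline the server error seems to precede the link failure that caused it, which is the kind of misleading ordering that NTP is meant to prevent.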

Fabric viewer:

The fabric viewer is an interface that is used to access the Fabric Management tools.

The following table provides an overview of the fabric viewer.

Table 18. Fabric viewer overview

Fabric viewer Details

Description The fabric viewer is a user interface that is used to access the Fabric Management tools on the various subnets. It is a Linux or Microsoft Windows application.

The fabric viewer must be able to connect to the cluster virtual local area network (VLAN) to connect to the switches. The fabric viewer must also connect to the subnet manager hosts through the same cluster VLAN.

Host Any Linux or Microsoft Windows host. Typically, these hosts would be one of the following items.

v Fabric management server

v System administrator or operator workstation

Documentation QLogic Fabric Viewer Users Guide

When to use After the switch is configured for communication to the fabric viewer, it can be used as the main point for queries and interaction with the switches. You can also use the fabric viewer to update the switch code simultaneously to multiple switches in the cluster. The fabric viewer can also be used during the installation process to set up e-mail notification for changes in link status and subnet manager and for changes in event management communication status.

How to access Start the graphical user interface (GUI) from the server on which you install the fabric viewer, or use a remote window access to start it. VNC is an example of a remote window access application.

E-mail notifications:

The e-mail notifications function can be enabled to trigger e-mails from the fabric viewer.

The following table provides an overview of e-mail notifications.

Table 19. E-mail notifications overview

E-mail notifications Details

Description E-mail notifications are a subset of events that can be enabled to trigger an e-mail from the fabric viewer. These notifications relate to link up and down problems and communication problems between the fabric viewer and parts of the Fabric Manager.

Typically, fabric viewer is used interactively and then is shut down after a session. This prevents the effective use of e-mail notification. If you want to use this function, you must have a copy of fabric viewer running continuously, for example, on the fabric management server.

Documentation Fabric Viewer Users Guide

When to use E-mail notification is set up during installation so that you can be notified of events as they occur.

Host Wherever Fabric Viewer is running.

How to access Setup for e-mail notification is done on the fabric viewer. The e-mail is accessed from wherever you have directed the fabric viewer to send the e-mail notifications.

Operating system:

The operating system is the interface with the device drivers.

The following table provides an overview of the operating system.

Table 20. Operating system overview

Operating system details More information

Description The operating system is the interface for the device drivers.

Documentation Operating system users guide

When to use To query the state of the host channel adapters (HCAs) and the availability of the HCAs to applications.

Host IBM system

How to access Use the dsh command from Cluster Systems Management (CSM), or use Telnet or the ssh command to access the logical partition.

Management subsystem networks:

The devices in the management subsystem are connected through various networks.

All the devices in the management subsystem are connected to at least two networks over which their applications communicate. Typically, the site connects key servers to a local network to provide remote access for managing the cluster. The networks are shown in the following table.

Table 21. Management subsystem networks overview

Type of network Details

Service VLAN The service virtual local area network (VLAN) is a private Ethernet network that provides connectivity between the service processors, bulk power adapters (BPAs), Cluster Systems Management/Management Server (CSM/MS), and the Hardware Management Console (HMC) to facilitate hardware control. The CSM documentation refers to this type of network as service VLAN or management VLAN.

Cluster VLAN The cluster VLAN (or network) is an Ethernet network (public or private) that gives CSM access to the operating systems. It is also used for access to InfiniBand switches and fabric management servers. CSM documentation refers to this type of network as the cluster VLAN.
Note: The switch vendor documentation refers to cluster VLAN as the service VLAN or the management network.

Public network A local site Ethernet network is typically attached to the CSM/MS and fabric management server. You might choose to put the cluster VLAN on the public network. Refer to CSM installation and planning documentation to consider the implications of combining a local and a public network.

Internal hardware VLAN The internal hardware VLAN connection is a VLAN within a frame of 9125-F2A servers. The internal hardware VLAN combines all of the service processor connections and the BPH connections onto an internal Ethernet hub, which provides a single connection to the service VLAN, which is external to the frame.

Vendor log flow to CSM event management

The integration of vendor and IBM log flows is a critical factor in event management.

One of the important points of integration for vendor and IBM management subsystems is log flow from vendor management applications to Cluster Systems Management (CSM) event management. This integration provides a consolidated logging point in the cluster. The flow of log information is shown in the following figure. For this integration to work, you must set up remote logging and CSM event management with the Fabric Management Server and the switches as described in “Setting up remote logging” on page 95.

The figure indicates where remote logging and CSM event management must be enabled for the flow to work.
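Remote logging of this kind typically relies on the syslog protocol, in which each message starts with a priority value computed from a facility and a severity. The following Python sketch shows that calculation with the standard syslog codes. The choice of the local6 facility, the host name, and the simplified message layout (the timestamp is omitted) are assumptions for illustration only; the values that this cluster requires are given in the remote logging setup procedure.

```python
# Facility and severity codes defined by the syslog protocol.
FACILITY = {"kern": 0, "user": 1, "daemon": 3, "local6": 22}
SEVERITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}


def syslog_priority(facility, severity):
    """PRI value that prefixes a syslog message: facility * 8 + severity."""
    return FACILITY[facility] * 8 + SEVERITY[severity]


def frame_message(facility, severity, host, text):
    """Build a minimal '<PRI>host text' line (timestamp omitted for brevity)."""
    return f"<{syslog_priority(facility, severity)}>{host} {text}"


# Hypothetical switch name and event text.
print(frame_message("local6", "err", "switch1", "link down on port 3"))
```

Because the receiving syslog daemon filters on facility and severity, the sender and the consolidated logging point must agree on both values for events to reach event management.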

Supported components in an HPC cluster

High-performance computing (HPC) clusters are implemented by using components that are approved and supported by IBM.

For details, see “Cluster information resources” on page 3.

The following table indicates the components or units that are supported in an HPC cluster.

Table 22. Supported HPC components

| Component type | Component | Model, feature, or minimum level |
|---|---|---|
| POWER6 processor-based servers | 2U high-end server | 9125-F2A |
| | High volume server 4U high | 8203-E4A; 8204-E8A |
| | Blade Server | 7988J22 |
| Operating system | AIX 5L™ | AIX 5L Version 5.3 with the 5300-08 Technology Level with Service Pack 2 |
| | SUSE Linux Enterprise Server (SLES) | SLES 10 SP2 Kernel Level 2.6.16.60-0.14-ppc64 with Service Pack 2 InfiniBand Device Driver |
| Switch | QLogic | QLogic 9024CU Managed 24-port DDR InfiniBand Switch; QLogic 9040 48-port DDR InfiniBand Switch; 9080; 9140; QLogic 9240 288-port DDR InfiniBand Switch |
| IBM GX host channel adapters (HCAs) | IBM GX+ HCA for 9125-F2A | |
| | IBM GX HCA for 8203-E4A and 8204-E8A | |
| JS22 HCA | Mellanox 4X Connect-X HCA | 8258 |
| JS22 Pass-through module | Voltaire High Performance InfiniBand QSFP to CX4 Pass-Through Module for IBM BladeCenter | 3216 |
| Cable | CX4 to CX4 | 7978AC1 |
| | QSFP to CX4 | 7979AC1 |
| Management node for InfiniBand fabric | IBM System x 3550 (1U high) | |
| | IBM System x 3650 (2U high) | |
| HCA for management node | QLogic Dual-Port 4X DDR InfiniBand PCIe HCA | |
| Fabric Manager | QLogic host-based Fabric Manager (embedded not recommended) | 4.2.1.1.1 |
| Quicksilver or InfiniServ host stack with Fast Fabric | QLogic host stack and Fast Fabric Toolset | 4.2.0.2.1 |
| Switch firmware | QLogic firmware for the switch | 4.2.1.1.1 |
| Cluster System Management (CSM) | AIX | 1.7.0.13 with APAR IZ23836 |
| | Linux | 1.7.0.13 with APAR IZ23836 |
| Hardware Management Console (HMC) | HMC | V7R3.3.0 HMC build level 20080518.1 MH01105_0519 |

Figure 7. Vendor log flow to CSM event management

Planning for clusters

This information covers the key elements for planning a cluster that uses InfiniBand technologies for the communications fabric.

Related concepts

“Cluster information resources” on page 3
Information resources from IBM and QLogic™ can help you plan for clustering on your InfiniBand network.

Getting started with cluster planning

When planning a cluster with an InfiniBand network, you bring together many different devices and management tools to form a cluster.

The following major components are part of a cluster.

v Servers

v I/O devices

v InfiniBand network devices

v Frames (racks)

v Service virtual local area networks (VLANs) include the following items:

  – Hardware Management Console (HMC)

  – Ethernet devices

  – Cluster Systems Management (CSM) server (for environments with multiple HMCs)

v Management networks include the following items:

  – A CSM server

  – Servers to provide operating system access from the CSM

  – InfiniBand switches

  – A fabric management server

  – A Network Installation Management (NIM) server (for AIX servers with no removable media capabilities on which you want to run stand-alone diagnostics)

  – A distribution server (for Linux servers with no removable media capabilities on which you want to run stand-alone diagnostics)

v System management applications include the following items:

  – HMC

  – CSM

  – Fabric Manager

  – Other QLogic management tools such as Fast Fabric Toolset, Fabric Viewer and Chassis Viewer

v Physical characteristics, such as weight and dimensions

v Electrical characteristics

v Cooling characteristics

Cluster planning overview

Use this information as a road map to guide you through the cluster planning process.

If you read through the cluster planning overview without following the links, you will gain an understanding of the overall cluster planning strategy. Then you can follow the links that direct you through the different procedures to gain an in-depth understanding of the cluster planning process.

The planning procedures are arranged in a sequential order for a new cluster installation. If you are not installing a new cluster, you might need to choose which procedures to use. However, you should still perform them in the order they appear in the Cluster planning overview.

To plan your cluster, complete the following tasks:

1. Gather and review the planning and installation information for the components in the cluster. See “Cluster information resources” on page 3 as a starting point for where to obtain the information. This information provides supplemental documentation with respect to clustered computing with an InfiniBand network. You must understand all of the planning information for the individual components before continuing with this planning overview.

2. Review the “Planning check list” on page 59, which can help you track the planning steps that you have completed.

3. Review the “Required level of support, firmware, and devices” on page 30 to understand the minimal level of software and firmware required to support clustering with an InfiniBand network.

4. Review the planning resources for the individual servers that you want to use in your cluster. See “Server planning” on page 32.

5. Review “Planning InfiniBand network cabling and configuration” on page 32 to understand the network devices and configuration. The planning information addresses the following items:

   v “Planning InfiniBand network cabling and configuration” on page 32.

   v “Configuring an IBM GX host channel adapter” on page 36. For vendor host channel adapter (HCA) planning, use the vendor documentation.

6. Review the “Management subsystem planning” on page 37. The management subsystem planning addresses the following items.

   v Understanding when Cluster Systems Management (CSM) is needed

   v Learning how the Hardware Management Console works in a cluster

   v Learning about Network Installation Management (NIM) servers (for AIX) and distribution servers (for Linux)

   v “Planning CSM as your systems management application” on page 39

   v “Planning for QLogic fabric management applications” on page 40

   v “Planning for the fabric management server” on page 47

   v “Planning event monitoring with QLogic and CSM” on page 48

   v “Planning to run remote commands with QLogic from the CSM/MS” on page 49

7. When you understand the devices in your cluster, review “Frame planning” on page 50, to ensure that you have appropriately planned where to put devices in your cluster.

8. After you understand the basic concepts for planning the cluster, review the high-level installation flow information in “Planning installation flow” on page 50. It contains hints about planning your installation, and contains guidelines to help you to coordinate between your responsibilities, and the IBM service representative responsibilities, and vendor responsibilities during the installation process.

9. Consider special circumstances such as whether you are configuring applications that use message-passing-interface (MPI) in a high-performance computing (HPC) environment. For more information, see “Planning for an HPC MPI configuration” on page 58.

10. For more hints and tips on installation planning, see “Tips for planning cluster hardware” on page 58.

If you have completed all the previous steps, you can plan in more detail by using the planning work sheets provided in “Planning work sheets” on page 60.

When you are ready to install the components with which you plan to build your cluster, review information in the readme file and online information related to the software and firmware to ensure that you have the latest information and the latest supported levels of firmware.

If this is the first time you have read the planning overview and you understand the overall intent of the planning tasks, go back to the beginning and start accessing the links and cross-references to get more details.

Required level of support, firmware, and devices

Use this information to find the minimum requirements necessary to support InfiniBand network clustering.

The following tables list the minimum hardware, software, and firmware requirements that are necessary to support InfiniBand network clustering.

Note: For the most recent updates to this information, see the Facts and features report Web site (http://www.ibm.com/servers/eserver/clusters/hardware/factsfeatures.html).

Table 23 on page 31 lists the model or feature that is needed to support the given device.

Table 23. Verified and approved hardware associated with a POWER6 processor-based IBM System p or IBM Power Systems server cluster with an InfiniBand network

| Device | Model or feature |
|---|---|
| Servers | POWER6: IBM Power 520 Express (8203-E4A) (four unit rack-mounted server); IBM Power 550 Express (8204-E8A) (four unit rack-mounted server); IBM Power 575 (9125-F2A) |
| Switches | QLogic: QLogic 9024CU Managed 24-port DDR InfiniBand Switch; QLogic 9040 48-port DDR InfiniBand Switch; QLogic 9080 96-port DDR InfiniBand Switch; QLogic 9120 144-port DDR InfiniBand Switch; QLogic 9240 288-port DDR InfiniBand Switch |
| Host channel adapters (HCAs) | The feature code is dependent on the server you have. Order one or more InfiniBand GX, dual-port HCAs for each server that requires connectivity to InfiniBand networks. The maximum number of HCAs allowed depends on the server model. |
| Fabric management server | IBM System x 3550 or 3650; SLES 10 Linux; QLogic HCAs |

Notes:

v High-performance computing (HPC) must be proven and validated to work in an IBM HPC cluster environment.

v For approved IBM Power Systems and System p InfiniBand configurations, see the Facts and features report Web site (http://www.ibm.com/servers/eserver/clusters/hardware/factsfeatures.html).

Table 24 lists the minimum levels of software and firmware that are associated with an InfiniBand cluster.

Table 24. Minimum levels of software and firmware associated with an InfiniBand cluster

| Software | Minimum level |
|---|---|
| AIX | AIX 5L Version 5.3 with the 5300-08 Technology Level with Service Pack 2 |
| SUSE Linux Enterprise Server 10 | SUSE Linux Enterprise Server 10 with SP2; IBM InfiniBand GX+ HCA driver and OpenIB/Gen2 stack available in SLES10SP2-AS |
| Hardware Management Console | Version 7 Release 3.3 |
| System firmware level for IBM Power Systems server or System p server | |
| InfiniBand switch firmware | QLogic 4.2.1.1 |
| Fabric Manager | QLogic 4.2.1.1 |
| Fast Fabric Toolset | |
| QLogic host stack for fabric management server | Infinserv 4.2.0.0.39 for QLogic |

For the most recent support information, see the IBM Clusters with the InfiniBand Switch Web site.


Server planning

This information provides server planning requirements that are relative to the fabric.

When you plan for servers in relation to the fabric, consider the following information:
v The number of each type of server you require.
v The type of operating systems running on each server.
v The number and type of host channel adapters (HCAs) that are required in each server.
v Which types of HCAs are required in each server.
v The IP addresses that are needed for the InfiniBand network. For details, see “IP subnet addressing restriction” on page 37.
v The IP addresses that are needed for the service virtual local area network (VLAN) for service processor access from the Cluster Systems Management (CSM) and the Hardware Management Console (HMC).
v The IP addresses for the cluster VLAN to allow operating system access from CSM.

Note: You cannot create logical partitions in high-performance computing (HPC) clusters.

Along with server planning documentation, you can use the “Server planning work sheet” on page 64 as a planning aid. You can also review server installation documentation to help plan for the installation. When you have identified the frames in which you plan to install your servers, record the information on the “Frame and rack planning work sheet” on page 63.

Planning InfiniBand network cabling and configuration

Before you plan your InfiniBand network cabling, review the hardware installation and cabling information for your vendor switch.

While planning for cabling, evaluate the IBM server and frame physical characteristics that affect cable planning. Consider the following characteristics.
v The server height and placement in the frame to plan for cable routing within the frame. The server placement affects the distance of the host channel adapter (HCA) connectors from the top of the raised floor.
v The routing to the cable entrance of a frame.
v The cable routing within a frame, especially with respect to bend radius and cable management.
v The floor depth.
v The plan for connections from the fabric management servers. For more information, see “Planning for the fabric management server” on page 47.

If you are using 12X HCAs (for example, in a 9119-590 server), review “Planning for 12X host channel adapter connections” on page 58 to understand the unique cabling and configuration requirements when using these adapters with the available 4X switches.

You can record the cable connection information in the “QLogic switch planning work sheets” on page 66 for the switch port connections, and in a “Server planning work sheet” on page 64 for the HCA port connections.

Planning for QLogic InfiniBand switch configurations

You can plan for QLogic switch configurations by using QLogic planning resources, including general planning guides and planning guides specific to the model being installed.

QLogic switches require custom configuration to work correctly in high-performance computing (HPC) clusters that use IBM Power Systems servers. You must plan and record the following configuration settings.
v A static IP address for the cluster virtual local area network (VLAN)
v Chassis maximum transfer units (MTU) value
v Switch name
v 12X cabling considerations
v SSH access to the switches (disable Telnet)
v Remote logging destination (use Cluster Systems Management/Management Server (CSM/MS) if possible)
v New chassis passwords

The IP addressing that a QLogic switch has on the management Ethernet network is configured for static addressing. These addresses are associated with the switch management function. The following items are important concepts for QLogic management functions.
v Each 9024 switch has a single address that is associated with its management Ethernet connection.
v All other QLogic switches have one or more managed spine cards per chassis. If you want backup capability for the management subsystem, you must have more than one managed spine in a chassis.
v Each managed spine has its own address so that it can be addressed directly.
v Each switch chassis also gets a management Ethernet address that is assumed by the master management spine. This allows you to use a single address to query the chassis regardless of which spine is the master spine. To set up management parameters (such as which spine is the master spine), each managed spine must have a separate address.
v The QLogic 9240 switch chassis is divided into two managed hemispheres. Therefore, a master and backup managed spine within each hemisphere is required, creating a total of four managed spines.
  – Each managed spine has its own management Ethernet address.
  – The chassis has two management Ethernet addresses, one for each hemisphere.
  – Review the 9240 Users Guide to ensure that you understand which spine slots are used for managed spines.
v The total number of management Ethernet addresses is determined by the switch model.
  – The 9024 has one address.
  – The 9240 has from 4 (no redundancy) to 6 (full redundancy) addresses.
  – All other models have from 2 (no redundancy) to 3 addresses.
v For topology and cabling, see “Planning InfiniBand network cabling and configuration” on page 32.
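The address counts in the preceding list can be totaled when you size the static address pool for the management Ethernet network. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that the upper figure for each model corresponds to a chassis with backup managed spines.

```python
# Management Ethernet address counts per QLogic switch model, taken from
# the list above as (no redundancy, with backup managed spines).
ADDRESS_RANGE = {
    "9024": (1, 1),
    "9240": (4, 6),
}
OTHER_MODELS = (2, 3)  # applies to the remaining models (9040, 9080, 9120)


def management_addresses(model, redundant):
    """Return the number of management Ethernet addresses for one chassis."""
    low, high = ADDRESS_RANGE.get(model, OTHER_MODELS)
    return high if redundant else low


def total_addresses(models, redundant):
    """Return the number of static addresses to reserve for a list of chassis."""
    return sum(management_addresses(model, redundant) for model in models)


if __name__ == "__main__":
    # Hypothetical cluster: one 9024, one 9120, and one 9240 chassis.
    print(total_addresses(["9024", "9120", "9240"], redundant=True))
```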

The chassis maximum transmission unit (MTU) must be set to an appropriate value for each switch in a cluster. For more information, see “Planning for maximum transfer units (MTUs)” on page 34.

For each subnet, you need to plan a different GID-prefix. For more information, see “Planning for global identifier prefixes” on page 35.

You can assign a name to each switch. The name can be one that indicates the physical location of the switch in the data center. You might want to include the frame and slot in which the switch is installed. The key is a consistent naming convention that is meaningful to you and your service provider.

If you have a 4X switch connecting to a 12X host channel adapter (HCA), a 12X-to-4X width exchanger cable is required. For more details, see “Planning for 12X host channel adapter connections” on page 58.

While passwordless ssh is recommended from CSM/MS and the fabric management server to the switch chassis, you can also change the switch chassis default password early in the installation process. For Fast Fabric Toolset functions, all the switch chassis passwords can be the same.

You can also consolidate switch chassis logs and embedded Subnet Manager logs into a central location. If possible, use CSM as the Systems Management application. Then, the CSM/MS can be the recipient of the remote logs from the switch. You can only direct logs from a switch to a single remote host (CSM/MS). For details on setting up remote logging in a cluster, see “Setting up remote logging” on page 95.

The information gathered here can be recorded in the “QLogic switch planning work sheets” on page 66.

Planning for maximum transfer units (MTUs):

Use this information to plan for maximum transfer units (MTU).

Based on your configuration, different MTUs can be used.

Table 25 lists the MTU values that the message-passing-interface (MPI) and Internet Protocol (IP) require for maximum performance.

The cluster type indicates the type of cluster based on the generation and type of host channel adapters (HCAs) that are used. You either have a homogeneous cluster where all the HCAs are of the same generation and type, or a heterogeneous cluster where the HCAs are a mix of generations and types. Cluster composition by HCA indicates the actual generation and type of HCAs being used in the cluster.

Switch and subnet manager settings indicate the settings for the switch chassis and subnet manager. The chassis MTU is used by the switch chassis and applies to the entire chassis. The chassis MTU can be set the same for all chassis in the cluster. Furthermore, the chassis MTU affects the message-passing-interface. The broadcast MTU is set by the subnet manager and affects IP. It is part of the broadcast group settings. It can be the same for all broadcast groups.

The message-passing-interface MTU indicates the setting that the message-passing-interface requires for the configuration. The IP MTU indicates the setting that IP requires. The message-passing-interface and IP MTU are included in Table 25 to illustrate the settings indicated in the switch and subnet manager settings column. The BC rate is the broadcast MTU rate setting, which can be either 10 GB (3) or 20 GB (6). The SDR switches run at 10 GB and DDR switches run at 20 GB.

The number in parentheses in Table 25 indicates the parameter setting in the firmware and subnet manager that represents that setting.

Table 25. MTU settings

Cluster type: Homogeneous HCAs
Cluster composition by HCA: System p5® GX HCA only
Switch and subnet manager settings: Chassis MTU = 2 KB (4); Broadcast MTU = 2 KB (4); BC rate = 10 GB (3)
Message-passing-interface MTU: 2 KB
Internet Protocol MTU: 2 KB

Cluster type: Homogeneous HCAs
Cluster composition by HCA: System p (POWER6) GX HCA only in 9125-F2A
Switch and subnet manager settings: Chassis MTU = 4 KB (5); Broadcast MTU = 4 KB (5); BC rate = 10 GB (3) for SDR switches, or 20 GB (6) for DDR switches
Message-passing-interface MTU: 4 KB
Internet Protocol MTU: 4 KB

Cluster type: Homogeneous HCAs
Cluster composition by HCA: POWER6 GX HCA in 8203-E4A or 8204-E8A
Switch and subnet manager settings: Chassis MTU = 2 KB (4); Broadcast MTU = 2 KB (4); BC rate = 10 GB (3) for SDR switches, or 20 GB (6) for DDR switches
Message-passing-interface MTU: 2 KB
Internet Protocol MTU: 2 KB

Cluster type: Homogeneous HCAs
Cluster composition by HCA: ConnectX HCA only
Switch and subnet manager settings: Chassis MTU = 2 KB (4); Broadcast MTU = 2 KB (4); BC rate = 10 GB (3) for SDR switches, or 20 GB (6) for DDR switches
Message-passing-interface MTU: 2 KB
Internet Protocol MTU: 2 KB

Cluster type: Heterogeneous HCAs
Cluster composition by HCA: GX HCA in 9125-F2A (compute servers, see footnote 1) and GX HCA in 8204-E8A or 8203-E4A (General Parallel File System (GPFS) servers)
Switch and subnet manager settings: Chassis MTU = 4 KB (5); Broadcast MTU = 2 KB (4); BC rate = 10 GB (3)
Message-passing-interface MTU: Between compute server only = 4 KB
Internet Protocol MTU: 2 KB

Cluster type: Heterogeneous HCAs
Cluster composition by HCA: POWER6 GX HCA (compute server) and p5 HCA (GPFS servers)
Switch and subnet manager settings: Chassis MTU = 4 KB (5); Broadcast MTU = 2 KB (4); BC rate = 10 GB (3)
Message-passing-interface MTU: Between POWER6 only = 4 KB
Internet Protocol MTU: 2 KB

Cluster type: Heterogeneous HCAs
Cluster composition by HCA: ConnectX HCA (compute server) and p5 HCA (GPFS servers)
Switch and subnet manager settings: Chassis MTU = 2 KB (4); Broadcast MTU = 2 KB (4); BC rate = 10 GB (3)
Message-passing-interface MTU: 2 KB
Internet Protocol MTU: 2 KB

1 Compute servers primarily perform computation and the main work of applications.

Record the configuration settings for the Fabric Managers in the “QLogic Fabric Management work sheets” on page 72.

Record the configuration settings for the switches in the “QLogic switch planning work sheets” on page 66.
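When you fill in the work sheets, it can help to translate the planned MTU and rate values into the parameter codes that Table 25 shows in parentheses. The following Python sketch is illustrative only. It assumes the mapping 2 KB = 4, 4 KB = 5, 10 GB = 3, and 20 GB = 6; the function name and the dictionary keys are hypothetical and are not firmware or subnet manager parameter names.

```python
# Parameter codes as they appear in parentheses in Table 25.
MTU_CODE = {"2 KB": 4, "4 KB": 5}
RATE_CODE = {"10 GB": 3, "20 GB": 6}


def switch_settings(chassis_mtu, broadcast_mtu, bc_rate):
    """Translate planned values into the codes recorded on the work sheets.

    Raises KeyError for a value that Table 25 does not list.
    """
    return {
        "chassis_mtu": MTU_CODE[chassis_mtu],
        "broadcast_mtu": MTU_CODE[broadcast_mtu],
        "bc_rate": RATE_CODE[bc_rate],
    }


if __name__ == "__main__":
    # Heterogeneous cluster: 9125-F2A compute servers with 8204-E8A GPFS servers.
    print(switch_settings("4 KB", "2 KB", "10 GB"))
```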

Planning for global identifier prefixes:

This information describes why and how to plan for fabric global identifier (GID) prefixes in an IBM System p high-performance computing (HPC) cluster.

Each subnet in the InfiniBand network must be assigned a GID prefix, which is used to identify the subnet for addressing purposes. The GID prefix is an arbitrary assignment with a format of xx:xx:xx:xx:xx:xx:xx:xx (for example, FE:80:00:00:00:00:00:01). The default GID prefix is FE:80:00:00:00:00:00:00.

The GID prefix is set by the subnet manager. Therefore, each instance of the subnet manager must be configured with the appropriate GID prefix. On any given subnet, all instances of the subnet manager (master and backups) must be configured with the same GID prefix.
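Because the GID prefix is an arbitrary 8-byte value, one simple way to plan distinct prefixes is to count upward from the default. The following Python sketch is illustrative only: the sequential numbering scheme and the function names are assumptions, and any scheme works as long as each subnet gets a different prefix.

```python
DEFAULT_GID_PREFIX = 0xFE80000000000000


def format_gid_prefix(value):
    """Format a 64-bit prefix as xx:xx:xx:xx:xx:xx:xx:xx."""
    raw = value.to_bytes(8, "big")
    return ":".join(f"{byte:02X}" for byte in raw)


def plan_gid_prefixes(subnet_count):
    """Return one distinct prefix per subnet, counting up from the default."""
    return [format_gid_prefix(DEFAULT_GID_PREFIX + i) for i in range(subnet_count)]


if __name__ == "__main__":
    for subnet, prefix in enumerate(plan_gid_prefixes(4), start=1):
        print(f"subnet {subnet}: {prefix}")
```

Every subnet manager instance (master and backups) on a given subnet would then be configured with the prefix planned for that subnet.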


Configuring an IBM GX host channel adapter

An IBM GX host channel adapter (HCA) must have certain configuration settings to work in an IBM POWER® InfiniBand cluster.

The following configuration settings are required to work with an IBM POWER InfiniBand cluster.
v Globally-unique identifier (GUID) index
v Capability
v Global identifier (GID) prefix for each port of a host channel adapter (HCA)

InfiniBand subnet IP addressing is based on subnet restrictions. For more information, see “IP subnet addressing restriction” on page 37. Each physical InfiniBand HCA contains a set of 16 GUIDs that can be assigned to logical partition profiles. These are used to address logical HCA (LHCA) resources on an HCA. You can assign multiple GUIDs to each profile, but you can assign only one GUID from each HCA to each partition profile. Each GUID can be used by only one logical partition at a time. You can create multiple logical partition profiles with the same GUID, but only one of those logical partition profiles can be activated at a time.

The GUID index is used to choose one of the 16 GUIDs available for an HCA. It can be any number from 1 through 16. Often, you can assign a GUID index based on which logical partition and profile you are configuring. For example, on each server you might have four logical partitions. The first logical partition on each server might use a GUID index of 1, the second would use a GUID index of 2, and so on.
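The GUID rules above can be checked against a plan before any profile is activated. The following Python sketch is illustrative only: the function names, the partition names, and the HCA labels are hypothetical, and the check covers only the rule that a GUID can be used by one active logical partition at a time.

```python
GUIDS_PER_HCA = 16


def guid_index_for_partition(partition_number):
    """Use the partition number on the server as its GUID index (1 through 16)."""
    if not 1 <= partition_number <= GUIDS_PER_HCA:
        raise ValueError("GUID index must be from 1 through 16")
    return partition_number


def conflicting_active_profiles(active):
    """Find GUIDs that more than one active partition would use.

    active is a list of (partition, hca, guid_index) tuples for the profiles
    that are planned to be active at the same time. Returns a list of
    (hca, guid_index, first_partition, second_partition) conflicts.
    """
    seen = {}
    conflicts = []
    for partition, hca, guid_index in active:
        key = (hca, guid_index)
        if key in seen and seen[key] != partition:
            conflicts.append((hca, guid_index, seen[key], partition))
        else:
            seen[key] = partition
    return conflicts


if __name__ == "__main__":
    plan = [("lpar1", "hca1", 1), ("lpar2", "hca1", 2), ("lpar3", "hca1", 2)]
    print(conflicting_active_profiles(plan))
```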

The Capability setting is used to indicate the level of sharing that can be done. The levels of sharing are as follows.
1. Low
2. Medium
3. High
4. Dedicated

Although the GID prefix for a port is not explicitly set, it is important to understand the subnet to which a port is attached. This is determined by the switch to which the HCA port is connected. The GID prefix is configured for the switch.

Record the configuration settings in the “Server planning work sheet” on page 64, which is used to record HCA configuration information.

Note: The 9125-F2A servers with an InfiniBand interface on the I/O planar might have an extra InfiniBand device defined. The defined device is always iba3. Delete iba3 from the configuration.


Related concepts

“Planning for global identifier prefixes” on page 35
This information describes why and how to plan for fabric global identifier (GID) prefixes in an IBM System p high-performance computing (HPC) cluster.

Related information

Partition profile

IP subnet addressing restriction

There are restrictions to how you can configure Internet Protocol (IP) subnet addressing in a server that is attached to an InfiniBand network.

Both IP and InfiniBand use the term subnet. These are two distinctly different entities.

The IP addresses for the host channel adapter (HCA) network interfaces must be set up so that no two IP addresses in a given logical partition are on the same IP subnet. When planning for the IP subnets in the cluster, as many separate IP subnets can be established as there are IP addresses on a given logical partition.

The subnets can be set up so that all IP addresses in a given IP subnet are connected to the same InfiniBand subnet. If there are n network interfaces on each logical partition connected to the same InfiniBand subnet, then n separate IP subnets can be established.

Note: This IP subnetting limitation does not prevent multiple adapters or ports from being connected to the same InfiniBand subnet. It is only an indication of how the IP addresses must be configured.
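The restriction can be checked for a planned set of interface addresses by using the Python standard library ipaddress module. The following sketch is illustrative only: the function name and the sample addresses are hypothetical.

```python
import ipaddress
from collections import Counter


def duplicate_ip_subnets(interfaces):
    """Return the IP subnets that appear more than once in one logical partition.

    interfaces is a list of "address/prefix-length" strings, one for each HCA
    network interface in the partition. An empty result means that the plan
    meets the restriction.
    """
    networks = [ipaddress.ip_interface(entry).network for entry in interfaces]
    counts = Counter(networks)
    return sorted(str(network) for network, count in counts.items() if count > 1)


if __name__ == "__main__":
    # Two interfaces of one partition that were placed on the same IP subnet.
    print(duplicate_ip_subnets(["10.0.1.5/24", "10.0.1.6/24", "10.0.2.5/24"]))
```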

Management subsystem planning

This information is a summary of the planning required for the components of the management subsystem.

The management subsystem includes the following components:
v The service and cluster virtual local area network (VLAN)
v Hardware Management Console (HMC)
v Systems Management application and server
v Vendor fabric management applications
v Network Installation Management (NIM) server and distribution server

Information on planning for the management subsystem is provided, along with information to help you plan for the frames that house the management consoles.

Customer-supplied Ethernet service and cluster VLANs are required to support the InfiniBand cluster computing environment. The number of Ethernet connections depends on the number of servers, bulk power controllers (BPCs) in 24-inch frames, InfiniBand switches, and HMCs in the cluster. The Systems Management application and server, which might include Cluster-Ready Hardware Server (CRHS) software, would also require a connection to the service VLAN.

Note: While you can have two service VLANs on different subnets to support redundancy in IBM servers, BPCs, and HMCs, the InfiniBand switches only support a single service VLAN, even though some InfiniBand switch models have multiple Ethernet connections. The Ethernet connections connect to different management processors and, therefore, can connect to the same Ethernet network.

An HMC might be required to manage the logical partitions, and to configure the GX bus host channel adapters (HCAs) in the servers. The maximum number of servers that can be managed by an HMC is 32. When there are more than 32 servers, additional HMCs are required.
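The limit of 32 servers for each HMC gives a minimum HMC count for a cluster. The following Python sketch is illustrative only: the function name is hypothetical, and the result is a lower bound that does not account for redundant HMCs.

```python
import math

MAX_SERVERS_PER_HMC = 32


def hmcs_required(server_count):
    """Return the minimum number of HMCs for the given number of servers."""
    if server_count < 0:
        raise ValueError("server_count must not be negative")
    return math.ceil(server_count / MAX_SERVERS_PER_HMC)


if __name__ == "__main__":
    for servers in (16, 32, 33, 100):
        print(servers, "servers:", hmcs_required(servers), "HMC(s)")
```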


If you require more than one HMC to manage your cluster servers and switches, you must use CRHS in Cluster Systems Management (CSM) on a CSM Management Server. See the CSM Install and Planning Guide on the IBM Cluster information center at http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.csm.doc/csm141/am7il11018.html. You can also use CSM when you have only one HMC.

If you have a single HMC in the cluster, it is normally configured to be the required Dynamic Host Configuration Protocol (DHCP) server for the service VLAN, and the Cluster Systems Management/Management Server (CSM/MS) is the DHCP server for the cluster VLAN. If CRHS and CSM are used in the cluster, the CSM Management Server is typically set up as the DHCP server for the service and cluster VLANs, and CRHS must be configured to recognize the servers, bulk power assemblies (BPAs), and HMCs. See the CSM: Administration Guide located on the QLogic Web site at http://www.qlogic.com/default.aspx.

The servers have connections to the service and cluster VLANs. See CSM documentation for more information about the cluster VLAN. See the server documentation for more information about connecting to the service VLAN. In particular, consider the following items.
v The number of service processor connections from the server to the service VLAN
v If there is a BPC for the power distribution, as in a 24-inch frame, it might provide a hub for the processors in the frame, allowing for a single connection for each frame to the service VLAN.

After you know the number of devices and cabling of your service and cluster VLANs, you need to consider the device IP addressing. The following items are the key considerations for IP addressing:
1. Determine the domain addressing and netmasks for the Ethernet networks that you implement.
2. Assign static IP addresses:
   a. Assign a static IP address for HMCs when you are using CSM and CRHS. This is mandatory when you have multiple HMCs in the cluster.
   b. Assign a static IP address for switches when you are using CSM and CRHS. This is mandatory when you have multiple HMCs in the cluster.
3. Determine the DHCP range for each Ethernet subnet.
4. If you must use CSM and CRHS, the DHCP server must be on the CSM Management Server, and all HMCs must have their DHCP server capability disabled. Otherwise, you are in a single HMC environment where the HMC is the DHCP server for the service VLAN.
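The DHCP roles described above depend on the number of HMCs and on whether CSM with CRHS is used. The following Python sketch summarizes that decision. It is illustrative only: the function name and the dictionary keys are hypothetical, and it encodes the typical setup described in this topic rather than every supported arrangement.

```python
def dhcp_plan(hmc_count, uses_csm_crhs):
    """Return which component acts as DHCP server on each VLAN."""
    if hmc_count < 1:
        raise ValueError("at least one HMC is expected")
    if hmc_count > 1 and not uses_csm_crhs:
        raise ValueError("more than one HMC requires CRHS in CSM")
    if uses_csm_crhs:
        # The CSM Management Server serves both VLANs, and every HMC has its
        # DHCP server capability disabled.
        return {
            "service_vlan": "CSM/MS",
            "cluster_vlan": "CSM/MS",
            "hmc_dhcp_enabled": False,
        }
    # Single HMC environment: the HMC serves the service VLAN.
    return {
        "service_vlan": "HMC",
        "cluster_vlan": "CSM/MS",
        "hmc_dhcp_enabled": True,
    }


if __name__ == "__main__":
    print(dhcp_plan(hmc_count=1, uses_csm_crhs=False))
    print(dhcp_plan(hmc_count=3, uses_csm_crhs=True))
```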

If there are servers in the cluster that do not have removable media (CD or DVD) capabilities, you need a Network Installation Management (NIM) server for stand-alone diagnostics. If you are using the AIX operating system in your logical partitions, AIX also provides NIM service for the logical partition. The NIM server is on the cluster VLAN.

If there are servers running the Linux operating system on your logical partitions that do not have removable media (CD or DVD) capabilities, a distribution server is required. The “Cluster summary work sheet” on page 61 can be used to record the information for your management subsystem planning.

You need to plan for frames or racks for the management servers. You can consolidate the management servers into the same rack if possible. The following management servers can be considered.
v HMC
v CSM management server
v Fabric management server
v NIM (AIX) and distribution servers (Linux)
v Network time protocol (NTP) server

Further management subsystem considerations include:
v Reviewing “Installing and configuring the management subsystem” on page 83 for the management subsystem installation tasks. The information helps you to assign tasks in the “Installation coordination work sheet” on page 57.
v “Planning CSM as your systems management application”
v “Planning for QLogic fabric management applications” on page 40
v “Planning for the fabric management server” on page 47
v “Planning event monitoring with QLogic and CSM” on page 48
v “Planning to run remote commands with QLogic from the CSM/MS” on page 49

Planning CSM as your systems management application

When you use Cluster-Ready Hardware Server (CRHS) with Cluster Systems Management (CSM), the CSM Management Server (CSM/MS) is typically the Dynamic Host Configuration Protocol (DHCP) server for the service virtual local area network (VLAN). If the cluster VLAN is a public site network or a local site network, then it is possible that another server might be set up as the DHCP server.

The configuration settings planned in this topic can be recorded in the “Cluster Systems Management planning work sheet” on page 77.

You must set up the CSM/MS to be a stand-alone server. If you use one of the compute servers or I/O servers in the cluster for CSM, the CSM operation might degrade performance for user applications and complicate the installation process with respect to server setup and discovery on the service VLAN.

You must also set up CSM event management to be used in a cluster. To do this, you need to plan for the following items:
v The type of syslogd command that you are going to use. At the least, you need to understand the default syslogd command that comes with the operating system on which CSM is run.
v Whether you want to use Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) as the protocol for transferring syslog entries from the fabric management server to the CSM/MS. You must use UDP if the CSM/MS is using the syslog command. If the CSM/MS has the syslog-ng logging application installed, you can use TCP for better reliability. The switches only use UDP.
v If the syslog-ng application is used on the CSM/MS, a source reference (SRC) line controls the IP addresses and ports over which the syslog-ng application accepts logs. The default setup is address 0.0.0.0, which means all addresses. For added security, you might want to plan to have an SRC definition for each switch IP address and each fabric management server IP address rather than opening all IP addresses on the service VLAN. For information about the format of the SRC line, see “Setting up remote logging” on page 95.
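The protocol rules in the preceding list reduce to a small decision. The following Python sketch is illustrative only: the function name and the string values for the sender and the logging application are hypothetical labels, not configuration keywords.

```python
def allowed_syslog_protocols(sender, csm_logger):
    """Return the protocols that a sender can use to reach the CSM/MS.

    sender is "switch" or "fabric_management_server".
    csm_logger is "syslog" or "syslog-ng", the application on the CSM/MS.
    """
    if sender not in ("switch", "fabric_management_server"):
        raise ValueError("unknown sender: " + sender)
    if csm_logger not in ("syslog", "syslog-ng"):
        raise ValueError("unknown logging application: " + csm_logger)
    # Switches only use UDP, and the syslog command accepts only UDP.
    if sender == "switch" or csm_logger == "syslog":
        return ["udp"]
    # With syslog-ng on the CSM/MS, TCP is available for better reliability.
    return ["tcp", "udp"]


if __name__ == "__main__":
    print(allowed_syslog_protocols("fabric_management_server", "syslog-ng"))
    print(allowed_syslog_protocols("switch", "syslog-ng"))
```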

Running the remote command from the CSM/MS to the fabric management servers is advantageous when you have more than one Fabric Management Server in a cluster. To start the remote command to the fabric management servers, you need to research how to exchange Secure Shell (SSH) keys between the fabric management server and the CSM/MS. This is standard, open SSH protocol setup as done in either the AIX operating system or the Linux operating system.

If you do not require a CSM Management Server, you might need a server to act as a Network Installation Manager (NIM) server for diagnostics. This is the case for servers that do not have removable media (CD or DVD) capabilities, such as the 9118-575.

If you have servers with no removable media capabilities that are running Linux logical partitions, you might need to have a server act as a distribution server.

If you require both a NIM server and a distribution server, and you choose the same server for both, a reboot is needed to change between the services. If the NIM server is used only for stand-alone diagnostics, it might be acceptable to use the same server. However, using the same server as both a NIM server and a distribution server might prolong a service call if use of the NIM server is required. For example, the server that normally acts as a distribution server could have a second boot image that allows it to serve as the NIM server. If NIM server services are required for stand-alone diagnostics during a service call, the distribution server must be rebooted to the NIM image before diagnostics can be performed.

Planning for QLogic fabric management applications

Use this information to plan for the QLogic fabric management applications.

Planning the Fabric Manager and Fabric Viewer

Use this information to plan for the Fabric Manager and the Fabric Viewer.

The configuration setting planned in this topic can be recorded in “QLogic Fabric Management work sheets” on page 72.

Most details are available in the Fabric Manager and Fabric Viewer Users Guide from QLogic. This information highlights information from a cluster perspective.

The Fabric Viewer is intended to be used as documented by QLogic.

The Fabric Manager has a few key parameters that can be set up in a specific manner for IBM high-performance computing (HPC) clusters.

The following items are the key planning points for your Fabric Manager in an IBM HPC cluster.

See Figure 8 on page 41 and Figure 9 on page 42 for illustrations of typical fabric management configurations.
v IBM has only qualified the use of a host-based Fabric Manager (HFM). The HFM is typically referred to as host-based subnet manager (HSM), because the subnet manager is considered the most important component of the Fabric Manager.
v The host for HSM is the fabric management server. For more information, see “Planning for the fabric management server” on page 47.
  – The host requires one host channel adapter (HCA) port per subnet to be managed by the subnet manager.
  – If you have more than four subnets in your cluster, you must have two hosts actively servicing your fabrics. To allow for backups, up to four hosts are required as fabric management servers. That is, two hosts act as primaries and two hosts act as backups.
  – If possible, consolidate switch chassis and subnet manager logs to a central location. Since Cluster Systems Management (CSM) can be used as the Systems Management application, the CSM/Management Server (MS) can be the recipient of the remote logs from the switch. You can direct logs from a fabric management server to multiple remote hosts (that is, to Cluster Systems Management/Management Servers (CSM/MSs)). See “Setting up remote logging” on page 95 for the procedure that is used to set up remote logging in the cluster.
v If possible, use backup fabric management servers for HSM.
v At least one unique instance of Fabric Manager is required to manage each subnet.
  – A host-based Fabric Manager instance is associated with a specific HCA and port over which it communicates with the subnet that it manages. For example, if you have four subnets and one fabric management server, it has four instances of subnet manager running on it, one for each subnet. Also, the server must be attached to all four subnets.
  – The Fabric Manager consists of four processes: the subnet manager (SM), the performance manager (PM), baseboard manager (BM) and fabric executive (FE). For more details, see “Fabric manager” on page 20 and the QLogic Fabric Manager Users Guide.
  – It is common practice in the industry for the terms subnet manager and Fabric Manager to be used interchangeably, because the subnet manager performs the most vital role in managing the fabric.


v The HSM license fee is based on the size of the cluster that it covers. See the vendor documentation and Web site for more details.
v The following items are embedded subnet manager (ESM) considerations.
  – IBM is not qualifying the ESM.
  – If you use an ESM, you might experience performance problems and outages if the subnet has more than 64 IBM GX or GX+ HCA ports attached to it. This is because of the limited compute power and memory available to run the ESM in the switch. Also, performance problems can be seen because the IBM GX or GX+ HCAs can present themselves as multiple logical devices because they can be virtualized. For more information, see “IBM GX or GX+ host channel adapters” on page 9. Considering these restrictions, you might want to restrict ESM use to subnets with only one model 9024 switch in them.
  – If you plan to use the ESM, you will need the fabric management server for the Fast Fabric Toolset. For more information, see “Planning the Fast Fabric Toolset” on page 45. If you use ESM, it does not eliminate the need for a fabric management server. The need for a backup fabric management server is not as great, but it is still recommended.
  – You might find it simpler to maintain host-based subnet manager code than ESM code.
  – You must obtain a license for the ESM because it is keyed to the switch chassis serial number.
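The host-count guidance above (one host for up to four subnets, two hosts for more than four, and the same number again as backups) can be expressed as a simple calculation. The following Python sketch is illustrative only: the function name is hypothetical, and extending the rule beyond eight subnets is an assumption that this topic does not confirm.

```python
import math

SUBNETS_PER_HOST = 4  # one Fabric Manager instance per subnet, four per host


def fabric_management_servers(subnet_count, with_backups=True):
    """Return the number of fabric management servers to plan for."""
    if subnet_count < 1:
        raise ValueError("subnet_count must be at least 1")
    primaries = math.ceil(subnet_count / SUBNETS_PER_HOST)
    return primaries * 2 if with_backups else primaries


if __name__ == "__main__":
    # Eight subnets: two primary hosts and two backup hosts.
    print(fabric_management_servers(8))
```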

Figure 8. Typical Fabric Manager configuration on a single fabric management server


The key parameters for which to plan for the Fabric Manager are:

Figure 9. Typical fabric management server configuration with eight subnets

42 Power Systems: Clustering with high-performance computing by using InfiniBand hardware

Page 53: Power Systems: Clustering with high-performance computing by … · Power Systems: Clustering with high-performance computing by using InfiniBand hardware

Note: A parameter might apply to only a certain component of the Fabric Manager. Otherwise, you must specify that parameter for each component of each instance of the Fabric Manager on the fabric management server. Components of the Fabric Manager are the subnet manager, the performance manager, the baseboard manager, and the fabric executive.

v Plan a global identifier (GID) prefix for each subnet. Each subnet requires a different GID prefix, which is set by the subnet manager. The default is 0xfe80000000000000. This is for the subnet manager only.
v Plan for LMC = 2, which allows for four local identifiers (LIDs). This is important for IBM message-passing-interface (MPI) performance. This is for the subnet manager only.
v For each fabric management server, plan which instance of the Fabric Manager is used to manage each subnet. Instances are numbered from 0 to 3 on a single fabric management server. For example, if a single fabric management server is managing four subnets, you typically have instance 0 manage the first subnet, instance 1 manage the second subnet, and so on. All components under a particular Fabric Manager instance are referenced by using the same instance number. For example, Fabric Manager instance 0 has SM_0, PM_0, BM_0, and FE_0.

v For each fabric management server, plan which HCA and HCA port connects with which subnet. You need this information to point each Fabric Manager instance to the correct HCA and HCA port so that it manages the correct subnet. The HCA and port are specified individually for each component, but they can be the same for every component in an instance of the Fabric Manager. Otherwise, you could have the SM component of Fabric Manager instance 0 managing one subnet and the PM component of Fabric Manager instance 0 managing another subnet, which makes it difficult to understand how things are set up. Typically, instance 0 manages the first subnet, which is on the first port of the first HCA; instance 1 manages the second subnet, which is on the second port of the first HCA; instance 2 manages the third subnet, which is on the first port of the second HCA; and instance 3 manages the fourth subnet, which is on the second port of the second HCA.

v Plan for a backup Fabric Manager for each subnet:
– Assign priorities for each subnet manager instance so that you have a master-and-backup takeover scheme. The master has the highest priority number. The lowest priority is 0 and the highest priority is 15. Typically, you have only a single backup for each subnet, so you use priority number 1 for the master and priority number 0 for the backup.
– The priority is specified individually for each component, but it can be the same for every component in an instance. For example: SM_0_priority=1, PM_0_priority=1, BM_0_priority=1, FE_0_priority=1. While it is possible to assign different priorities to different components within the same instance of a Fabric Manager, it would be difficult to track where the master instances are for each component on each subnet.

v Plan for the maximum transfer unit (MTU) by using the rules found in “Planning for maximum transfer units (MTUs)” on page 34. This is for the subnet manager only.
v There are other parameters that can be configured for the subnet manager. However, the defaults are typically chosen for the subnet manager. Further details can be found in the QLogic Fabric Manager Users Guide.
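The typical layout described in the preceding list can be tabulated before any configuration file is edited. The following Python sketch is illustrative only. The function name, the record layout, and the assumption of two-port HCAs are not part of the Fabric Manager. The GID prefixes are derived by adding the subnet number to the default prefix, which is one simple way to keep them distinct.

```python
DEFAULT_GID_PREFIX = 0xFE80000000000000  # default prefix named in the text
MAX_INSTANCES = 4                        # instances are numbered 0 to 3
PORTS_PER_HCA = 2                        # assumption: two-port HCAs


def plan_instances(num_subnets, priority=1, lmc=2):
    """Return one planning record per Fabric Manager instance."""
    if not 1 <= num_subnets <= MAX_INSTANCES:
        raise ValueError("a fabric management server manages 1 to 4 subnets")
    if not 0 <= priority <= 15:
        raise ValueError("priority must be between 0 and 15")
    plan = []
    for instance in range(num_subnets):
        plan.append({
            "instance": instance,
            "device": instance // PORTS_PER_HCA,   # HCA number, counted from 0
            "port": instance % PORTS_PER_HCA + 1,  # port number, counted from 1
            "gid_prefix": hex(DEFAULT_GID_PREFIX + instance + 1),
            "priority": priority,                  # same for SM, PM, BM, and FE
            "lmc": lmc,
            "lids": 2 ** lmc,                      # LMC = 2 allows four LIDs
        })
    return plan
```

Calling plan_instances(4) yields the mapping described above: instances 0 and 1 on the two ports of the first HCA, and instances 2 and 3 on the two ports of the second HCA.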

Example: Setting up of host-based fabric manager

The following examples show entries from an iview_fm.config file on a fabric management server that manages four subnets, where the Fabric Manager is the primary one, as indicated by priority=1. These entries are found throughout the file in the startup section and in each of the manager sections in each of the instance sections. In this case, instance 0 manages subnet 1, instance 1 manages subnet 2, instance 2 manages subnet 3, and instance 3 manages subnet 4.

Note: Comments in this example are not found in the example file. They are shown here to help clarify where in the file you would find these entries.

# Start up configuration
BM_0_start=yes
FE_0_start=yes
PM_0_start=yes
SM_0_start=yes
BM_1_start=yes
FE_1_start=yes
PM_1_start=yes
SM_1_start=yes
BM_2_start=yes
FE_2_start=yes
PM_2_start=yes
SM_2_start=yes
BM_3_start=yes
FE_3_start=yes
PM_3_start=yes
SM_3_start=yes

#Instance 0
SM_0_device=0
SM_0_port=1
SM_0_priority=1
SM_0_lmc=2
# MTU = 4KB
SM_0_def_mc_mtu=0x5
# rate is DDR
SM_0_def_mc_rate=0x6
SM_0_gidprefix=0xfe80000000000001
SM_0_node_appearance_msg_thresh=10

PM_0_device=0
PM_0_port=1
PM_0_priority=1
BM_0_device=0
BM_0_port=1
BM_0_priority=1
FE_0_device=0
FE_0_port=1
FE_0_priority=1

#Instance 1
SM_1_device=0
SM_1_port=2
SM_1_priority=1
SM_1_lmc=2
# MTU = 4KB
SM_1_def_mc_mtu=0x5
# rate is DDR
SM_1_def_mc_rate=0x6
SM_1_gidprefix=0xfe80000000000002
SM_1_node_appearance_msg_thresh=10

PM_1_device=0
PM_1_port=2
PM_1_priority=1
BM_1_device=0
BM_1_port=2
BM_1_priority=1
FE_1_device=0
FE_1_port=2
FE_1_priority=1

#Instance 2
SM_2_device=1
SM_2_port=1
SM_2_priority=1
SM_2_lmc=2
# MTU = 4KB
SM_2_def_mc_mtu=0x5
# rate is DDR
SM_2_def_mc_rate=0x6
SM_2_gidprefix=0xfe80000000000003
SM_2_node_appearance_msg_thresh=10

PM_2_device=1
PM_2_port=1
PM_2_priority=1
BM_2_device=1
BM_2_port=1
BM_2_priority=1
FE_2_device=1
FE_2_port=1
FE_2_priority=1

#Instance 3
SM_3_device=1
SM_3_port=2
SM_3_priority=1
SM_3_lmc=2
# MTU = 4KB
SM_3_def_mc_mtu=0x5
# rate is DDR
SM_3_def_mc_rate=0x6
SM_3_gidprefix=0xfe80000000000004
SM_3_node_appearance_msg_thresh=10

PM_3_device=1
PM_3_port=2
PM_3_priority=1
BM_3_device=1
BM_3_port=2
BM_3_priority=1
FE_3_device=1
FE_3_port=2
FE_3_priority=1
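Entries like these are easy to get out of step when they are edited by hand. The following Python sketch is one possible consistency check and is not a QLogic tool. It assumes the entries follow the COMPONENT_INSTANCE_parameter=value pattern shown in the example. It reports instances whose components disagree on device, port, or priority, and instances that share a GID prefix.

```python
import re
from collections import defaultdict

# Matches entries such as SM_0_port=1; comments and other lines are skipped.
ENTRY = re.compile(r"^(SM|PM|BM|FE)_(\d+)_(\w+)=(\S+)$")


def check_fm_config(text):
    """Return a list of problems found in iview_fm.config-style entries."""
    values = defaultdict(dict)  # (instance, parameter) -> {component: value}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            component, instance, parameter, value = match.groups()
            values[(int(instance), parameter)][component] = value

    problems = []
    seen_prefixes = {}  # GID prefix -> first instance that uses it
    for (instance, parameter), by_component in sorted(values.items()):
        # All components of one instance should share these settings.
        if parameter in ("device", "port", "priority"):
            if len(set(by_component.values())) > 1:
                problems.append(
                    f"instance {instance}: {parameter} differs: {by_component}")
        # Each subnet requires a different GID prefix.
        if parameter == "gidprefix":
            for prefix in by_component.values():
                key = prefix.lower()
                first = seen_prefixes.setdefault(key, instance)
                if first != instance:
                    problems.append(
                        f"instances {first} and {instance} share GID prefix {prefix}")
    return problems
```

An empty list means that no inconsistency of these two kinds was found. It does not mean that the file is otherwise valid.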

Plan for remote logging of Fabric Manager events:
v Plan to update the /etc/syslog.conf file (or the equivalent syslogd configuration file on your fabric management server) to point syslog entries to the Systems Management Server. To update this file, you need the IP address of the Systems Management Server. Limit the forwarded entries to those that are created by the subnet manager. However, some syslogd applications do not allow such finely tuned forwarding.
For the ESM, the forwarding of log entries is achieved through a command on the switch command-line interface (CLI), or through the Chassis Viewer.
v You are required to set a notice message threshold for each subnet manager instance. This threshold is used to limit the number of notice or higher messages logged by the subnet manager on scans of the network. The suggested limit is 10. Generally, if the number of notice messages is greater than 10, the user is probably rebooting nodes or powering on switches again and causing links to go down. For updates on message thresholds, see the IBM Clusters with the InfiniBand Switch Web site.

Planning the Fast Fabric Toolset

The Fast Fabric Toolset provides reporting and health check tools that are important for managing and monitoring the fabric.

The configuration settings planned in this topic can be recorded in “QLogic Fabric Management worksheets” on page 72.

In-depth information about the Fast Fabric Toolset can be found in the Fast Fabric Toolset Users Guide available from QLogic.

The following items are the key things to remember when setting up the Fast Fabric Toolset in an IBM high-performance computing (HPC) cluster.

There are several installation requirements, including the following list:
v The Fast Fabric Toolset requires you to install the QLogic InfiniServ host stack, which is part of the Fast Fabric Toolset bundle.


v The Fast Fabric Toolset must be installed on each fabric management server, including backups. See “Planning for the fabric management server” on page 47.

The Fast Fabric Toolset configuration must be set up in its configuration files. The default configuration files are documented in the Fast Fabric Toolset. The following list indicates key parameters to be configured in Fast Fabric Toolset configuration files.
v Switch addresses go into chassis files.
v Fabric management server addresses go into host files.
v The IBM system addresses do not go into host files.
v Create groups of switches by creating a different chassis file for each group. Some suggestions are:
– A group of all switches, because they are all accessible on the service virtual local area network (VLAN)
– Groups that contain switches for each subnet
– A group that contains all switches with embedded subnet manager (ESM) (if applicable)
– A group that contains all switches running primary ESM (if applicable)
– Groups for each subnet that contain the switches running ESM in the subnet (if applicable) - include primary and backup fabric management servers
v Create groups of fabric management servers by creating a different host file for each group. Some suggestions are:
– A group of all fabric management servers, because they are all accessible on the service VLAN.
– A group of all primary fabric management servers.
– A group of all backup fabric management servers.
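The grouping suggestions above amount to writing several small address lists. The following Python sketch shows one way to generate them. It assumes that a chassis or host file is a plain list with one address per line; confirm the format and file locations in the Fast Fabric Toolset Users Guide. The file names and addresses are hypothetical, and the function writes to a directory that you choose rather than to the Fast Fabric configuration directory.

```python
from pathlib import Path


def write_group_files(directory, groups):
    """Write one file per group, one address per line; return the paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for name, addresses in groups.items():
        path = directory / name
        path.write_text("".join(f"{address}\n" for address in addresses))
        written.append(path)
    return written


# Hypothetical group names and addresses (documentation address range).
chassis_groups = {
    "chassis.all": ["192.0.2.11", "192.0.2.12"],
    "chassis.subnet1": ["192.0.2.11"],
    "chassis.subnet2": ["192.0.2.12"],
}
host_groups = {
    "hosts.all": ["192.0.2.21", "192.0.2.22"],
    "hosts.primary": ["192.0.2.21"],
    "hosts.backup": ["192.0.2.22"],
}
```

Generating the files from one data structure keeps the group that lists all switches and the per-subnet groups from drifting apart.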

There are some tools that are not applicable to an IBM clustering environment:
v You cannot use the message-passing-interface (MPI) performance tests because they are not compiled for the host stack on the IBM HPC clusters.
v High-performance Linpack (HPL) is not applicable.

Other items to remember include the following list:
v The Fast Fabric tools that rely on the InfiniBand interfaces to collect and report data can only work with subnets to which their server is attached. Therefore, if you require more than one primary fabric management server because you have more than four subnets, you need to run two different instances of the Fast Fabric Toolset on two different servers to query the state of all the subnets.
v The Fast Fabric Toolset is used to interface with the following hardware:
– Switches
– Fabric management server hosts
– Vendor systems
v To use Cluster Systems Management (CSM) for remote command access to the Fast Fabric Toolset, you must set up the host that is running Fast Fabric as a device. You can exchange Secure Shell (SSH) keys with it for access without the need for a password.
v The master node referred to in the Fast Fabric Toolset Users Guide is considered to be the host that is running the Fast Fabric Toolset. In IBM System p or IBM Power Systems HPC clusters, this host is not a compute server or I/O node, but is generally the fabric management server.
v Plan an interval at which to run Fast Fabric Toolset health checks. Because health checks use fabric resources, do not run them so frequently that they cause performance problems. Use the recommendation given in the Fast Fabric Toolset Users Guide.
v You need to configure the Fast Fabric Toolset health checks to use either the hostsm_analysis tools for host-based fabric management or the esm_analysis tools for embedded fabric management.
v If you are using host-based fabric management, you must configure the Fast Fabric Toolset to access all the fabric management servers running Fast Fabric.


v If you set up an environment that requires passwords for SSH between the fabric management server and the switches, you must set up the fastfabric.conf file with the switch chassis passwords.

Planning for the fabric management server

With QLogic switches, the fabric management server is required to run the Fast Fabric Toolset, which is used for managing and monitoring the InfiniBand network.

The configuration settings planned in this topic can be recorded in the “QLogic Fabric Management worksheets” on page 72.

Furthermore, with QLogic switches, unless you have a small cluster, you should use the host-based Fabric Manager, which runs on the fabric management server. The fabric management server has the following requirements:

Hardware requirements:
v The System x 3550 is one unit high and supports two PCI Express (PCIe) slots. It can support a total of four subnets.
v The System x 3650 is two units high and supports four PCIe slots. At this time, QLogic subnet management only supports four subnets on a server. Therefore, the advantage of the 3650 over the 3550 is the processing power for scaling to large clusters, or for sharing the fabric management server with other functions.
v Plan sufficient rack space for the fabric management server. If space is available, the fabric management server should be placed in the same rack with other management consoles such as the Hardware Management Console (HMC) or the Cluster Systems Management/Management Server (CSM/MS).

Operating system and software requirements:
v The Linux SLES 10 operating system is required.
v You need the QLogic Fast Fabric Toolset bundle, which includes the QLogic host stack. For more information, see “Planning the Fast Fabric Toolset” on page 45.

The number of fabric management servers is determined by the following parameters:
v Up to four subnets can be managed from each fabric management server.
v One backup fabric management server must be available for each primary fabric management server.
v For up to four subnets, a total of two fabric management servers must be available, one primary and one backup.
v For up to eight subnets, a total of four fabric management servers must be available, two primaries and two backups.
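The sizing rules above reduce to a short calculation. The following Python sketch is only an illustration of that arithmetic, and the function name is hypothetical. It applies the limit of four subnets per server and one backup for each primary.

```python
import math

SUBNETS_PER_SERVER = 4  # limit stated in the text


def fabric_management_servers(num_subnets):
    """Return (primaries, backups, total) for the given number of subnets."""
    if num_subnets < 1:
        raise ValueError("at least one subnet is required")
    primaries = math.ceil(num_subnets / SUBNETS_PER_SERVER)
    backups = primaries  # one backup for each primary
    return primaries, backups, primaries + backups
```

For example, four subnets give one primary and one backup, and any count from five to eight subnets gives two primaries and two backups.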

Cluster Systems Management (CSM) event management must be used in a cluster. To use CSM event management, plan for the following requirements:
v The type of syslogd command that you use. At a minimum, you need to understand the default syslogd command that comes with the operating system on which CSM is run.
v Whether to use Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) as the protocol for transferring the syslog entries from the fabric management server to the CSM/MS. Use TCP for better reliability.
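After you choose a protocol, it can be useful to send a test entry to confirm that the path to the CSM/MS works. The following Python sketch uses the standard library logging.handlers.SysLogHandler for that purpose. It is not part of CSM or the Fabric Manager. The function name and logger name are hypothetical, port 514 is assumed, and the syslog daemon on the receiving server must already be configured to accept remote entries over the chosen protocol.

```python
import logging
import logging.handlers
import socket


def make_syslog_logger(host, port=514, use_tcp=True, name="fabric-syslog-test"):
    """Return a logger that forwards records to a remote syslog daemon."""
    # SOCK_STREAM selects TCP; SOCK_DGRAM selects UDP.
    socktype = socket.SOCK_STREAM if use_tcp else socket.SOCK_DGRAM
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socktype)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

With TCP, the handler connects when it is created, so an unreachable server is reported immediately. With UDP, a lost entry is not reported to the sender, which is the reliability difference noted above.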

Other requirements:
v You need one QLogic host channel adapter (HCA) for every two subnets to be managed by the server, with a maximum of four subnets.
v You need the QLogic host-based Fabric Manager. For more information, see “Planning the Fabric Manager and Fabric Viewer” on page 40.


v For any given group of subnets, there must be a backup fabric management server that has a configuration symmetrical to that of the primary fabric management server. This means that each HCA device number and port on the backup must be attached to the same subnet as the corresponding HCA device number and port on the primary fabric management server.
v Designate a single fabric management server to be the primary data collection point for fabric diagnosis data.
v To start remote commands to the fabric management server, you must know how to exchange Secure Shell (SSH) keys between the fabric management server and the CSM/MS. This is standard OpenSSH protocol setup as done in either the AIX or Linux operating system.
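The symmetry requirement for primary and backup servers can be checked on paper or with a few lines of code once the cabling plan is recorded. The following Python sketch is illustrative only. The function name and the subnet labels are hypothetical, and each server is described by a mapping from (HCA device number, port number) to the subnet that the port is attached to.

```python
def asymmetric_ports(primary, backup):
    """Return (device-port, primary subnet, backup subnet) for each mismatch.

    Each argument maps (HCA device number, port number) to a subnet label.
    A port that is cabled on only one server is reported with None
    for the other server.
    """
    mismatches = []
    for key in sorted(set(primary) | set(backup)):
        if primary.get(key) != backup.get(key):
            mismatches.append((key, primary.get(key), backup.get(key)))
    return mismatches


# Example plan: device 0, ports 1 and 2, on both servers.
primary_plan = {(0, 1): "subnet1", (0, 2): "subnet2"}
backup_plan = {(0, 1): "subnet1", (0, 2): "subnet2"}
```

An empty result means that every device and port pair is attached to the same subnet on both servers.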

In addition to planning for requirements, see “Planning the Fast Fabric Toolset” on page 45 for information about creating host groups for fabric management servers. These are used to set up configuration files for hosts for Fast Fabric tools.

Planning event monitoring with QLogic and CSM

Event monitoring for fabrics that use QLogic switches can be done with a combination of remote syslog commands and Cluster Systems Management (CSM) event management. Use this information to plan event monitoring of fabrics that use QLogic switches.

The configuration settings planned in this topic can be recorded in the “Cluster Systems Management planning work sheet” on page 77.

The result of event management is the ability to forward switch and fabric management logs to a single log file on the CSM/MS in the typical event management log directory (/var/log/csm/errorlog) with messages in the audit log. You can also use the included response script to “wall” log entries to the Cluster Systems Management/Management Server (CSM/MS) console. Finally, you can use the Reliable Scalable Cluster Technology (RSCT) event sensor and condition-response infrastructure to write your own response scripts to react to fabric log entries in the form that you want. For example, you could e-mail the log entries to an account.

For event monitoring to work between the QLogic switches and Fabric Manager and CSM event monitoring, the switches, the CSM/MS, and the fabric management server that is running the host-based Fabric Manager must all be on the same virtual local area network (VLAN). The cluster VLAN can be used.

To plan for event monitoring, complete the following items:
v For more information about the event monitoring infrastructure, review the CSM Administration Guide, CSM Planning and Installation Guide, as well as the RSCT Administration Guide.
v Plan for the CSM/MS IP address, so that you can point the switches and Fabric Manager to log there remotely.
v Plan for the CSM/MS operating system, so that you know which syslog sensor and condition to use. One of the following sensors and conditions can be used:
– For CSM running on the AIX operating system, the sensor is AIXSyslogSensor and the condition is the LocalAIXNodeSyslog command (generated from the AIXNodeSyslog command, but with local scope). You can use AIXNodeSyslog for the condition if the CSM/MS is configured as a managed node, but you might find it easier to manage the cluster if the CSM/MS is not a managed node.
– For CSM running on the Linux operating system, the sensor is ErrorLogSensor and the condition is the LocalNodeAnyLoggedError command (generated from the AnyNodeAnyLoggedError command). You can use AnyNodeAnyLoggedError for the condition if the CSM/MS is configured as a managed node, but you might find it easier to manage the cluster if the CSM/MS is not a managed node.
v To determine which response scripts to use, evaluate the following options:
– Use of the LogNodeErrorLogEntry command is the minimum requirement to combine log entries to a /var/log/csm/errorlog file.


– Use the BroadcastEventsAnyTime command to broadcast events to the CSM/MS console. This results in many broadcasts during reboot scenarios. Therefore, if you use it, establish a procedure to disable it when you know that you are doing operations such as shutting down all the servers in a cluster.
– Consider creating response scripts that are specialized to your environment. For example, you might want to send log entries to an e-mail account. See RSCT and CSM documentation for how to create such scripts and where to find the response scripts associated with the LogNodeErrorLogEntry command and the BroadcastEventsAnyTime command, which can be used as examples.
v Plan regular monitoring of the file system that contains /var on the CSM/MS to ensure that it does not get overrun.
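Regular monitoring of the file system that contains /var can be as simple as a scheduled script. The following Python sketch uses the standard library shutil.disk_usage function. It is not part of CSM or RSCT. The function names are hypothetical, and the 80 percent threshold is an arbitrary example value.

```python
import shutil


def usage_percent(path):
    """Return the used percentage of the file system that holds path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_attention(path="/var", threshold=80.0):
    """Return True when usage meets or exceeds the threshold percentage."""
    return usage_percent(path) >= threshold
```

A scheduled job on the CSM/MS could call needs_attention() and notify an administrator when it returns True, before the consolidated log fills the file system.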

Planning to run remote commands with QLogic from the CSM/MS

Remote commands can be started from the Cluster Systems Management/Management Server (CSM/MS) by using the dsh command to the fabric management server and the switches.

The configuration settings planned in this topic can be recorded in the “Cluster Systems Management planning work sheet” on page 77.

Running remote commands is an important addition to the management infrastructure because it effectively integrates the QLogic management environment with the IBM management environment.

The following are some of the benefits of running remote commands:
v You can do manual queries from the CSM/MS console without logging in to the fabric management server or switch.
v You can write management and monitoring scripts that run from the CSM/MS, which can improve productivity for administration of the cluster fabric. For example, you can write scripts to act on nodes based on fabric activity, or to act on the fabric based on node activity.
v You can capture data across multiple fabric management servers or switches simultaneously.

Consider the following items when you plan for remote command processing.
v CSM must be installed.
v The fabric management server and switch addresses are used.
v The fabric management server and switches are created as devices. However, it is also possible to create the fabric management server as a node.
v Device attributes for the fabric management server are:
– DeviceType equals FabricMS
– RemoteShellUser equals [USERID] (root is suggested)
– RemoteShell equals /usr/bin/ssh
– RemoteCopyCmd equals /usr/bin/scp
v Device attributes for the switch are:
– DeviceType equals IBSwitch::Qlogic
– RemoteShellUser equals admin
– RemoteShell equals /usr/bin/ssh
v Device groups might be considered for:
– All the fabric management servers
– All primary fabric management servers
– All the switches
– A separate subnet group for all the switches on a subnet
v You can exchange Secure Shell (SSH) keys between the CSM/MS and the switches and fabric management server.


v For more secure installations, you might plan to disable Telnet on the switches and on the fabric management server.

Frame planning

After reviewing the server, fabric device, and the management subsystem information, you can plan for the frames in which to place all the devices.

You can record the frame information in the “Frame and rack planning work sheet” on page 63.

Planning installation flow

Use this information to learn about the key installation points, the organizations responsible for installation, the installation responsibilities for units and devices, and the order that components are installed. Use the installation coordination work sheets to record the information you planned in these topics.

Key installation points

When you are coordinating the installation of the many systems, networks, and devices in a cluster, several factors help ensure a successful installation.

Key factors for a successful installation include the following items:
v The order of the installation of physical units is important. While units might be placed physically on the data center floor in any order after the site is ready, there is a specific order for how they are cabled, powered on, and recognized on the service subsystem.
v The types of units and contractual agreements affect the composition of the installation team. The team can be composed of customers, IBM personnel, or third-party vendor personnel. For more guidance on installation responsibilities, see “Installation responsibilities of units and devices” on page 51.
v If you have 12X host channel adapters (HCAs) and 4X switches, the switches must be powered on and configured with proper 12X groupings before servers are powered on. The order of port configuration on 4X switches that are configured with groups of three ports acting as a 12X link is important; therefore, specific steps must be followed to ensure that the 12X HCA is connected as a 12X link and not as a 4X link.
v All switches must be connected to the same service virtual local area network (VLAN). If redundant connections are available on a switch, they must also be connected to the same service VLAN. This is required because of the IP addressing methods used in the switches.

Installation responsibilities by organization

Use this information to find who is responsible for various aspects of the installation.

Within a cluster that has an InfiniBand network, different organizations are responsible for installation activities. The following lists provide information about responsibilities for a typical installation. However, it is possible for the specific responsibilities to change because of agreements between the customer and the supporting hardware teams.

Note: Given the complexity of typical cluster installations, trained and authorized installers must be used.

Customer responsibilities follow:
v Set up the management consoles (Hardware Management Console (HMC) and Cluster Systems Management/Management Servers (CSM/MS))
v Install customer setup units (according to server model)
v Update system firmware
v Update InfiniBand switch software including Fabric Management software
v If applicable, install and customize the fabric management server including:
– The connection to the service virtual local area network (VLAN)
– Required vendor host stack
– If applicable, the QLogic Fast Fabric Toolset
v Customize InfiniBand network configuration
v Customize host channel adapter (HCA) partitioning and configuration
v Verify the InfiniBand network topology and operation

IBM responsibilities follow:
v Install and service the IBM-installable servers, adapters, and HCAs, including the 9125-F2A systems
v Verify server operation for IBM-installable servers

Third-party vendor responsibilities follow:

Note: This information does not detail the contractual possibilities for third-party responsibilities. By contract, the customer might be responsible for some of these activities. Note the customer name or contracted vendor when planning these activities so that you can better coordinate all the activities of the installers.
v Install switches
v Set up the service VLAN IP and attach switches to the service VLAN
v Cable the InfiniBand network
v Verify switch operation through status and LED queries

Installation responsibilities of units and devices

Use this information to determine who is responsible for the installation of units and devices.

Note: It is possible that a contractual agreement might alter the basic installation responsibilities for particular devices.

Table 26. Hardware to install and who is responsible for the installation

| Hardware to install | Who is responsible for the installation |
|---|---|
| Servers | Unless otherwise contracted, the use of a server in a cluster with an InfiniBand network does not change the normal installation and service responsibilities for it. There are some servers that are installed by IBM and others that are installed by the customer. See the specific server information to determine who is responsible for the installation. |
| Hardware Management Console (HMC) | The type of servers attached to the HMCs determines who installs them. See the HMC documentation to determine who is responsible for the installation. This task is typically performed by the customer or IBM service. |
| Cluster Systems Management (CSM) | CSM is the set of systems management tools. It can also be used as a centralized source for device discovery in the cluster. The customer is responsible for CSM installation and customization. |
| InfiniBand switches | The switch manufacturer or its designee (IBM business partner) or another contracted organization is responsible for installing the switches. |
| Switch network cabling | The customer must work with the switch manufacturer, designee, or another contracted organization to determine who is responsible for installing the switch network cabling. However, if a cable with an IBM part number fails, IBM service is responsible for servicing the cable. |
| Service VLAN Ethernet devices | Ethernet switches or routers that are required for the service virtual local area network (VLAN) are the responsibility of the customer. |
| VLAN cabling | The organization responsible for the installation of a device is responsible for connecting it to the service VLAN. |
| Fabric Manager software | The customer is responsible for updating the Fabric Manager software on the switch or for updating the fabric management server. |
| Fabric Manager server | The customer is responsible for installing, customizing, and updating the fabric management server. |
| QLogic Fast Fabric Toolset and host stack | The customer is responsible for installing, customizing, and updating the QLogic Fast Fabric Toolset and the host stack on the fabric management server. |

Order of installation

Use this information to learn the tasks required for installing a new cluster.

The “Installation coordination work sheet” on page 57 includes a sample work sheet to help you coordinate tasks among installation teams and members.

This information provides a high-level outline of the general tasks required to install a new cluster. If you understand the full installation flow of a new cluster, you can identify the tasks that can be performed when you expand your InfiniBand cluster network. Tasks such as adding InfiniBand hardware to an existing cluster, adding host channel adapters (HCAs) to an existing InfiniBand network, and adding a subnet to an existing network are described. To complete a cluster installation, all devices and units must be available before you begin installing the cluster.

The following are the fundamental tasks that are required for installing a cluster:
1. The site is set up with power, cooling, floor space requirements, and floor load requirements.
2. The switches and processing units are installed and configured.
3. The management subsystem is installed and configured.
4. The units are cabled and connected to the service virtual local area network (VLAN).
5. The units are verified and discovered on the service VLAN.
6. The basic unit operation is verified.
7. The cabling for the InfiniBand network is connected.
8. The InfiniBand network topology and operation are verified.

Figure 10 on page 54 shows a breakdown of the tasks by major subsystem. The following list illustrates the preferred order of installation by major subsystem. The order minimizes potential problems with having to perform recovery operations as you install, and also minimizes the number of reboots of devices during the installation.
1. Management consoles and the service VLAN
Note: Management consoles include the Hardware Management Console (HMC) and the servers running Cluster Systems Management (CSM), as well as a fabric management server.
2. Servers in the cluster
3. Switches
4. Switch cable installation

By breaking down the installation by major subsystem, you can see how to install the units in parallel, or how you might be able to perform some installation tasks for on-site units while waiting for other units to be delivered.


It is important that you recognize the key points in the installation where you cannot proceed with one subsystem's installation task before completing the installation tasks in the other subsystem. These are called merge points, and are illustrated by using the inverted triangle symbol in Figure 10 on page 54.

The following items are some of the key merge points.

1. The management consoles must be installed and configured before starting to cable the service VLAN.

This allows proper Dynamic Host Configuration Protocol (DHCP) management of the IP addressing on the service VLAN. Otherwise, the addressing might be compromised. This is not as critical for the fabric management server. However, the fabric management server must be operational before the switches are started on the network.

2. You must power on the InfiniBand switches and configure their IP addresses before connecting them to the service VLAN. If this is not done, you must power them on individually and change their addresses by logging in to each of them by using their default address.

3. If you have 12X host channel adapters (HCAs) connected to 4X switches, you must power on switches and cable them to their ports and configure the 12X groupings before attaching cables to HCAs in servers that have been powered on to standby mode or beyond. This action allows automatic negotiation to the 12X adapters by the HMCs to occur smoothly. When powering up the switches, it is not guaranteed that the ports will become operational in an order that makes the link appear as 12X to the HCA. Therefore, you must be sure that the switch is correctly cabled, configured, and ready to negotiate to the 12X adapters before starting the adapters.

4. To fully verify the InfiniBand network, the servers must be fully installed in order to send data and run tools that are required to verify the network. The servers must be powered on to standby mode for topology verification.

Note: With QLogic switches, you can use the Fast Fabric Toolset to verify topology. Alternatively, you can use the Chassis Viewer and Fabric Viewer.

Important: In each task box of Figure 10, there is also an index letter and number. These indexes indicate the major subsystem installation tasks, and you can use them to cross-reference to the following descriptions.

Figure 10. High-level cluster installation flow

The task indexes are listed before each of the following major subsystem installation items:

U1: Set up the site for power and cooling, including proper floor cutouts for cable routing.

M1, S1, W1: Place units and frames in their correct positions on the data center floor. This includes, but is not limited to, HMCs, CSM Management Servers, fabric management servers, cluster servers (with HCAs, I/O devices, and storage devices), and InfiniBand switches. You can physically place units on the floor as they arrive. However, do not apply power or cable units to the service VLAN or to the InfiniBand network until instructed to do so.

Management console installation steps M2 - M4 have multiple tasks associated with each of them. Review the details in “Installing and configuring the management subsystem” on page 83 to determine where you can assign different people to those tasks that can be performed simultaneously.

M2: Perform the initial management console installation and configuration. This includes HMCs, CSM, fabric management server, and DHCP service for the service VLAN.
• Plan and set up static addresses for HMCs and switches.
• Plan and set up DHCP ranges for each service VLAN.

Important:

• If these devices and associated services are not set up correctly before applying power to the base servers and devices, you might not be able to correctly configure and control cluster devices. Furthermore, if this setup is done out of sequence, the recovery procedures for doing this part of the cluster installation can be lengthy.

• When a cluster requires multiple HMCs, CSM is required to help manage device discovery. In this case, the setup of CSM and the peer domains on the Cluster-Ready Hardware Server are critical to achieving correct cluster device discovery. It is also important to have a central DHCP server, which must be on the same server as CSM.

M3: Connect server hardware control points to the service VLAN as instructed by server installation documentation. The location of the connection is dependent on the server model and might involve a connection to the bulk power controllers (BPCs) or might be directly attached to the service processor. Do not attach switches to the cluster VLAN at this time.

Also, attach the management consoles to the service and cluster VLANs.

Note: Switch IP addressing must be static. Each switch comes up with the same default address; therefore, you must set the switch address before it is added to the service VLAN. Otherwise, you must bring the switches one at a time onto the service VLAN and assign a new IP address before bringing the next switch onto the service VLAN.

M4: Do the portion of final management console installation and configuration that involves assigning or acquiring servers to their managing HMCs and authenticating frames and servers through Cluster-Ready Hardware Server (CRHS). This action is only required when you are using CSM and CRHS.

Note: The double arrow between M4 and S3 indicates that these two tasks cannot be completed independently. As the server installation portion of the flow is completed, then the management console configuration can be completed.

Set up remote logging and remote command processing and verify these operations.

When M4 is complete, the bulk power assemblies (BPAs) and cluster service processors must be at power standby state. To be at the power standby state, the power cables for each server must be connected to the appropriate power source. Prerequisites for M4 are M3, S2, and W3; the corequisite for M4 is S3.

The following server installation and configuration operations (S2 - S7) can be performed sequentially after step M3 is performed.

M3 This is in the management subsystem installation flow (left column of Figure 10 on page 54), but the tasks are associated with the servers. Attach the cluster server service processors and BPAs to the service VLAN. This task must be done before connecting power to the servers, and after the management consoles are configured, so that the cluster servers can be discovered correctly.

S2 To bring the cluster servers to the power standby state, connect the servers in the cluster to their appropriate power sources. Prerequisites for S2 are M3 and S1.

S3 Verify that the management consoles discovered the cluster servers.

S4 Update the system firmware.

S5 Verify the system operation. Use the server installation information to verify that the system is operational.

S6 Customize logical partition and HCA configurations.

S7 Load and update the operating system.

Complete the following switch installation and configuration tasks W2 - W6.

W2 Power on and configure the IP address of the switch Ethernet connections. This must be done before attaching the switch to the service VLAN.

W3 Connect switches to the cluster VLAN. If there is more than one VLAN, all switches must be attached to a single cluster VLAN, and all redundant switch Ethernet connections must be attached to the same network. Prerequisites for W3 are M3 and W2.

W4 Verify discovery of the switches.

W5 Update the switch software.

W6 Customize InfiniBand network configuration.

Complete C1 - C4 for cabling the InfiniBand network.

Note: It is possible to cable and start networks other than the InfiniBand networks before cabling and starting the InfiniBand network.

Important: When you attach InfiniBand cables between switches and HCAs, connect the cable to the switch end first.

C1 Route cables and attach cable ends to the switch ports. Apply labels at this time.

C2 If 12X HCAs are connecting to 4X switches and the links are being configured to run at 12X instead of 4X, the switch ports must be configured in groups of three 4X ports to act as a single 12X link. If you are configuring links at 12X, go to C3. Otherwise, go to C4.

Prerequisites for C2 are W2 and C1.

C3 Configure 12X groupings on switches. This must be done before attaching HCA ports. Ensure that switches remain powered on before attaching HCA ports.

The prerequisite is a Yes to decision point C2.

C4 Attach the InfiniBand cable ends to the HCA ports.

The prerequisite is either a No decision in C2 or, if the decision in C2 was Yes, completion of C3.
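The grouping rule in C2 (three 4X ports act as one 12X link) can be planned on paper before any switch is configured. The following Python sketch only illustrates the arithmetic. The function name `group_ports_for_12x` is hypothetical, and the sketch assumes that consecutively numbered ports are grouped together; the ports that a given switch model actually allows in one grouping must be taken from the switch vendor documentation.

```python
def group_ports_for_12x(ports, group_size=3):
    """Partition 4X port numbers into groups that each act as one 12X link.

    Assumption: consecutive ports form a group. Real switches may restrict
    which ports can be grouped, so treat the result as a planning aid only.
    """
    if len(ports) % group_size != 0:
        raise ValueError("port count must be a multiple of the group size")
    return [tuple(ports[i:i + group_size])
            for i in range(0, len(ports), group_size)]


# A 24-port 4X switch yields eight 12X links under this assumption.
groups = group_ports_for_12x(list(range(1, 25)))
print(len(groups))   # 8
print(groups[0])     # (1, 2, 3)
```

The resulting groups can be copied into the switch planning work sheet so that the installer who performs C3 and the installer who performs C4 work from the same port list.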

Complete V1 - V3 to verify the cluster networking topology and operation.

V1 This task involves checking the topology by using QLogic Fast Fabric tools. There might be alternative methods for checking the topology. Prerequisites for V1 are M4, S7, W6, and C4.

V2 You must also check for serviceable events reported to the HMC. Furthermore, an all-to-all ping is suggested to test the InfiniBand network before putting the cluster into operation. A vendor might have an alternative method for verifying network operation. However, you should consult the HMC, and resolve any open serviceable events. If a vendor has discovered and resolved a serviceable event, then the serviceable event must be closed. The prerequisite for V2 is V1.

V3 You might have to contact service numbers to resolve problems after service representatives leave the site.
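The prerequisites listed for the task indexes form a dependency graph, so one valid installation order can be derived mechanically. The following Python sketch (it needs Python 3.9 or later for `graphlib`) encodes the prerequisites that are stated in the task descriptions. It also makes two assumptions that the text does not state explicitly: steps within one subsystem run in index order, and U1 precedes the first step of every subsystem. The 12X decision at C2 is simplified by always including C3.

```python
from graphlib import TopologicalSorter

# Prerequisites stated in the task descriptions (task -> tasks it needs).
stated = {
    "S2": {"M3", "S1"},
    "M4": {"M3", "S2", "W3"},
    "W3": {"M3", "W2"},
    "C2": {"W2", "C1"},
    "C3": {"C2"},
    "C4": {"C2"},
    "V1": {"M4", "S7", "W6", "C4"},
    "V2": {"V1"},
}

# Assumption: steps inside a subsystem run in index order, after U1.
chains = [
    ["U1", "M1", "M2", "M3", "M4"],
    ["U1", "S1", "S2", "S3", "S4", "S5", "S6", "S7"],
    ["U1", "W1", "W2", "W3", "W4", "W5", "W6"],
    ["C1", "C2", "C3", "C4"],
    ["V1", "V2", "V3"],
]


def build_graph(stated, chains):
    """Merge stated prerequisites with the assumed in-order chains."""
    graph = {task: set(deps) for task, deps in stated.items()}
    for chain in chains:
        for earlier, later in zip(chain, chain[1:]):
            graph.setdefault(later, set()).add(earlier)
    return graph


graph = build_graph(stated, chains)
# static_order() raises graphlib.CycleError if the plan contradicts itself.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Any order that the sorter returns respects the merge points; tasks that do not depend on each other can still be assigned to different installers and run in parallel.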

Related concepts

“Managing serviceable events on the HMC” on page 23

Installation coordination work sheet

Use this work sheet to coordinate installation tasks.

Each organization can use a separate installation work sheet and the work sheet should be completed by using the flow shown in Figure 10 on page 54.

It is good practice to let each individual and team that participates in the installation review the coordination work sheet ahead of time so that they are aware of their dependencies on other installers.

Management console installation steps M2 – M4 from Figure 10 on page 54 have multiple tasks associated with each of them. You can also review the details for the tasks in Figure 11 on page 85 and see where you can assign different individuals to those tasks that can be performed simultaneously.

Table 27. Example: Installation coordination work sheet

Organization:

Task | Task description | Prerequisite tasks | Scheduled date | Completed date

Table 28 is an example of a completed installation coordination work sheet.

Table 28. Example: Completed installation coordination work sheet

Organization:

Task | Task description | Prerequisite tasks | Scheduled date | Completed date

S1 Place model servers on the floor 18Aug2009

M3 Cable the model servers and BPAs to the service VLAN 18Aug2009

S2 Start the model servers 18Aug2009

S3 Verify discovery of the system 18Aug2009

S5 Verify system operation 18Aug2009

Planning for an HPC MPI configuration

Use this information to plan a message passing interface (MPI) configuration for an IBM high-performance computing (HPC) cluster.

The LMC and maximum transfer unit (MTU) settings planned in this topic can be recorded in the “QLogic switch planning work sheets” on page 66, which are used to record switch and subnet manager configuration information.

The following assumptions apply to HPC MPI configurations.
• Proven configurations for an HPC MPI configuration are limited to:
  – Eight subnets for each cluster
  – Up to eight links out of a server
• Servers are preinstalled in frames.
• Servers are shipped with a minimum level of firmware to enable the system to perform an initial program load (IPL) to POWER Hypervisor standby.

Because HPC applications are designed for performance, it is important to configure the InfiniBand network components with performance as a key element. The main consideration is that the local identifier (LID) Mask Control (LMC) field in the switches needs to be set to provide more LIDs for each port than the default of one. This provides more addressability and better opportunity for using available bandwidth in the network. The HPC software provided by IBM works best with an LMC value of 2. The number of LIDs is equal to 2^x, where x is the LMC value. Therefore, the LMC value of 2 that is required for IBM HPC applications results in four LIDs for each port.
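The relationship between the LMC value and the number of LIDs for each port is a power of two, which the following Python sketch makes explicit. The function name `lids_per_port` is hypothetical, and the accepted range of 0 through 7 assumes that LMC is a 3-bit field; verify that limit against the switch and subnet manager documentation.

```python
def lids_per_port(lmc: int) -> int:
    """Return the number of LIDs assigned to each port for an LMC value.

    Assumption: LMC is a 3-bit field, so valid values are 0 through 7.
    """
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be between 0 and 7")
    return 2 ** lmc


print(lids_per_port(0))  # 1 (the default: one LID for each port)
print(lids_per_port(2))  # 4 (the value used for IBM HPC applications)
```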

See “Planning for maximum transfer units (MTUs)” on page 34 for planning the maximum transfer unit (MTU) for communication protocols.

Planning for 12X host channel adapter connections

Use this information for a brief description of host channel adapter (HCA) requirements.

Host channel adapters with 12X capabilities have a 12X connector. Supported switch models only have 4X connectors.

You can use a width exchanger cable that allows you to connect a 12X width HCA connector to a single 4X width switch port. The exchanger cable has a 12X connector on one end and a 4X connector on the other end.

Tips for planning cluster hardware

Use this information to identify tasks that might be part of planning your cluster hardware.

Consider the following tasks when you plan for cluster hardware:
• Determine a convention for frame numbering and slot numbering, where slots are the location of cages as you go from the bottom of the frame to the top. If you have empty space in a frame, reserve a number for that space.
• Determine a convention for switch and system unit naming that includes physical location, including their frame numbers and slot numbers.
• Prepare labels for frames to indicate frame numbers.
• Prepare cable labels for each end of the cables. Indicate the ports to which each end of the cable connects.
• Document where switches and servers are located and which Hardware Management Console (HMC) manages them.
• Print a floor plan and keep it with the HMCs.

Planning check list

The planning check list helps you track your progress through the planning process.

Table 29. Planning check list

Step | Target date | Completed date

Start planning check list

Gather documentation and review planning information for individual units and applications.

Ensure that you have planned for:
• Servers
• I/O devices
• InfiniBand network devices
• Frames or racks for servers, I/O devices and switches, and management servers
• Service virtual local area network (VLAN), including:
  – Hardware Management Consoles (HMCs)
  – Ethernet devices
  – Cluster Systems Management/Management Server (CSM/MS) (for multiple HMC environments)
  – Network Installation Management (NIM) server (for AIX servers that do not have removable media capabilities)
  – Distribution server (for Linux servers that do not have removable media capabilities)
  – Fabric management server
• System management applications (for HMC and CSM)
• Where Fabric Manager will run - host-based subnet manager (HSM) or embedded subnet manager (ESM)
• Fabric management server (for HSM and Fast Fabric Toolset)
• Physical dimension and weight characteristics
• Electrical characteristics
• Cooling characteristics

Ensure that you have the required levels of supported firmware, software, and hardware for your cluster. See “Required level of support, firmware, and devices” on page 30.

Review the cabling and topology documentation for InfiniBand networks that is provided by the switch vendor.

Review “Planning installation flow” on page 50.

Review “Planning for an HPC MPI configuration” on page 58.

Review “Planning for 12X host channel adapter connections” on page 58, if you are using 12X host channel adapters.

Review “Tips for planning cluster hardware” on page 58.

Complete the planning work sheets.

Complete the planning process.

Review readme files and online information that is related to software and firmware to ensure that you have up-to-date information and the latest support levels.

Planning work sheets

Use the planning work sheets to help you plan for your cluster environment.

Tip: Keep the planning work sheets in a location that is accessible to the system administrators and service representatives for the installation, and for future reference during maintenance, upgrade, or repair actions.

All the work sheets are available in this section.

Using the planning work sheets

The planning work sheets do not cover every situation that you might encounter (especially the number of slots in a frame, servers in a frame, or I/O slots in a server). However, they can provide enough information on which you can build a custom work sheet for your application. In some cases, you might find it useful to create the work sheets in a spreadsheet application so that you can fill out repetitive information. Otherwise, you can devise a method to indicate repetitive information in a formula on printed work sheets so that you do not have to complete large numbers of work sheets for a large cluster that is likely to have a definite pattern in frame, server, and switch configuration.

The planning work sheets can be completed in the following order. You will need to refer back to some of the work sheets as you record the new information.
1. “Cluster summary work sheet” on page 61
2. “Frame and rack planning work sheet” on page 63
3. “Server planning work sheet” on page 64
4. Applicable vendor software/firmware planning work sheets
5. “QLogic switch planning work sheets” on page 66
6. “QLogic Fabric Management work sheets” on page 72
7. “Cluster Systems Management planning work sheet” on page 77

For examples of completed planning work sheets, see the examples that follow each blank work sheet.

Cluster summary work sheet

Use the cluster summary work sheet to record information for your cluster planning.

Record your cluster planning information in the following work sheet.

Table 30. Cluster summary work sheet

Cluster summary work sheet

Cluster name:

Application (high-performance computing (HPC) or not):

Number and types of servers:

Number of servers and host channel adapters (HCAs) for each server:

Note: If there are servers with varying numbers of HCAs, list the number of servers with each configuration. Forexample, 12 servers with one 2-port HCA; four servers with two 2-port HCAs.

Number and types of switches (include model numbers):

Number of subnets:

List of global identifier (GID) prefixes and subnet masters (assign a number to a subnet for easy reference):

Switch partitions:

Number and types of frames (include systems, switches, management servers, NIM servers, or distribution servers):

Number of Hardware Management Consoles (HMCs):

Will Cluster Systems Management (CSM) and Cluster-Ready Hardware Server be used?

If Yes, server model:

Number and models of fabric management servers:

Number of Service virtual local area networks (VLANs):

Service VLAN domains:

Service VLAN DHCP server locations:

Service VLAN InfiniBand switches static IP addresses (not typical):

Service VLAN HMCs with static IP:

Service VLAN DHCP ranges:

Number of cluster VLANs:

Cluster VLAN domains:

Cluster VLAN DHCP server locations:

Cluster VLAN InfiniBand switches static IP addresses:

Cluster VLAN HMCs with static IP:

Cluster VLAN DHCP ranges:

NIM server information:

Distribution server information:

NTP server information:

Power requirements:

Maximum cooling required:

Number of cooling zones:

Maximum weight per area: Minimum weight per area:

The following work sheet is an example of a completed cluster summary work sheet.

Table 31. Example: Completed cluster summary work sheet

Cluster summary work sheet

Cluster name: Example

Application (high-performance computing (HPC) or not): HPC

Number and types of servers: (96) 9125-F2A

Number of servers and host channel adapters (HCAs) per server:

Each 9125-F2A has one HCA

Note: If there are servers with varying numbers of HCAs, list the number of servers with each configuration. Forexample, 12 servers with one 2-port HCA; four servers with two 2-port HCAs.

Number and types of switches (include model numbers):

(4) 9140 (require connections for fabric management servers and for 9125-F2A servers)

Number of subnets: 4

List of GID prefixes and subnet masters (assign a number to a subnet for easy reference):

subnet 1 = FE:80:00:00:00:00:00:00 (egf11fm01)

subnet 2 = FE:80:00:00:00:00:00:01 (egf11fm02)

subnet 3 = FE:80:00:00:00:00:00:00 (egf11fm01)

subnet 4 = FE:80:00:00:00:00:00:01 (egf11fm02)

Switch partitions:

Number and types of frames (include systems, switches, management servers, Network Installation Management (NIM) servers (AIX), or distribution servers (Linux)):

(8) for 9125-F2A

(1) for switches, Cluster Systems Management/Management Server (CSM/MS), and fabric management servers,NIM server on CSM/MS

Number of Hardware Management Consoles (HMCs): 3

Will CSM and Cluster-Ready Hardware Server be used?: Yes

If Yes, server model:

Number and models of fabric management servers: (1) System x 3650

Number of service virtual local area networks (VLANs): 2

Service VLAN domains: 10.0.1.x, 10.0.2.x

Service VLAN DHCP server locations: egcsmsv01 (10.0.1.1) (CSM/MS)

Service VLAN InfiniBand switches static IP addresses (not typical): Not Applicable (see Cluster VLAN)

Service VLAN HMCs with static IP: 10.0.1.2 - 10.0.1.4

Service VLAN DHCP ranges: 10.0.1.32 - 10.0.1.128

Number of cluster VLANs: 1

Cluster VLAN domains: 10.1.1.x

Cluster VLAN DHCP server locations: egcsmsv01 (10.0.1.1) (CSM/MS)

Cluster VLAN InfiniBand switches static IP addresses: 10.1.1.10 - 10.1.1.13

Cluster VLAN HMCs with static IP: Not Applicable

Cluster VLAN DHCP ranges: 10.1.1.32 - 10.1.1.128

NIM server information: CSM/MS

Linux distribution server information: Not applicable

NTP server information: CSM/MS

Power requirements: See site planning

Maximum cooling required: See site planning

Number of cooling zones: See site planning

Maximum weight per area: Minimum weight per area: See site planning
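Because the work sheet records both static addresses and DHCP ranges for each VLAN, it is worth checking that no static address falls inside a DHCP range before the management consoles are configured. The following Python sketch uses the standard `ipaddress` module and the addresses from the example work sheet above. The helper names `expand` and `static_conflicts` are hypothetical and are not part of any IBM or QLogic tool.

```python
from ipaddress import IPv4Address


def expand(first, last):
    """List every address from first to last, inclusive."""
    start, end = int(IPv4Address(first)), int(IPv4Address(last))
    return [str(IPv4Address(n)) for n in range(start, end + 1)]


def static_conflicts(static_addrs, dhcp_first, dhcp_last):
    """Return the static addresses that fall inside the DHCP range."""
    low, high = IPv4Address(dhcp_first), IPv4Address(dhcp_last)
    return [a for a in static_addrs if low <= IPv4Address(a) <= high]


# Values taken from the example cluster summary work sheet.
hmc_static = expand("10.0.1.2", "10.0.1.4")         # service VLAN HMCs
switch_static = expand("10.1.1.10", "10.1.1.13")    # cluster VLAN switches

print(static_conflicts(hmc_static, "10.0.1.32", "10.0.1.128"))     # []
print(static_conflicts(switch_static, "10.1.1.32", "10.1.1.128"))  # []
```

An empty list means that the static assignments and the DHCP range do not overlap; any address that is returned should be moved before the plan is used.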

Frame and rack planning work sheet

The frame and rack planning work sheet is used for planning how to populate your frames or racks.

You must know the quantity of each device type, including server, switch, and bulk power assembly (BPA). For the slots, you can indicate the range of slots or drawers that the device populates. A standard method for naming slots can either be found in the documentation for the frames or servers, or you could choose to use EIA heights of 44.45 mm (1.75 in.) as a standard.

You can include frames for systems, switches, management servers, Network Installation Management (NIM) servers for AIX, distribution servers for Linux, and I/O devices.

Table 32. Frame and rack planning work sheet

Frame planning work sheet

Frame number or numbers: _____________________

Frame machine type and model number: _____________________

Frame size: _______________ 482.6 mm (19 in.) or 609.6 mm (24 in.)

Number of slots: ___________________

Slots | Device type (server, switch, BPA); indicate machine type and model number | Device name

The following work sheets are examples of completed frame planning work sheets.

Table 33. Example: Completed frame and rack planning work sheet (1 of 3)

Frame planning work sheet (1 of 3)

Frame number or numbers: _______1 - 8______________

Frame machine type and model number: ____for 9125-F2A_________________

Frame size: ____24___________ 482.6 mm (19 in.) or 609.6 mm (24 in.)

Number of slots: ______12_____________

Slots | Device type (server, switch, BPA); indicate machine type and model number | Device name

1 - 12 | Server 9125-F2A | egf[frame#]n[node#]: egf01n01 - egf08n12
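A naming convention such as egf[frame#]n[node#] lends itself to generating the repetitive work sheet entries instead of typing them. The following Python sketch reproduces the device names of the example above. The function name `server_names` and the `prefix` parameter are hypothetical; the two-digit zero padding is inferred from the example names egf01n01 and egf08n12.

```python
def server_names(frames, nodes_per_frame, prefix="eg"):
    """Build names of the form <prefix>f<frame>n<node>, zero-padded to 2 digits."""
    return [
        f"{prefix}f{frame:02d}n{node:02d}"
        for frame in frames
        for node in range(1, nodes_per_frame + 1)
    ]


# Frames 1 - 8 with 12 servers each, as in the example work sheet.
names = server_names(range(1, 9), 12)
print(names[0], names[-1], len(names))  # egf01n01 egf08n12 96
```

The count of 96 names matches the 96 servers of type 9125-F2A in the example cluster summary work sheet, which is a quick consistency check between the two work sheets.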

Table 34. Example: Completed frame and rack planning work sheet (2 of 3)

Frame planning work sheet (2 of 3)

Frame number or numbers: _______10______________

Frame machine type and model number: _____________________

Frame size: ____482.6 mm (19 in.)___________ 482.6 mm (19 in.) or 609.6 mm (24 in.)

Number of slots: ______4_____________

Slots | Device type (server, switch, BPA); indicate machine type and model number | Device name

1 - 4 | Switch 9140 | egf10sw1-4

5 | Power unit | Not applicable

Table 35. Example: Completed frame and rack planning work sheet (3 of 3)

Frame planning work sheet (3 of 3)

Frame number or numbers: _______11______________

Frame machine type and model number: _____482.6 mm (19 in.)________________

Frame size: ____482.6 mm (19 in.)___________ 482.6 mm (19 in.) or 609.6 mm (24 in.)

Number of slots: ______8_____________

Slots | Device type (server, switch, BPA); indicate machine type and model number | Device name

1 - 2 | System x 3650 | egf11fm01; egf11fm02

3 - 5 | HMCs | egf11hmc01 - egf11hmc03

6 | CSM/MS | egf11csm01

Server planning work sheet

You can use this work sheet as a template for multiple servers with similar configurations. For such cases, you can give the range of names of these servers and where they are located. You can also use the configuration note to remind you of other specific characteristics of the server. It is important to note the type of host channel adapters (HCAs) to be used.

Table 36. Server planning work sheet

Server planning work sheet

Name or names: _____________________________________________

Type or types: ______________________________________________

Frame or frames and slot or slots: ________________________________________

Number and type of HCAs: _________________________________

Number of logical partitions and logical host channel adapters (LHCAs): ____________________________________

IP addressing for InfiniBand: ________________________________________

IP addressing of service virtual local area network (VLAN): _____________________________________________________

IP addressing of cluster VLAN: ________________________________________________

Logical partition IP addressing: ____________________________________________________________

MPI addressing: ________________________________________________________________

Configuration notes:

HCA information

HCA | Capability (sharing) | HCA port | Switch connection | GID prefix

Logical partition information

Logical partition and LHCA (give name) | Operating system type | GUID index | Shared HCA (capability) | Switch partition
The following work sheet shows an example of a completed server planning work sheet.

Table 37. Example: Completed server planning work sheet

Server planning work sheet

Name or names: __________egf01n01 – egf08n12___________________________________

Type or types: _________9125-F2A_________________________________

Frame or frames and slot or slots: _______1-8/1-12_________________________________

Number and type of HCAs: ___(1) IBM GX+ per 9125-F2A______________________________

Number of logical partitions and LHCAs: ___1 and 4_________________________________

IP addressing for InfiniBand: _______10.1.2.32 - 10.1.2.128, 10.1.3.32 - 10.1.3.128, 10.1.4.x, 10.1.5.x___

IP addressing of service virtual local area network (VLAN): _10.0.1.32 - 10.0.1.128, 10.0.2.32 - 10.0.2.128__

IP addressing of cluster VLAN: ____10.1.1.32 - 10.1.1.128________________________

Logical partition IP addressing: _________10.1.5.32 - 10.1.5.128___________________________

MPI addressing: ________________________________________________________________

Configuration notes:

HCA information

HCA | Capability (sharing) | HCA port | Switch connection | GID prefix

C65 | Not applicable | C65 - T1 | Switch1:Frame1=Leaf1 Frame8=Leaf8 | FE:80:00:00:00:00:00:00

C65 | Not applicable | C65 - T2 | Switch2:Frame1=Leaf1 Frame8=Leaf8 | FE:80:00:00:00:00:00:01

C65 | Not applicable | C65 - T3 | Switch3:Frame1=Leaf1 Frame8=Leaf8 | FE:80:00:00:00:00:00:02

C65 | Not applicable | C65 - T4 | Switch4:Frame1=Leaf1 Frame8=Leaf8 | FE:80:00:00:00:00:00:03

Logical partition information

Logical partition and LHCA (give name) | Operating system type | GUID index | Shared HCA (capability) | Switch partition

egf01n01sq01 – egf08n12sq01 | AIX | 0 | Not applicable | Not applicable

QLogic switch planning work sheetsUse the appropriate QLogic switch planning work sheet for each type of switch.

66 Power Systems: Clustering with high-performance computing by using InfiniBand hardware

Page 77: Power Systems: Clustering with high-performance computing by … · Power Systems: Clustering with high-performance computing by using InfiniBand hardware

When documenting connections to switch ports, you can indicate both a short name for your own use and the IBM host channel adapter (HCA) physical locations.

For example, if you are connecting port 1 of a 24-port switch to port 1 of the only HCA in an IBM Power 575 that you are naming f1n1, you might want to use the short name f1n1-HCA1-Port1 to indicate this connection.

It might also be useful to note the IBM location code for this HCA port. You can get the location code information specific to each server in the server documentation during the planning process, or you can work with the IBM service representative at the time of the installation to make the correct notation of the IBM location code. Generally, the only piece of information not available during the planning phase is the server serial number, which is used as part of the location code.

Host channel adapters generally have the location code U[server feature code].001.[server serial number]-Px-Cy-Tz, where Px represents the planar into which the HCA is plugged, Cy represents the planar connector into which the HCA is plugged, and Tz represents the HCA port into which the cable is plugged.
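The naming conventions described above can be sketched in a few lines of Python. This is only an illustration for filling in work sheets, not an IBM tool. The function names are hypothetical, and the feature code and serial number in the usage lines are placeholders, because the real values come from the server documentation and from the installation.

```python
def hca_short_name(frame: int, node: int, hca: int, port: int) -> str:
    """Build a short name in the style of the f1n1-HCA1-Port1 example."""
    return f"f{frame}n{node}-HCA{hca}-Port{port}"


def hca_location_code(feature_code: str, serial: str,
                      planar: int, connector: int, port: int) -> str:
    """Build U[server feature code].001.[server serial number]-Px-Cy-Tz."""
    return f"U{feature_code}.001.{serial}-P{planar}-C{connector}-T{port}"


# Placeholder values: "FFFF" and "SERIAL0" are NOT real feature codes or serials.
print(hca_short_name(frame=1, node=1, hca=1, port=1))
print(hca_location_code("FFFF", "SERIAL0", planar=1, connector=65, port=1))
```

Because the serial number is usually unknown during planning, the short name can be recorded first and the location code added at installation time.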

Planning work sheet for 24-port switches:

Use this work sheet to plan for a 24-port QLogic switch.

Table 38. QLogic 24-port switch planning work sheet

24-port switch work sheet

Switch model: ______________________________

Switch name: ______________________________ (set by using the setIBNodeDesc command)

CSM device name: ___________________________

Frame and slot: ______________________________

Cluster virtual local area network (VLAN) IP address: _________________________________ Default gateway:___________________

GID-prefix: _________________________________

LMC: _____________________________________ (0=default 2= used in HPC cluster)

NTP server: _____________________________

Switch model, type, machine serial (MTMS): ______________________________ (Fill out during installation)

New administrator password: _________________________ (Fill out during installation)

Remote logging host: __________________________ (CSM/MS if possible)

Ports Connection

1 (16)

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

Planning work sheet for switches with more than 24 ports:

Use these work sheets for planning switches with more than 24 ports (those with leaf cards and spines).

The first work sheet is for planning the overall switch chassis. The second work sheet is for planning each leaf card.

Table 39. Planning work sheet for Director or core switch with more than 24 ports

Director or Core Switch (greater than 24 ports)

Switch model: _____________________________

Switch name: ____________________________ (set by using the setIBNodeDesc command)

Frame and slot: ____________________________

Chassis IP addresses (9240 has two hemispheres): ____________________________________________________

Spine IP addresses: _____________________________________________________________ (indicate spine slot)

Default gateway: _________________________________________________

GID-prefix: ______________________________

LMC: __________________________________ (0 = default; 2 = used in HPC cluster)

NTP server: _____________________________

Switch model, type, machine serial (MTMS): ___________________________ (Fill out during installation)

New administrator password: _________________________ (Fill out during installation)

Remote logging host: __________________________ (Cluster Systems Management/Management Server (CSM/MS) if possible)

The following work sheet can be used to plan for each leaf card.

Table 40. Planning work sheet for Director or core switch with more than 24 ports - leaf card configuration

Leaf card _____ Leaf card ____

Ports Connection Ports Connection

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

The following work sheets are examples of the switch planning work sheets.

Table 41. Example: Planning work sheet for Director or core switch with more than 24 ports (1 of 4)

Director or Core Switch (greater than 24 ports)

Switch Model: ____9140_________________________

Switch name: _____egsw01_______________________ (set by using setIBNodeDesc)

Frame and slot: ____f10s01________________________

Chassis IP addresses: _________10.1.1.10___________________________________________

(9240 has two hemispheres)

Spine IP addresses: _____slot1=10.1.1.16, slot2=10.1.1.20 _____________ (indicate spine slot)

Default gateway: _________________________________________________

GID-prefix: _fe.80.00.00.00.00.00.00_________________

LMC: ___________2_______________________ (0=default 2= used in HPC cluster)

NTP server: ______CSM/MS_____________________

Switch model, type, machine serial (MTMS): ___________________________ (Fill out during installation)

New administrator password: _________________________ (Fill out during installation)

Remote logging host: _____CSM/MS_______________ (CSM/MS is recommended)

The following work sheet can be used to plan for each leaf card.

Table 42. Example: Planning work sheet for Director or core switch with more than 24 ports - leaf card configuration (2 of 4)

Leaf card __1___ Leaf card __2__

Ports Connection Ports Connection

1 f01n01-C65-T1 1 f02n01-C65-T1

2 f01n02-C65-T1 2 f02n02-C65-T1

3 f01n03-C65-T1 3 f02n03-C65-T1

4 f01n04-C65-T1 4 f02n04-C65-T1

5 f01n05-C65-T1 5 f02n05-C65-T1

6 f01n06-C65-T1 6 f02n06-C65-T1

7 f01n07-C65-T1 7 f02n07-C65-T1

8 f01n08-C65-T1 8 f02n08-C65-T1

9 f01n09-C65-T1 9 f02n09-C65-T1

10 f01n10-C65-T1 10 f02n10-C65-T1

11 f01n11-C65-T1 11 f02n11-C65-T1

12 f01n12-C65-T1 12 f02n12-C65-T1

Table 43. Example: Planning work sheet for Director or core switch with more than 24 ports - leaf card configuration (3 of 4)

Leaf card __7___ Leaf card __8__

Ports Connection Ports Connection

1 f07n01-C65-T1 1 f08n01-C65-T1

2 f07n02-C65-T1 2 f08n02-C65-T1

3 f07n03-C65-T1 3 f08n03-C65-T1

4 f07n04-C65-T1 4 f08n04-C65-T1

5 f07n05-C65-T1 5 f08n05-C65-T1

6 f07n06-C65-T1 6 f08n06-C65-T1

7 f07n07-C65-T1 7 f08n07-C65-T1

8 f07n08-C65-T1 8 f08n08-C65-T1

9 f07n09-C65-T1 9 f08n09-C65-T1

10 f07n10-C65-T1 10 f08n10-C65-T1

11 f07n11-C65-T1 11 f08n11-C65-T1

12 f07n12-C65-T1 12 f08n12-C65-T1

A similar pattern to the previous work sheets is used for the next three switches. Only the work sheet for the fourth switch is shown.
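The pattern in the example work sheets can be sketched in Python, which might help when filling in the work sheets for the switches that are not shown. This is only an illustration inferred from the examples. The function names are hypothetical and are not part of any IBM or QLogic tool. The sketch assumes that leaf card N serves frame N, that leaf port P connects to node P, and that switch S connects to HCA port T(S) and uses a GID prefix whose last byte is S - 1.

```python
from typing import Dict


def leaf_connections(switch: int, leaf: int, nodes_per_frame: int = 12,
                     hca: str = "C65") -> Dict[int, str]:
    """Map each leaf-card port to a connection name such as f01n01-C65-T1.

    Assumption taken from the example work sheets: leaf N = frame N,
    port P = node P, and switch S attaches to HCA port T<S>.
    """
    return {
        port: f"f{leaf:02d}n{port:02d}-{hca}-T{switch}"
        for port in range(1, nodes_per_frame + 1)
    }


def gid_prefix(switch: int) -> str:
    """GID prefix for the subnet of switch S; the last byte is S - 1."""
    return "fe.80.00.00.00.00.00.{:02x}".format(switch - 1)


# Leaf card 8 of the fourth switch, as in the last example work sheet.
for port, name in leaf_connections(switch=4, leaf=8).items():
    print(port, name)
print(gid_prefix(4))
```

If a cluster does not follow the one-frame-per-leaf layout of the example, the mapping must be adjusted accordingly.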

Table 44. Example: Planning work sheet for Director or core switch with more than 24 ports (4 of 4)

Director or Core Switch (greater than 24 ports)

Switch Model: ____9140_________________________

Switch name: _____egsw04_______________________ (set by using the setIBNodeDesc command)

Frame and slot: ____f10s04________________________

Chassis IP addresses: _________10.1.1.13___________________________________________

(9240 has two hemispheres)

Spine IP addresses: _____slot1=10.1.1.19, slot2=10.1.1.23 _____________ (indicate spine slot)

Default gateway: _________________________________________________

GID-prefix: _fe.80.00.00.00.00.00.03_________________

LMC: ___________2_______________________ (0 = default; 2 = used in HPC cluster)

NTP server: ______CSM/MS_____________________

Switch MTMS: ___________________________ (Fill out during installation)

New administrator password: _________________________ (Fill out during installation)

Remote logging host: _____CSM/MS_______________ (CSM/MS if possible)

Table 45. Example: Planning work sheet for Director or core switch with more than 24 ports - leaf card configuration

Leaf card __1___ Leaf card __2__

Ports Connection Ports Connection

1 f01n01-C65-T4 1 f02n01-C65-T4

2 f01n02-C65-T4 2 f02n02-C65-T4

3 f01n03-C65-T4 3 f02n03-C65-T4

4 f01n04-C65-T4 4 f02n04-C65-T4

5 f01n05-C65-T4 5 f02n05-C65-T4

6 f01n06-C65-T4 6 f02n06-C65-T4

7 f01n07-C65-T4 7 f02n07-C65-T4

8 f01n08-C65-T4 8 f02n08-C65-T4

9 f01n09-C65-T4 9 f02n09-C65-T4

10 f01n10-C65-T4 10 f02n10-C65-T4

11 f01n11-C65-T4 11 f02n11-C65-T4

12 f01n12-C65-T4 12 f02n12-C65-T4

Table 46. Example: Planning work sheet for Director or core switch with more than 24 ports - leaf card configuration

Leaf card __7___ Leaf card __8__

Ports Connection Ports Connection

1 f07n01-C65-T4 1 f08n01-C65-T4

2 f07n02-C65-T4 2 f08n02-C65-T4

3 f07n03-C65-T4 3 f08n03-C65-T4

4 f07n04-C65-T4 4 f08n04-C65-T4

5 f07n05-C65-T4 5 f08n05-C65-T4

6 f07n06-C65-T4 6 f08n06-C65-T4

7 f07n07-C65-T4 7 f08n07-C65-T4

8 f07n08-C65-T4 8 f08n08-C65-T4

9 f07n09-C65-T4 9 f08n09-C65-T4

10 f07n10-C65-T4 10 f08n10-C65-T4

11 f07n11-C65-T4 11 f08n11-C65-T4

12 f07n12-C65-T4 12 f08n12-C65-T4

QLogic Fabric Management work sheets

Use the QLogic Fabric Management work sheet to plan QLogic Fabric Management.

This work sheet highlights information that is important for management subsystem integration in high-performance computing (HPC) clusters that use an InfiniBand network. It is not intended to replace the planning instructions found in the QLogic Installation and Planning Guides.

To plan thoroughly for QLogic Fabric Management, complete the following work sheets:
• General QLogic Fabric Management work sheet
• Embedded subnet manager work sheet (if applicable)
• Fabric management server work sheet

Table 47. General QLogic Fabric Management work sheet

Host-based or embedded subnet manager (SM): _______________________________

LMC: _______ (4 is default)

MTU: Chassis: ______________ Broadcast: ________________ MTU rate for broadcast: _____________________

Fabric management server names and addresses on cluster VLAN: ____________________________________________

_____________________________________________________________________________________________

Embedded subnet manager switches: __________________________________________________________________

_____________________________________________________________________________________________

Primary subnet manager locations: ____________________________________________________________________

_____________________________________________________________________________________________

Backup subnet manager locations: ____________________________________________________________________

_____________________________________________________________________________________________

Primary fabric management server as fabric diagnosis collector: _______________________________________________

CSM server addresses for remote logging: _______________________________________________________________

NTP server:

Notes:

The following work sheet shows an example of a completed General QLogic Fabric Management work sheet.

Table 48. Example: Completed General QLogic Fabric Management work sheet

General QLogic Fabric Management work sheet

Host-based or embedded SM: _____Host-based____________________

LMC: __2___ (4 is default)

MTU: Chassis: ___4096__________ Broadcast: ___4096___ MTU rate for broadcast: _____4096______

Fabric management server names and addresses on cluster VLAN: _____egf11fm01;egf11fm02__________________________

_____________________________________________________________________________________________

Embedded subnet manager switches: ______Not applicable______________________________________

_____________________________________________________________________________________________

Primary subnet manager location: ___subnet1 & 3=egf11fm01; subnet2 & 4 = egf11fm02 __________

_____________________________________________________________________________________________

Backup subnet manager locations: ___subnet1 & 3=egf11fm02; subnet2 & 4 = egf11fm01 _________

_____________________________________________________________________________________________

Primary Fabric/MS as fabric diagnosis collector: _______egf11fm01 _______________________________

CSM server address or addresses for remote logging: __________10.1.1.1_________________________________

NTP server: 10.1.1.1

Notes:

The following work sheet is for planning an embedded subnet manager. Most HPC cluster installations use host-based subnet managers.

Table 49. Embedded subnet manager work sheet

Embedded subnet manager work sheet ESM or HSM to be used? ___________

License obtained from vendor:

CSM server address or addresses for remote logging: _________________________________________________________

Subnet 1 Subnet 2 Subnet 3 Subnet 4 Subnet 5 Subnet 6 Subnet 7 Subnet 8

Primary switch/priority

Backup switch/priority

Backup switch/priority

Backup switch/priority

Broadcast MTU (put rate in parentheses)

LMC

GID prefix

smAppearanceMsgThresh 10 10 10 10 10 10 10 10

Notes:

The following work sheet is used to plan fabric management servers. A separate work sheet can be filled out for each server. It is intended to highlight information that is important for management subsystem integration in HPC clusters with an InfiniBand network. It is not intended to replace planning instructions found in the QLogic Installation and Planning Guides.

Note: On any given subnet, or group of subnets, the backup fabric management server must have a symmetrical configuration to that of the primary fabric management server. This means that a host channel adapter (HCA) device number and port on the backup must be attached to the same subnet as the corresponding HCA device number and port on the primary.
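The symmetry requirement in the preceding note can be checked mechanically once the work sheets for both servers are filled in. The following Python sketch is only an illustration. The function name and the data layout are assumptions, and the sample values are hypothetical, loosely following the HCA numbers and ports of the example work sheet for a four-subnet cluster.

```python
from typing import Dict, List, Tuple

# (HCA device number, HCA port) -> subnet number, as recorded on a work sheet.
PortMap = Dict[Tuple[int, int], int]


def symmetry_mismatches(primary: PortMap, backup: PortMap) -> List[Tuple[int, int]]:
    """Return the (HCA, port) pairs whose subnet differs between the servers.

    A pair that is cabled on only one of the two servers is also reported.
    """
    pairs = sorted(set(primary) | set(backup))
    return [pair for pair in pairs if primary.get(pair) != backup.get(pair)]


# Hypothetical work sheet data for a primary server and two candidate backups.
primary = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}
backup_ok = dict(primary)
backup_bad = {(1, 1): 1, (1, 2): 3, (2, 1): 2, (2, 2): 4}

print(symmetry_mismatches(primary, backup_ok))
print(symmetry_mismatches(primary, backup_bad))
```

An empty result means that each HCA device number and port on the backup is attached to the same subnet as on the primary.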

Table 50. Fabric management server work sheet

Fabric management server work sheet (one for each server)

Server name: _________________________________________________________________________________

Server IP address on cluster virtual local area network (VLAN): ____________________________________________

Server model (System x 3550 or 3650): _____________

Frame: ________________________

Number of PCI slots: _______________________________________

Number of HCAs: _________________________________________

Primary/backup/NA HSM: _______________________________________________________________________

Primary data collection point? ___________________________________________________________ (Yes or No)

Local syslogd is syslog, syslog-ng, or other: ____________________________________________________________

CSM server address for remote logging: ______________________________________________________________

Using TCP or UDP for remote logging: ______________________________

NTP server: ___________________________________________________

Subnet management planning

Subnet 1 Subnet 2 Subnet 3 Subnet 4 Subnet 5 Subnet 6 Subnet 7 Subnet 8

HCA number

HCA port

GID prefix

Broadcast MTU (put rate in parentheses)

node_appearance_msg_thresh 10 10 10 10 10 10 10 10

Primary switch/Priority

Backup switch/Priority

Backup switch/Priority 10 10 10 10 10 10 10 10

Backup switch/Priority

Fast Fabric Toolset Planning

Host-based or embedded SM? ________________________________________________ (for FF_ALL_ANALYSIS)

List of switch chassis: ____________________________________________________________________________

_____________________________________________________________________________________________

List of switches running embedded SM: (if applicable) _____________________________________________________

_____________________________________________________________________________________________

Subnet connectivity planning is in the previous Subnet Management planning work sheet.

Chassis list files: _____________________________________________________________________________

Host list files: _______________________________________________________________________________

Notes:

The following work sheet shows an example of a completed fabric management server work sheet.

Table 51. Example: Completed fabric management server work sheet

Fabric management server work sheet (one for each server)

Server name: _______ egf11fm01 ____________________________

Server IP address on cluster virtual local area network (VLAN): ___10.1.1.14 _________________________

Server model (System x 3550 or 3650): 3650

Frame: ___11___

Number of PCI slots: ___2___________________________

Number of HCAs: _____2___________________________

Primary/backup/NA HSM: ____Primary subnet 1 and 3 backup subnet 2 and 4 ____________________

Primary data collection point?__Yes______________________________________________ (Yes or No)

Local syslogd is syslog, syslog-ng, or other: __syslog-ng_________________________________________

CSM server address for remote logging: _____10.1.1.1_________________________________________

Using TCP or UDP for remote logging: ______UDP________________________

NTP server: ___________________________10.1.1.1________________________

Subnet management planning

 | Subnet 1 | Subnet 2 | Subnet 3 | Subnet 4 | Subnet 5 | Subnet 6 | Subnet 7 | Subnet 8
HCA number | 1 | 1 | 2 | 2 | | | |
HCA port | 1 | 2 | 1 | 2 | | | |
GID prefix (all start w/fe.80.00.00.00.00.00) | 00 | 01 | 02 | 03 | | | |
Broadcast MTU (put rate in parentheses) | 5 (4096) | 5 (4096) | 5 (4096) | 5 (4096) | | | |
node_appearance_msg_thresh | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
Primary switch/priority | 2 | 1 | 2 | 1 | | | |
Backup switch/priority | | | | | | | |
Backup switch/priority | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
Backup switch/priority | | | | | | | |

Fast Fabric Toolset planning

Host-based or embedded SM? _______ Host-based_____________________ (for FF_ALL_ANALYSIS)

List of switch chassis: ___10.1.1.16; 10.1.1.17; 10.1.1.18; 10.1.1.19 _____________________________

__________________________________________________________________________________________

List of switches running embedded SM: (if applicable) _____ Not applicable _________________________

__________________________________________________________________________________________

Subnet connectivity planning is in the Subnet Management Planning work sheet.

Chassis list files: _____AllSwitches (list all switches) _________________________________________

Host list files: _______AllFM (list all Fabric MS) ____________________________________________

Notes:

Cluster Systems Management planning work sheet

Use the Cluster Systems Management (CSM) planning work sheet to plan for your CSM management servers.

The CSM work sheet is intended to highlight information that is important for management subsystem integration in high-performance computing (HPC) clusters with an InfiniBand network. It is not intended to replace the planning instructions in the CSM Installation and Planning Guide.

If you have multiple Cluster Systems Management/Management Servers (CSM/MS), complete a work sheet for each server. The Switch Remote Command Setup and the fabric management server Remote Command Setup allow for multiple devices to be defined.

The Event Monitoring work sheet allows for multiple Sensor and Response mechanisms to be documented.

Table 52. CSM planning work sheet

CSM Planning work sheet

CSM/MS name: ____________________________________________________________

CSM/MS IP addresses: Service VLAN:___________________________ Cluster VLAN: _______________

CSM/MS operating system: ___________________________________________________

NTP Server: _________________________________________________________________

Server model: ___________________ Frame: ______________________________________

syslog or syslog-ng or other syslogd command ______________________________________________________

Switch Remote Command Setup

DeviceType = IBSwitch::QLogic (for QLogic Switches) __________________________________________

RemoteShellUser = admin ____________________ (it must be different from admin)

RemoteShell = ssh

RemoteCopyCmd = /usr/bin/scp

Device names/addresses of switches: _______________________________________________________________

______________________________________________________________________________________________

______________________________________________________________________________________________

Device groups for switches:

Fabric management server Remote Command Setup

DeviceType = FabricMS

RemoteShellUserID = ____________________ (root = default)

RemoteShell = ssh

RemoteCopyCmd = /usr/bin/scp

Device names or addresses of Fabric/MS: ___________________________________________________________

Device groups for Fabric/MS: ____________________________________________________________________

Primary Fabric/MS for data collection:

The following work sheet is an example of a completed CSM planning work sheet.

Table 53. Example: Completed CSM planning work sheet

CSM Planning work sheet

CSM/MS Name: _______egcsm01_____________________________________________________

CSM/MS IP addresses: service VLAN:___10.0.1.1 10.0.2.1________________ Cluster VLAN: __10.1.1.1___

CSM/MS Operating System: ________AIX 5.3_________________________________

NTP Server: _______________CSM/MS__________________________________________________

Server Model: _System p_520_ Frame: 11__

syslog or syslog-ng or other syslogd command ___________syslog_____________________

Switch Remote Command Setup

DeviceType = IBSwitch::QLogic (for QLogic Switches) _______IBSwitch::QLogic_____________________

RemoteShellUser = admin ____________________ (it must be different from admin)

RemoteShell = ssh

RemoteCopyCmd = /usr/bin/scp

Device names/addresses of switches: _____egf11sw01, egf11sw02, egf11sw03, egf11sw04______

______________________________________________________________________________________________

______________________________________________________________________________________________

Device groups for switches: AllIBSwitches

Fabric management server Remote Command Setup

DeviceType = FabricMS

RemoteShellUserID = __root___ (root = default)

RemoteShell = ssh

RemoteCopyCmd = /usr/bin/scp

Device names or addresses of Fabric/MS: _____egf11fm01; egf11fm02____________________________

Device groups for Fabric/MS: __________AllFMS MasterFMS BackupFMS_______________________

Primary Fabric/MS for data collection: egf11fm01

The following CSM Event Monitoring work sheet is used to document multiple Sensor and Response mechanisms.

Table 54. CSM event monitoring work sheet

CSM Event Monitoring work sheet

syslog or syslog-ng or other: ___________________________________________________

Accept logs from IP address (0.0.0.0): ________________________________________________________ (yes=default)

Fabric management server logging: TCP or UDP? ___________ port: _______ (514 default)

Fabric management server IP addresses:__________________________________________________________________________________

Switch logging is UDP protocol: port: __________________ (514 default)

Switch chassis IP address:_____________________________________________________________________________________________________

___________________________________________________________________________________________________________________________

Notice: File or named pipe | Info: File or named pipe | Sensor | Condition | Response

Notes:

The following work sheet shows an example of a completed CSM event monitoring work sheet.

Table 55. Example: Completed CSM event monitoring work sheet

CSM Event Monitoring work sheet

syslog or syslog-ng or other: _______syslog_______________________________________

Accept logs from IP address (0.0.0.0): _________Yes____________________________________ (yes=default)

Fabric management server logging: TCP or UDP? __UDP________ port: __514__ (514 default)

Fabric management server IP addresses: _________10.1.1.14; 10.1.1.15_______________________

Switch logging is UDP protocol: port: ___514_______ (514 default)

Switch chassis IP address: __________10.1.1.16; 10.1.1.17; 10.1.1.18;10.1.1.19____________________________________________________

___________________________________________________________________________________________________________________________

Notice: File or named pipe | Info: File or named pipe | Sensor | Condition | Response
/var/log/csm/fabric.syslog.notices | /var/log/csm/fabric.syslog.info | AIXSyslogSensor | LocalAIXNodeSyslog | LogNodeErrorLogEntry

Notes:

Installing a high-performance computing (HPC) cluster with an InfiniBand network

Learn how to physically install your management subsystems and InfiniBand hardware and software, including the required steps for the hardware and software installation.

Do not proceed unless you have read “Planning for clusters” on page 28, or unless your role in the installation has been planned by someone who has read that information.

Before beginning any installation procedure, see the IBM Clusters with the InfiniBand Switch Web site for the most current release information.

The information in this topic does not cover the installation of I/O devices other than those in the InfiniBand network. All I/O devices that are not InfiniBand devices are considered part of the server installation procedure.

Overview of the installation tasks

Learn about the general tasks for installing a high-performance computing (HPC) cluster with an InfiniBand network.

The general installation process consists of the following tasks:
1. Separate the installation tasks so that they are based on the generalized tasks and the people responsible for them. Understand the installation responsibilities by organization and the responsibilities of units and devices.
2. Ensure that you understand the planning installation flow. It is crucial to understand the merge points to coordinate a successful installation.
3. The detailed installation instructions follow a specific order of installation. The major task numbers found in the Order of installation topic are referenced in the detailed installation instructions. The detailed instructions might contain several steps for completing a major task.
4. If you are not performing a new installation, you might be expanding an existing cluster or adding functions to support an InfiniBand network. In this case, see the topics about cluster expansion or additions to complete the partial installation.

Related concepts

“Installation responsibilities by organization” on page 50
Use this information to find who is responsible for various aspects of the installation.

“Installation responsibilities of units and devices” on page 51
Use this information to determine who is responsible for the installation of units and devices.

“Planning installation flow” on page 50
Use this information to learn about the key installation points, the organizations responsible for installation, the installation responsibilities for units and devices, and the order that components are installed. Use the installation coordination work sheets to record the information you planned in these topics.

“Order of installation” on page 52
Use this information to learn the tasks required for installing a new cluster.

Installation responsibilities for the IBM service representatives

IBM service representatives are responsible for installing the IBM machine types that are IBM installable, as opposed to those that are customer installable. In addition to the normal repair responsibilities during an installation, IBM service is responsible for repairing the InfiniBand cables and host channel adapters (HCAs).

IBM service representatives are responsible for implementing the following installation instructions:
• For an IBM installable Hardware Management Console (HMC), see “Installing the Hardware Management Console” on page 87.
• For IBM installable servers, see “Installing and configuring the cluster server hardware” on page 107.

Cluster expansion or partial installation

If you are performing an expansion or partial installation, you need to perform a subset of the tasks that are required for a full installation.

Use the following table to determine which tasks need to be performed for a cluster expansion or partial installation.

Table 56. Cluster expansion or partial installation determination

| Task | Adding InfiniBand hardware to an existing cluster (switches and host channel adapters (HCAs)) | Adding new servers to an existing InfiniBand network | Adding HCAs to an existing InfiniBand network | Adding a subnet to an existing InfiniBand network | Adding servers and a subnet to an existing InfiniBand network |
|---|---|---|---|---|---|
| “Setting up site power, cooling, and floor” | Yes | Yes | Floor tile cut-outs for cables | Yes | Yes |
| “Installing and configuring the management subsystem” on page 83 | Yes | Yes (for installation images) | No | Yes | Yes |
| “Installing and configuring the cluster server hardware” on page 107 | Yes | Yes | Yes | No | Yes |
| “Installing and configuring vendor InfiniBand switches” on page 116 | Yes | Yes, if Cluster Systems Management (CSM) and Cluster-Ready Hardware Server (CRHS) are used [1] | No | Yes | Yes |
| “Attaching cables to the InfiniBand network” on page 120 | Yes | Yes | Yes | Yes | Yes |
| “Verifying the InfiniBand network topology and operation” on page 122 | Yes | Yes | Yes | Yes | Yes |

[1] This occurs when:
• A single Hardware Management Console (HMC) is in an existing cluster, and at least one more HMC is added to the cluster.
• Servers are being added to an existing cluster.
• Servers that were added require you to add one or more HMCs.
• You must use CSM and CRHS, and configure the switches with a static Internet Protocol (IP) address on the cluster network.

Setting up site power, cooling, and floor

Use this information to learn about several major tasks that are part of the installation flow.

The site setup for power, cooling, and the floor encompasses major task U1 as shown in Figure 10 on page 54. The setup for the power, cooling, and floor construction must be completed before installing the cluster. This setup must meet all documented requirements for the individual units, frames, systems, and adapters in the cluster. These tasks are performed by the customer, an IBM installation planning representative, or a third-party contractor. All applicable IBM and vendor documentation must be consulted.


Note: If you are installing host channel adapters (HCAs) into existing servers, you need only perform operations that pertain to cable routing and floor tile cut-outs.

Installing and configuring the management subsystem

Use this information to learn about the requirements for installing and configuring the management subsystem.

The management subsystem installation and configuration encompass major tasks M1 through M4 as shown in Figure 10 on page 54.

This procedure is not a detailed description of how to install the management subsystem components, because such procedures are described in detail in documentation for the individual devices and applications. This procedure documents the order of installation and key points that you need to consider in installing and configuring the management consoles.

This task is the most complex area of a high-performance computing (HPC) cluster installation. It is affected by, and affects, the other areas (such as server installation and switch installation). Many tasks can be performed simultaneously, while others must be done in a specific order.

You install and configure the following items:
• Hardware Management Console (HMC)
• A service virtual local area network (VLAN)
• A cluster VLAN
• A fabric management server
• A Cluster Systems Management/Management Server (CSM/MS)

You need to set up and configure a Network Installation Management (NIM) server that contains a Shared Product Object Tree (SPOT) to run diagnostics for servers without removable media capabilities (CD and DVD drives). The diagnostics are only available with the AIX operating system and require an AIX NIM SPOT even if logical partitions are running the Linux operating system.

If your logical partitions are running the Linux operating system, you also need a distribution server for updating the operating system that is to be used on the logical partitions in servers without removable media capabilities.

While it is typical to use the CSM/MS as the Dynamic Host Configuration Protocol (DHCP) server for the service VLAN, if a separate DHCP server is installed, follow the DHCP installation tasks in the “Installing the CSM Management Server” on page 89 topic.

The management consoles that are to be installed are the HMC, the CSM/MS, and the fabric management server. The management consoles are the key to successfully installing and configuring the cluster because they are central to the management subsystem. Before you perform any startup and configuration, these devices must be installed and configured so that they are ready to discover and manage the rest of the devices in the cluster.

During management subsystem installation and configuration, perform the following tasks, which are illustrated in Figure 11 on page 85. While many of the steps within the procedures can be performed simultaneously, pay attention to where they converge and where one task might be a prerequisite for another task. When there is a prerequisite, a triangle symbol is shown. The following steps are the major tasks for installation:
1. Physically place units on the data center floor.
2. Install and configure service and cluster VLAN devices by using the procedure in “Installing and configuring service VLAN devices” on page 87.


3. Install the HMCs. See “Installing the Hardware Management Console” on page 87.
4. Install the CSM Management Server. See “Installing the CSM Management Server” on page 89.
5. Install the operating system installation servers. See “Installing operating system installation servers” on page 90.
6. Install the fabric management server. See “Installing the fabric management server” on page 91.
7. Perform server installation and configuration with management consoles. See “Installing and configuring the cluster server hardware” on page 107.
8. Configure remote logging from the switches and the fabric management servers to the CSM/MS. See “Setting up remote logging” on page 95.
9. Configure the remote command processing capability from the CSM/MS to the switches and fabric management servers. See “Setting up remote command processing” on page 103.

The tasks have reference labels to use as cross-references between figures and procedures. The first label is from Figure 11 on page 85 and the second is from Figure 10 on page 54. For example, E1 (M1) indicates task label E1 in Figure 11 and task label M1 in Figure 10.

Steps that have a shaded background are steps that are performed under “Installing and configuring vendor InfiniBand switches” on page 116.


Installing and configuring the management subsystem for a cluster expansion or addition

If you are adding or expanding InfiniBand network capabilities to an existing cluster, you might need to approach the management subsystem installation and configuration differently than with a new cluster installation.

Figure 11. Management subsystem installation tasks


The flow for the management subsystem installation and configuration task is based on a new cluster installation, but it indicates where variances occur for expanding clusters.

The following table shows how a new cluster installation is affected or altered by various expansion scenarios.

Table 57. Impact of cluster expansions

Scenario: Adding InfiniBand hardware to an existing cluster (switches and host channel adapters (HCAs))
You need to:
1. Add cables to the Ethernet ports on the InfiniBand switch service subsystem.
2. Consider additional service subsystem Ethernet switches or routers to accommodate new InfiniBand switches.
3. Install a fabric management server.
4. If there are multiple HMCs in the existing cluster, you must use Cluster Systems Management (CSM) and Cluster-Ready Hardware Server (CRHS).
5. Add remote syslog function capability from the fabric management server and switches to CSM.
6. Add remote processing capability from the CSM to the fabric management server and switches.

Scenario: Adding new servers to an existing InfiniBand network
You need to:
1. Add cables to the subsystem Ethernet ports on the server.
2. Build an update mechanism on the operating system for new servers that have no removable media capabilities.
3. Consider additional HMCs to accommodate the new servers. If expanding beyond a single HMC or adding Cluster Systems Management/Management Server (CSM/MS) and CRHS, you might need to unconfigure the current Dynamic Host Configuration Protocol (DHCP) services on the existing HMC, and reconfigure DHCP on the CSM/MS or on another DHCP server.
4. Consider additional Ethernet switches or routers for the service subsystem Ethernet switches to accommodate new switches and servers.

Scenario: Adding HCAs to an existing InfiniBand network
You need to:
Adding HCAs to an existing InfiniBand network should not affect the management or service subsystem.

Scenario: Adding a subnet to an existing InfiniBand network
You need to:
1. Add cables to the Ethernet ports on the InfiniBand switch service subsystem.
2. Consider additional Ethernet switches or routers for the service subsystem to accommodate new InfiniBand switches.
3. Consider additional fabric management servers, which would affect CSM event monitoring and remote command access of the additional fabric management servers.
4. Add remote syslog function capability from the new fabric management server and switches to CSM.
5. Add remote processing capability from CSM to the new fabric management server and switches.

Scenario: Adding servers and a subnet to an existing InfiniBand network
You need to:
1. Add cables to the Ethernet ports on the InfiniBand switch service subsystem.
2. Consider additional cables to the Ethernet ports on the service subsystem of the server.
3. Build operating system update mechanisms for new servers that have no removable media capabilities.
4. Consider additional HMCs to accommodate the new servers. If expanding beyond a single HMC or adding CSM/MS and Cluster-Ready Hardware Server, you might need to unconfigure the current DHCP services on the existing HMC and reconfigure DHCP on the CSM/MS, or on another DHCP server.
5. Consider additional Ethernet switches or routers for the service subsystem Ethernet switches to accommodate new InfiniBand switches and servers.
6. Consider additional fabric management servers, which would affect CSM event monitoring and remote command access of the additional fabric management servers.
7. Add remote syslog function capability from the new fabric management server and switches to CSM.
8. Add remote processing capability from the CSM to the new fabric management server and switches.

Installing and configuring service VLAN devices

Use this procedure to learn about the correct times to cable units to the service VLAN if you are responsible for installing and configuring the service virtual local area network (VLAN).
1. E1 (M1): Physically locate the service and cluster VLAN Ethernet devices on the data center floor.
2. E2 (M2): Install and configure the service and cluster VLAN Ethernet devices by using the documentation for the Ethernet devices and any configuration details provided by the Hardware Management Console (HMC) installation information.
3. (M3): Do not cable management consoles, servers, or switch units to the VLANs until you are instructed to do so within the installation procedure for each management console.

Note: The correct ordering of management console installation steps and cabling to the VLANs is important for a successful installation. Failure to follow the installation order can result in long recovery procedures.

Installing the Hardware Management Console

This installation procedure is for an IBM service representative.

Before starting this installation procedure, obtain the Hardware Management Console (HMC) installation instructions. Do not use these instructions until you are directed to do so within this procedure.

During the HMC installation, see the HMC information about the “Cluster summary work sheet” on page 61, which should have been completed during the planning phase for the cluster.

Notes:

• If there are multiple HMCs on the service virtual local area network (VLAN), do not set up the HMC as a Dynamic Host Configuration Protocol (DHCP) server as instructed. This would result in multiple DHCP servers on the service VLAN.


• Cluster Systems Management (CSM) should be used as the systems management application. It must be installed with Cluster-Ready Hardware Server (CRHS) under the following conditions:
  – You have more than one HMC.
  – You have opted to install CSM and CRHS in anticipation of future expansion.

To install the HMC, complete the following steps:

Note: The following tasks have reference labels to use as cross-references between figures and procedures. The first is from Figure 11 on page 85 and the second is from Figure 10 on page 54. For example, E1 (M1) indicates task label E1 in that figure and task label M1 in the High-level cluster installation flow topic.

1. H1 (M1): Perform the physical installation of the HMC hardware on the data center floor. HMCs might have a maximum distance restriction from the devices that they manage. Generally, you want to minimize the distance from the HMCs to their managed servers so that IBM service representatives can perform tasks efficiently. Also, if you are adding servers to an existing cluster, you might need to install one or more HMCs to manage the additional servers. If this is not a new cluster installation, you might not need to add more HMCs to the cluster.
2. H2 (M2): Before proceeding, ensure that the server frames and systems are not powered on and are not attached to the service VLAN.
3. H2 (M2): Perform the initial installation and configuration of the HMCs by using the HMC documentation. For details, see Managing the Hardware Management Console.

   Important: When the HMC installation documentation directs you to enable DHCP on the HMC, stop using the HMC installation procedure and go to step 4. You will be instructed to return to the HMC documentation after the appropriate steps have been completed in this procedure.

4. H3 (M2): Choose from the following items, and go to the appropriate step for your cluster:
   • If you are installing CSM and enabling CRHS, or if you are using a DHCP server for the service VLAN that is not an HMC, go to step 5.
   • If you are installing a cluster with a single HMC and you are not enabling a CRHS, go to step 6 on page 89.
5. H4 (M2): To install the HMCs in the management subsystem with CSM and CRHS or a DHCP server that is not an HMC, use the following procedure:

   Note: Perform this procedure if you are:
   • Installing a new cluster with CSM and a CRHS.
   • Adding an HMC to a cluster that already has CSM and a CRHS.
   • Adding an HMC to a cluster that has only a single HMC.
   • Adding an InfiniBand network to an existing cluster that has multiple HMCs and that is not currently using CSM and CRHS.

   a. Enable the CRHS with CSM to connect correctly to the service processors and bulk power controllers (BPCs) by using the systemid command on the Cluster Systems Management/Management Server (CSM/MS) to manage passwords. If you do not use the systemid command, CSM and CRHS cannot communicate correctly with the bulk power assemblies (BPAs) and service processors.
   b. Disable the DHCP server on the HMC and assign the HMC a static IP address so that there is only one DHCP server on the Ethernet service VLAN, and so that device discovery occurs from the CRHS in CSM.

      Note: If the HMC is currently managing devices, disabling DHCP on the HMC temporarily disconnects the HMC from its managed devices. If the current cluster already has CSM/MS and a CRHS or if the current cluster does not require an additional HMC, go to step 6 on page 89.


   c. Change existing HMCs from a DHCP server to a static IP address so that the address is within the Ethernet service VLAN subnet (provided by the customer) on the cluster, but outside of the DHCP address range.
   d. Restart the HMC.
6. H5 (M2): Return to the HMC installation documentation and finish the installation and configuration procedures. However, do not attach the HMC cables to the service VLAN until instructed to do so in step 9 of this procedure. After finishing HMC installation and configuration procedures, continue with step 7.
7. H6 (M2): Ensure that your HMCs are at the correct software and firmware levels. See the IBM Clusters with the InfiniBand Switch Web site for information regarding the most current level of the HMC. Follow the links in the readme file to the appropriate download Web sites and instructions.

8. Do not proceed until the following requirements have been met:
   a. The CSM/MS is set up as a DHCP server as described in “Installing the CSM Management Server,” or you have only a single HMC that remains as a DHCP server.
   b. The Ethernet devices for the service VLAN are installed and configured, as described in “Installing and configuring service VLAN devices” on page 87.
   c. If the CSM/MS is not the DHCP server for the service VLAN, you must wait for the DHCP server to be installed, configured, and cabled to the service VLAN.
9. H7 (M3): Cable the HMCs to the service VLAN.
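The static-address rule in step 5 (the HMC address must lie inside the service VLAN subnet but outside the DHCP address range) can be checked before the address is assigned. The following is a minimal sketch in POSIX shell, not part of the HMC or CSM tooling; the function names and all addresses are hypothetical and chosen only for illustration.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed when the address is inside the subnet but outside the DHCP range.
# Arguments: address, network address, prefix length, DHCP range start, DHCP range end.
static_ip_ok() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    lo=$(ip_to_int "$4")
    hi=$(ip_to_int "$5")
    [ $(( ip & mask )) -eq $(( net & mask )) ] || return 1   # not in the subnet
    [ "$ip" -lt "$lo" ] || [ "$ip" -gt "$hi" ]               # outside the DHCP range
}
```

For example, with a hypothetical service VLAN of 10.0.0.0/24 and a DHCP range of 10.0.0.100 through 10.0.0.200, `static_ip_ok 10.0.0.5 10.0.0.0 24 10.0.0.100 10.0.0.200` succeeds, while the same call with 10.0.0.150 fails because that address falls inside the DHCP range.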

Installing the CSM Management Server

The Cluster Systems Management/Management Server (CSM/MS) installation is performed by the customer.

Before proceeding, obtain the following information:
• The CSM Planning and Installation Guide for your AIX and Linux version. The guide can be found in the CSM library in the IBM Cluster Information Center.
• The server installation information for the CSM/MS machine type and model.

The following procedure is for installing the CSM/MS in the high-performance computing (HPC) cluster. See the “Cluster Systems Management planning work sheet” on page 77 that you completed during the planning phase for the cluster.

Note: The tasks have reference labels to use as cross-references between figures and procedures. The first is from Figure 11 on page 85 and the second is from Figure 10 on page 54. For example, E1 (M1) indicates task label E1 in the figure and task label M1 in the High-level cluster installation flow topic.

1. CM1 (M1): Perform the physical installation of the CSM/MS on the data center floor. If you are using a separate Dynamic Host Configuration Protocol (DHCP) server for the service or cluster virtual local area network (VLAN) that is being installed as part of this installation activity, also physically place it on the data center floor.
2. CM2 (M2): Perform the procedures in the CSM Installation Guide. When performing those procedures, you must ensure that you complete the following steps. If you are using a separate DHCP server for the service VLAN, perform the following steps for that server as well as for the CSM/MS, but do not perform the steps for configuring DHCP on the CSM/MS.
   a. Install the CSM/MS system hardware.
   b. Update the operating system on the CSM/MS.
   c. Install the CSM code on the CSM/MS.
   d. As appropriate, enable the CSM/MS as the DHCP server for the service VLAN and the cluster VLAN. If you are using a separate DHCP server, perform this step on that server instead of on the CSM/MS.
   e. Define the subnet ranges for the service and cluster VLANs. If you are using a separate DHCP server, perform this step on that server instead of on the CSM/MS.


   f. Configure the DHCP ranges for the servers and bulk power controllers (BPCs). If you are using a separate DHCP server, perform this step on that server instead of on the CSM/MS.
   g. Add the planned static IP addresses for the HMCs to the Cluster-Ready Hardware Server (CRHS) peer domain.
3. Do not proceed until the service and cluster VLAN Ethernet devices have been installed and configured as described in “Installing and configuring service VLAN devices” on page 87.
4. CM3 (M3): Cable the CSM/MS to the service and cluster VLANs. If you are using a separate DHCP server, cable it to the appropriate VLANs also.
5. CM4 (M4): Start the DHCP server on the CSM/MS, or if applicable, on a separate DHCP server. This step is a prerequisite for other installation tasks for servers and management consoles that require DHCP service from CRHS.
6. Use the information from the “Cluster Systems Management planning work sheet” on page 77 to enter the configuration information for the server in its /etc/motd.

Other procedures that involve the CSM/MS are part of L1 - L3 and R1 - R2, which are all part of major task M4.
• “Setting up remote logging” on page 95
• “Setting up remote command processing” on page 103

Installing operating system installation servers

Use this procedure to install the operating system installation servers that are used with diagnostic tools.

This procedure is performed by the customer.

While there is reference to installing operating system installation servers, this procedure concentrates on the need for diagnostic service using an operating system installation server.

In particular, diagnostic tools for Power Systems servers are available only in the AIX operating system. You need an AIX Shared Product Object Tree (SPOT), even if you are running another operating system in your logical partitions on servers with no removable media (CD or DVD) capabilities.

Before proceeding, obtain documentation on the server, network installation management (NIM) server, and distribution server. The following documents can be used:
• Server installation guide for the operating system installation server (AIX NIM or Linux distribution server)
• For the NIM server, obtain installation information from AIX documentation
• For the Linux server, obtain distribution documentation

Depending on where you install the operating system installation server, you might also need the documentation for “Installing the CSM Management Server” on page 89.

Note: The tasks have reference labels to use as cross-references between figures and procedures. The first is from Figure 11 on page 85 and the second is from Figure 10 on page 54. For example, E1 (M1) indicates task label E1 in the figure and task label M1 in the High-level cluster installation flow topic.

1. I1 (M1): Physically place the NIM (for AIX) and distribution servers (for Linux) on the data center floor.
2. Do not proceed until you have started the Dynamic Host Configuration Protocol (DHCP) server on the Cluster Systems Management/Management Server (CSM/MS) as described in “Installing the CSM Management Server” on page 89.
3. I2 (M2): If you plan to have servers with no removable media (CD or DVD) capabilities, build an AIX NIM SPOT on your chosen server to enable diagnostics. See the NIM information in the AIX documentation.


   Note: Because the diagnostics are available only in the AIX operating system, you need an AIX SPOT, even if you are running another operating system in your logical partitions.

4. I2 (M2): If you have servers with no removable media (CD or DVD) capabilities, and you are using Linux in your partitions, install a distribution server.
5. I3 (M4): Cable the operating system installation servers to the cluster virtual local area network (VLAN), not to the service VLAN.

Installing the fabric management server

The installation of the fabric management server is performed by the customer.

The fabric management server provides the following functions, which are installed and configured in this procedure:
• Host-based Fabric Manager function
• Fast Fabric Toolset

Note: This procedure is written from the perspective of installing a single fabric management server. Using the instructions in the Fast Fabric Toolset Users Guide, you can use the ftpall command to copy common configuration files from the first fabric management server to other fabric management servers. Use care with the subnet manager configuration files because certain parameters, such as the global identifier (GID) prefix, are not common between all fabric management servers.

Before proceeding with this procedure, obtain the following documentation:
• IBM System x 3550 or 3650 Installation Guide
• Linux distribution documentation
• Fabric Manager Users Guide
• QLogic InfiniServ host stack documentation
• Fast Fabric Toolset Users Guide

There is a point in this procedure that cannot be passed until the QLogic switches are installed, powered on, and configured, and the cluster virtual local area network (VLAN) Ethernet devices are configured and powered on. You need to coordinate with the teams performing those installation activities.

Use the following procedure for installing the fabric management server. It references QLogic documentation for detailed installation instructions. See the “QLogic Fabric Management work sheets” on page 72 that you completed during the planning phase for the cluster.

Note: The tasks have reference labels to use as cross-references between figures and procedures. The first is from Figure 11 on page 85 and the second is from Figure 10 on page 54. For example, E1 (M1) indicates task label E1 in the figure and task label M1 in the High-level cluster installation flow topic.

1. F1 (M1): Physically place the fabric management server on the data center floor.
2. F2 (M2): Install and configure the operating system on the fabric management server.
3. F3 (M2): If you are connecting the fabric management server or servers to a public Ethernet network (not the service, nor the cluster VLAN), do so at this time.
4. F4 (M2): Install and cable the host channel adapters (HCAs) in the fabric management servers. The HCAs must be installed before proceeding to the next step. Cabling of the HCAs to the fabric can wait, but do not start the Fabric Manager software until the fabric management server HCAs have been cabled to the fabric.
5. F5 (M2): To install the QLogic InfiniServ host stack and Fast Fabric Toolset, use the InfiniServ Fabric Access Software Users Guide. The following items are the key steps to the installation:
   a. Untar the InfiniServ tar file.
   b. Run the INSTALL script by using the appropriate flags as described in the QLogic documentation.


      Note: Either do not enable IPoIB on the fabric management server, or do not install the IPoIB capability at all. Otherwise, IPoIB on the fabric management server might negatively affect the multicast groups by setting up groups that are not valid for the compute servers and I/O servers on the fabric.

   c. Restart the InfiniServ stack.
6. F5 (M2): Set up the Fast Fabric Toolset by completing the following tasks:
   a. Configure the Fast Fabric Toolset according to the instructions in the Fast Fabric Toolset Users Guide. When configuring the Fast Fabric Toolset, consider the following application of the Fast Fabric Toolset within high-performance computing (HPC) clusters:
      • The master node, referred to in the Fast Fabric Toolset Users Guide, is considered to be the Fast Fabric Toolset host in IBM HPC clusters.
      • You do not have to set up rsh and ssh access to the servers from the Fast Fabric Toolset host.
      • You do not use the message-passing-interface (MPI) performance tests, because they are not compiled for the IBM host stack.
      • HPL is not applicable.
      • You generally only use parameters that list the switch chassis.
      • You never issue commands to hosts.

   b. Update the following Fast Fabric configuration files. These files list the switch and Fabric Manager servers that make up the fabric. This function provides the ability to report and process commands across the fabric concurrently.
      • The /etc/sysconfig/iba/chassis file must have the list of all the switch chassis in the fabric. Each chassis is listed on a separate line of the file. You can use either the IP address or the resolvable host name for the chassis address.
      • If you planned for groups of switches, create a file for each group.
      • The /etc/sysconfig/iba/hosts file contains a list of all the fabric management servers.
      • If you planned for groups of fabric management servers, create a file for each group.
      • Set up the /etc/sysconfig/fastfabric.conf file with the appropriate FF_ALL_ANALYSIS and FF_FABRIC_HEALTH environment variable values. These values include the fabric, chassis, and subnet manager analysis. The subnet manager analysis depends on the type of subnet manager you are using. There is a commented entry for FF_ALL_ANALYSIS that includes all possible analysis tools. You only need the hostsm or esm (embedded subnet manager) entry.
        – If you have a host-based subnet manager, edit the entry to look similar to the following example:
          export FF_ALL_ANALYSIS="${FF_ALL_ANALYSIS:-fabric chassis hostsm}"
        – If you have an embedded subnet manager, edit the entry to look similar to the following example:
          export FF_ALL_ANALYSIS="${FF_ALL_ANALYSIS:-fabric chassis esm}"
        – Using a pattern that matches the names of your switches, set up the FF_FABRIC_HEALTH variable. The following example shows that the default names were left in place. The default names begin with SilverStorm. It also removes the errors that exceed threshold:
          export FF_FABRIC_HEALTH="${FF_FABRIC_HEALTH:- -s -o errors -o slowlinks -F nodepat:SilverStorm*}"
      • Also, if applicable, ensure that the /etc/sysconfig/iba/esm_chassis file has the list of switch IP addresses for switches that are running the embedded subnet manager.

   c. Ensure that the /etc/sysconfig/iba/ports file has a list of ports on the fabric management server. The format is a single line that lists the HCA ports on the fabric management server that are attached to the subnets. There should be one port for each subnet. The format for identifying a port is [hca]:[port]. If four ports are connected, the ports file should have a single line:
      1:1 1:2 2:1 2:2
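The single-line [hca]:[port] format can be generated rather than typed by hand. The following POSIX shell sketch assumes that every port on every HCA is cabled to its own subnet, which might not match your fabric; the function name `ports_line` is hypothetical.

```shell
# Print a one-line ports list in hca:port form.
# Arguments: number of HCAs, number of ports on each HCA.
ports_line() {
    hcas=$1
    ports_per_hca=$2
    line=""
    hca=1
    while [ "$hca" -le "$hcas" ]; do
        port=1
        while [ "$port" -le "$ports_per_hca" ]; do
            line="${line:+$line }${hca}:${port}"   # add a space only between entries
            port=$((port + 1))
        done
        hca=$((hca + 1))
    done
    echo "$line"
}

ports_line 2 2   # prints: 1:1 1:2 2:1 2:2
```

The output for two HCAs with two ports each matches the four-port example in step c. If only some ports are attached to subnets, list those ports explicitly instead.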


d. Ensure that the tcl and Expect code is installed on the fabric management server. The code should be at least at the following levels. You can check them by using the rpm -qa | grep expect and rpm -qa | grep tcl commands.
   • expect-5.43.0-16.2
   • tcl-8.4.12-16.2

e. If this is the primary data collection point for fabric diagnosis, ensure that this is noted. One method would be to add this information to the /etc/motd file.
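
The single-line format of the ports file described in step c can be illustrated with a short sketch. The following Python helper is not part of the Fast Fabric Toolset. It assumes that HCAs and ports are numbered from 1 and that every port is attached to a subnet, which matches the four-port example in step c; adjust it if only some ports are cabled.

```python
def ports_line(hca_count, ports_per_hca):
    """Return the single line for the ports file, one [hca]:[port] entry per port.

    Assumption: HCAs and ports are numbered from 1 and all ports are in use.
    """
    entries = [
        f"{hca}:{port}"
        for hca in range(1, hca_count + 1)
        for port in range(1, ports_per_hca + 1)
    ]
    return " ".join(entries)


if __name__ == "__main__":
    # Two HCAs with two ports each, as in the four-port example.
    print(ports_line(2, 2))
```

The resulting string can be written as the only line of the /etc/sysconfig/iba/ports file.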

7. F6 (M2): If you are using a host-based Fabric Manager, install it by using the Fabric Manager Users Guide. The following RPMs and utilities are helpful to install:
   a. iview_agent-4_2_0_0_xx.rpm (4_2_0_0_xx refers to the level of the agent code)
   b. iview_fm-4_2_0_0_xx.rpm (4_2_0_0_xx refers to the level of the Fabric Manager code)
   c. sm_query (sm_query is a utility to obtain information from the subnet manager)

   Note: Do not start the Fabric Managers until the switch fabric is installed and cabled completely. Otherwise, you cause unnecessary log activity from the Fabric Manager, which could cause confusion when you try to verify fabric operation.

   d. Run the iview_fm stop command to ensure that the subnet manager is stopped until it is required.
      Verify that the subnet manager is stopped by running the ps -ef | grep iview command.

8. F6 (M2): Configure the host-based Fabric Manager by updating the iview_fm.config file by using the Fabric Manager Users Guide.
   There is a separate instance of the various fabric management components running to manage each subnet. In the iview_fm.config file, configure each instance of each component.
   a. At the beginning of the parameter settings in the iview_fm.config file, you must configure each component of each instance of the Fabric Manager to start when you start the Fabric Manager. In the following settings, each attribute has the form <component>_X_<attribute>, where <component> is BM, FE, PM, or SM, and X is the instance on the fabric management server. To see an example of how these parameters would look for the iview_fm.config file that is used for managing four subnets, see “Example: Setting up of host-based fabric manager” on page 43.
      BM_X_start=yes
      FE_X_start=yes
      PM_X_start=yes
      SM_X_start=yes

      Note: Any instances that are not in use should be set to start=no, such as SM_2_start=no.
   b. Point to the correct HCA for each Fabric Manager instance:
      SM_X_device=<hca>
      PM_X_device=<hca>
      BM_X_device=<hca>
      FE_X_device=<hca>
   c. Point to the correct port on the HCA for each Fabric Manager instance:
      SM_X_port=<hca port>
      PM_X_port=<hca port>
      BM_X_port=<hca port>
      FE_X_port=<hca port>
   d. Set the priority for each Fabric Manager instance:
      SM_X_priority=<priority>
      PM_X_priority=<priority>
      BM_X_priority=<priority>
      FE_X_priority=<priority>
   e. For LMC=2, use SM_X_lmc=2
   f. For the maximum transfer unit (MTU), use the value that you calculated in “Planning for maximum transfer units (MTUs)” on page 34.
      SM_X_def_mc_mtu=0x5 # 0x4=2 KB; 0x5=4 KB


   g. For the MTU rate, use the value that you calculated in “Planning for maximum transfer units (MTUs)” on page 34.
      SM_X_def_mc_rate=0x6 # 0x3 for SDR; 0x6 for DDR
   h. For the global identifier (GID) prefix, use SM_X_gidprefix=<GID prefix value>
   i. For a node appearance or disappearance threshold of 10, use SM_X_node_appearance_msg_thresh=10
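
Because the same four settings repeat for every component of every instance, it can help to generate them rather than type them. The following Python sketch is only an illustration of the naming pattern in step 8; it is not a tool that ships with the Fabric Manager, and the instance numbers, HCA, port, and priority values in the example are hypothetical placeholders. Take the real values from your planning work sheets.

```python
# Component prefixes that appear in the iview_fm.config settings in step 8.
COMPONENTS = ("SM", "PM", "BM", "FE")


def instance_settings(instance, hca, port, priority, in_use=True):
    """Return iview_fm.config lines for one Fabric Manager instance.

    The hca, port, and priority values are placeholders supplied by the caller.
    """
    lines = []
    for component in COMPONENTS:
        prefix = f"{component}_{instance}_"
        if not in_use:
            # Instances that are not in use are only switched off.
            lines.append(f"{prefix}start=no")
            continue
        lines.append(f"{prefix}start=yes")
        lines.append(f"{prefix}device={hca}")
        lines.append(f"{prefix}port={port}")
        lines.append(f"{prefix}priority={priority}")
    return lines


if __name__ == "__main__":
    # Hypothetical layout: instance 0 is in use, instance 2 is not.
    for line in instance_settings(0, hca=1, port=1, priority=1):
        print(line)
    for line in instance_settings(2, hca=0, port=0, priority=0, in_use=False):
        print(line)
```

The sketch covers only the start, device, port, and priority attributes. The SM-only attributes (lmc, def_mc_mtu, def_mc_rate, gidprefix, and node_appearance_msg_thresh) still need to be set as described in steps e through i.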

9. Cable the fabric management server to the InfiniBand fabric.

Note: The switches must have been installed as instructed in “Installing and configuring vendor InfiniBand switches” on page 116.

10. F7 (M2): Use a static IP address for the cluster VLAN for the fabric management servers. Assign and configure this address. This is required for remote logging to and remote command processing from the Cluster Systems Management/Management Server (CSM/MS).

11. F8 (M3): Cable the fabric management server to the cluster VLAN. It must be on the same VLAN as the switches.

12. Before proceeding, ensure that the fabric management server is cabled to the InfiniBand fabric and that the switches are powered on.

13. F9 (M4): Perform final fabric management server configuration and verification:
   a. If you are using a host-based subnet manager, make sure that the embedded subnet managers are not running (unless you plan to use both):
      i. Run the cmdall -C 'smControl status' command.
      ii. If one or more embedded subnet managers are running, stop them by using the cmdall -C 'smControl stop' command.
      iii. Ensure that the embedded subnet manager does not start on reboot by using the cmdall -C 'smConfig startAtBoot no' command.
   b. If you are using a host-based subnet manager, enable and start the Fabric Manager by using instructions from the Fabric Manager Users Guide. The key commands are as follows:
      i. /etc/init.d/iview_fm enable
      ii. /etc/init.d/iview_fm start
   c. Verify correct security configuration for switches by ensuring that each switch has the required user and password enabled.
      i. Run the cmdall -C 'loginMode' command.
      ii. If the return value is not set to 0, enable it.
      iii. Run the cmdall -C 'loginMode 0' command.

14. Set up passwordless SSH communication between the fabric management server and the switches and other fabric management servers. If this is not wanted, you need to set up password information for the Fast Fabric Toolset; skip to step 15 on page 95.
   a. Generate the key on the fabric management server. Depending on local security requirements, you typically do this for the root user on the fabric management server. Typically, you use the /usr/bin/ssh-keygen -t rsa command.
   b. Set up secure communication from the fabric management server to the switches by using the following instructions:
      1) Exchange the key by using the cmdall -C 'sshKey add "Fabric/MS key"' command, where Fabric/MS key is the key.

         Note: The key is in the ~/.ssh/id_rsa.pub file. Use the entire contents of the file as the Fabric/MS key. Remember to type double quotation marks around the key, and single quotation marks around the entire sshKey add command.
      2) Ensure that the following information is in the /etc/sysconfig/fastfabric.conf file:
         export FF_LOGIN_METHOD="${FF_LOGIN_METHOD:-ssh}"
   c. Set up secure communication between the fabric management servers by using one of the following methods:
      • Use the setup_ssh command in the Fast Fabric Toolset.


      • Use the Fast Fabric Toolset iba_config menu. Select Fast Fabric → Host setup → Setup Password-less ssh/scp.
      • Use typical key exchange methods between Linux servers.
15. If you chose not to set up passwordless SSH from the fabric management server to switches and to other fabric management servers, you must update the /etc/sysconfig/fastfabric.conf file with the correct password for admin. The following procedure uses the password xyz. For detailed instructions, see the Fast Fabric Users Guide.
   a. Edit the /etc/sysconfig/fastfabric.conf file and ensure that the following lines are included in the file and are not commented out. FF_LOGIN_METHOD and FF_PASSWORD are used for fabric management server access. FF_CHASSIS_LOGIN_METHOD and FF_CHASSIS_ADMIN_PASSWORD are used for switch chassis access.
      export FF_LOGIN_METHOD="${FF_LOGIN_METHOD:-telnet}"
      export FF_PASSWORD="${FF_PASSWORD:-}"
      export FF_CHASSIS_LOGIN_METHOD="${FF_CHASSIS_LOGIN_METHOD:-telnet}"
      export FF_CHASSIS_ADMIN_PASSWORD="${FF_CHASSIS_ADMIN_PASSWORD:-xyz}"
   b. Run the chmod 600 /etc/sysconfig/fastfabric.conf command. This command ensures that only the root user can use the Fast Fabric tools and recognize the updated password.
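
Because the file now holds a password, it can be worth confirming that the chmod 600 command took effect. The following Python sketch is one way to check the permission bits; it is not part of the Fast Fabric Toolset, and the demo function works on a temporary file rather than on the real /etc/sysconfig/fastfabric.conf file.

```python
import os
import stat
import tempfile


def is_owner_only(path):
    """Return True if the file mode is exactly 0600 (owner read and write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


def demo():
    """Check a temporary file before and after restricting its permissions."""
    with tempfile.NamedTemporaryFile() as handle:
        os.chmod(handle.name, 0o644)
        before = is_owner_only(handle.name)
        os.chmod(handle.name, 0o600)
        after = is_owner_only(handle.name)
    return before, after


if __name__ == "__main__":
    print(demo())
```

On the fabric management server, the same check would be is_owner_only("/etc/sysconfig/fastfabric.conf"), run as the root user.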

16. It is a good practice to enter the configuration information for the server in its /etc/motd file. Use the information from the “QLogic Fabric Management work sheets” on page 72.

17. If you want to monitor the fabric by running the health check on a regular basis, review “Setting up periodic fabric health checking” on page 136. Do not set up regular health checks until the fabric has been installed and verified.

Related concepts

“Planning the Fabric Manager and Fabric Viewer” on page 40
Use this information to plan for the Fabric Manager and the Fabric Viewer.
“Fabric manager” on page 20
The Fabric Manager is used to complete basic operations such as fabric discovery, fabric configuration, fabric monitoring, fabric reconfiguration after failure, and reporting problems.

Setting up remote logging

Remote logging to the Cluster Systems Management/Management Server (CSM/MS) helps you monitor clusters by consolidating logs to a central location.

This procedure involves setting up remote logging from the following locations to the CSM/MS:
• The fabric management server
• InfiniBand switches
• Hardware Management Console (HMC)

Figure 12 on page 96 shows tasks L1 through L6 for setting up remote logging. It also shows how the remote logging setup tasks relate to the key tasks illustrated in Figure 11 on page 85.


Setting up the CSM/MS:

1. Ensure you have completed the following tasks:
   a. H6: The HMCs have been installed and cabled to the service virtual local area network (VLAN) in Figure 11 on page 85.
   b. CM4: The CSM/MS has been installed and cabled to the service and cluster VLANs.
   c. F8: The fabric management server has been installed and cabled to the cluster VLAN.
   d. W3: The switches have been installed and cabled to the cluster VLAN.
   e. E2: The Ethernet devices for the service and cluster VLANs have been installed and cabled.
2. L1 (M4): Select from the following options to set up remote logging and event management for the fabric on the CSM/MS.
   • If the CSM/MS is running the AIX operating system, go to “Using the remote syslog command and event management for CSM on AIX” on page 97.
   • If the CSM/MS is running the Linux operating system, go to “Using the remote syslog command and event management for CSM on Linux” on page 98.

   Notes:
   • Even if you do not plan to use CSM, the remote syslog command setup instructions are useful to consolidate subnet manager and switch logs into one location.
   • It is assumed that the fabric management server setup for the remote syslog command has already been done.
   • It is assumed that the CSM/MS is not defined as a managed node. It is assumed that administrators who set up the CSM/MS as a managed node are experienced and can modify this procedure to accommodate their configuration. The key is to monitor the /var/log/csm/syslog.fabric.notices file by using a sensor and setting up a condition to monitor that sensor and direct the log entries to the /var/log/csm/errorlog/[CSM/MS hostname] file.

Figure 12. Set up remote logging

Using the remote syslog command and event management for CSM on AIX:

You point the syslogd command to one or two files in which to place the remote syslog entries.

Note: It is assumed that you are using the syslogd command on the CSM/MS. If you are using another syslog application, such as syslog-ng, you might have to set up things differently, but these instructions can be useful in understanding how to set up the syslog configuration.
1. Log on to the CSM/MS that is running the AIX operating system as the root user.
2. Edit the /etc/syslog.conf file to direct the syslog entries to a file to be monitored by CSM event management. The basic format of the line is [facility].[min. priority] [destination]. If you are using syslog-ng, you need to adjust the format to accomplish the same type of function.
   Add the following lines, so that local6 facilities (used by the subnet manager and the switch) with log entry priorities (severities) of Info or higher (for example, Notice, Warning, or Error) are directed to a log file for debug purposes. The disadvantage of this is that /var must be monitored more closely so that it does not fill up. If you cannot maintain the /var log, you can omit this line.
   # optional local6 info and above priorities in another file
   local6.info /var/log/csm/syslog.fabric.info

   Note: You can use different file names, but you must record them and update the rest of the procedure steps with the new names.

3. Run a touch command on the output files, because the syslog command does not create them on its own. To run the touch command, perform the following steps:
   a. Run the touch /var/log/csm/syslog.fabric.notices command.
   b. Run the touch /var/log/csm/syslog.fabric.info command.

4. Refresh the syslog daemon by using the refresh -s syslogd command.
5. Set up a sensor for the syslog.fabric.notices file by copying the default and changing the default priority filter and monitored file.
   To set up a sensor file, perform the following steps:
   a. Run the lsrsrc -i -s "Name='AIXSyslogSensor'" IBM.Sensor > /tmp/AIXSyslogSensorDef command.
   b. Modify the /tmp/AIXSyslogSensorDef file by updating the Command attribute to
      /opt/csm/csmbin/monaixsyslog -p local6.notice -f /var/log/csm/syslog.fabric.notices
   c. Remove the old sensor by using the rmsensor AIXSyslogSensor command.
   d. Create the new sensor and keep its scope local by using the CT_MANAGEMENT_SCOPE=0 mkrsrc -f /tmp/AIXSyslogSensorDef IBM.Sensor command.
   e. To set up local management scope to avoid receiving an error indicating that the node (CSM/MS) is not in the NodeNameList, perform the following steps:
      1) Run the following command:
         /opt/csm/csmbin/monaixsyslog -f /var/log/csm/syslog.fabric.notices -p local6.notice
      2) Wait approximately 2 minutes and check the /etc/syslog.conf file. The sensor might have placed the following line in the file. The default cycle for the sensor is to check the files every 60 seconds. The first time it runs, it recognizes that it needs to set up the syslog.conf file with the following entry:
         local6.notice /var/log/csm/syslog.fabric.notices rotate size 4m files 1

6. Set up the condition for the sensor and link a response to it by performing the following steps:

   Note: The method documented here is for a CSM/MS that has not been defined as a managed node. If the CSM/MS is defined as a managed node, you do not set the scope of the condition to be local.
   a. Create a copy of the prepackaged AIXNodeSyslog condition and set the ManagementScope parameter to local (l for local) by using the following command:
      mkcondition -c AIXNodeSyslog -m l LocalAIXNodeSyslog
   b. Link a response by using the startcondresp LocalAIXNodeSyslog LogNodeErrorLogEntry command.

      Note: The /var/log/csm/errorlog/[CSM/MS hostname] file is not created until the first event comes through.
   c. If you want to broadcast (wall) the events to the system console, enter the startcondresp LocalAIXNodeSyslog BroadcastEventsAnyTime command.

      Note: Using the BroadcastEventsAnyTime response results in many events being broadcast to the console when servers are rebooted.
   d. If you want to create any other response scripts, use a similar format for the startcondresp command after creating the appropriate response script. For details, see the CSM Reference Guide and RSCT Reference Guide.

   Note: If there are problems with the event management from this point forward, and you have to remake the AIXSyslogSensor sensor, you need to follow the procedure in “Reconfiguring the CSM event management” on page 196.

7. Continue with “Setting up a remote log for the fabric management server” on page 100.
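
The [facility].[min. priority] selector format used in step 2 explains why an Info entry reaches the syslog.fabric.info file but not the syslog.fabric.notices file. The following Python sketch models that rule under the usual syslog assumption that a selector matches its own priority and every higher one. The priority list uses the standard syslog keywords; it is an illustration, not a description of how the syslogd command is implemented.

```python
# Standard syslog priorities in increasing order of severity.
PRIORITIES = ["debug", "info", "notice", "warning", "err", "crit", "alert", "emerg"]


def selector_matches(selector, facility, priority):
    """Return True if an entry with this facility and priority matches the selector."""
    sel_facility, sel_priority = selector.split(".", 1)
    if sel_facility not in ("*", facility):
        return False
    if sel_priority == "*":
        return True
    # A selector matches its minimum priority and anything more severe.
    return PRIORITIES.index(priority) >= PRIORITIES.index(sel_priority)


if __name__ == "__main__":
    for entry_priority in ("info", "notice", "err"):
        print(
            entry_priority,
            selector_matches("local6.info", "local6", entry_priority),
            selector_matches("local6.notice", "local6", entry_priority),
        )
```

In this model, a local6 entry with Notice priority matches both selectors, while a local6 entry with Info priority matches only local6.info.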

Using the remote syslog command and event management for CSM on Linux:

You point the syslogd command to a first in, first out (FIFO) file for serviceable events and to a file for informational events.

Note: This procedure documents the use of the syslog-ng command, which is the default syslogd command for SLES 10 SP1 and higher. If the level of the Linux operating system on the CSM/MS is using the syslog command instead of the syslog-ng command, see “Using syslog for CSM/MS on RedHat Linux” on page 103. When you return to this procedure, continue with step 8 on page 99.
1. Log on to the CSM/MS that is running the Linux operating system as the root user.
2. Edit the configuration file for the syslogd command so that it directs entries coming from the fabric management server and the switches to an appropriate file.

   Note: Log entries with a priority (severity) of Info or lower are logged to the default location /var/log/messages.
   To edit the /etc/syslog-ng/syslog-ng.conf file, perform the following steps:
   a. Edit the /etc/syslog-ng/syslog-ng.conf file.
   b. Add the following lines to the end of the file:
      # Fabric Notices from local6 into a FIFO/named pipe
      filter f_fabnotices { facility(local6) and level(notice, alert, warn, err, crit) and not filter(f_iptables); };

   When the sensor is created in a later step, it adds the lines to the /etc/syslog-ng/syslog-ng.conf file that are required to direct the entries to a particular log file.


3. Ensure that udp(ip("0.0.0.0") port(514)); is in the source reference (SRC) stanza and that it is not commented out. You must use User Datagram Protocol (UDP) to receive logs from switches and the fabric management server.
   The ip("0.0.0.0") entry indicates that the server allows entries from any IP address. For added security, you might want to specify each switch and fabric management server IP address on a separate line. You must use the appropriate protocol as defined previously.
   udp(ip("192.9.3.42") port(514));
   udp(ip("192.9.3.50") port(514));
4. Configure the AppArmor application to allow the syslog-ng command to access the named pipe file (/var/log/csm/syslog.fabric.notices) to which the remote syslog entries are directed. The syslog-ng command requires read-write permission to named pipes. To configure the AppArmor application, perform the following steps:
   a. Edit the syslog-ng file for the AppArmor application: /etc/apparmor.d/sbin.syslog-ng
   b. Add “/var/log/csm/syslog.fabric.notices wr,”, just before the closing brace, “}”, in the /sbin/syslog-ng stanza. For example:
      /sbin/syslog-ng {
      #include <abstractions/base>
      ...
      /var/run/syslog-ng.pid w,
      /var/log/csm/syslog.fabric.notices wr,
      }

5. Restart the AppArmor application by using the /etc/init.d/boot.apparmor restart command.
6. Set up a sensor for the syslog.fabric.notices file by copying the default and changing the default priority filter and monitored file by using the following steps:
   a. Run the lsrsrc -i -s "Name='ErrorLogSensor'" IBM.Sensor > /tmp/ErrorLogSensorDef command.
   b. Modify the /tmp/ErrorLogSensorDef file by updating the Command attribute to
      Command="/opt/csm/csmbin/monerrorlog -f /var/log/csm/syslog.fabric.notices -p f_fabnotices"
   c. Remove the old sensor by using the rmsensor ErrorLogSensor command.
   d. Create the new sensor and keep its scope local by using the CT_MANAGEMENT_SCOPE=0 mkrsrc -f /tmp/ErrorLogSensorDef IBM.Sensor command.
   e. Run the /opt/csm/csmbin/monerrorlog -f "/var/log/csm/syslog.fabric.notices" -p "f_fabnotices" command.
      Notice that the -p parameter points to the f_fabnotices entry that was defined in the /etc/syslog-ng/syslog-ng.conf file.

7. If you receive an error from the monerrorlog command that indicates there is a problem with the syslog command, there is probably a typographical error in the /etc/syslog-ng/syslog-ng.conf file. The error message includes the word syslog, similar to the following example, where * is a wildcard.
   monerrorlog: * syslog *

   To recover from this error, perform the following steps:
   a. Look for the typographical error in the /etc/syslog-ng/syslog-ng.conf file by reviewing the previous steps that you have taken to edit the syslog-ng.conf file.
   b. Remove the destination and log lines from the end of the syslog-ng.conf file.
   c. Rerun the /opt/csm/csmbin/monerrorlog -f "/var/log/csm/syslog.fabric.notices" -p "f_fabnotices" command.
   d. If you receive another error, examine the file again and repeat step 7.

8. Check the /etc/syslog-ng/syslog-ng.conf file to ensure that the sensor set it up correctly. The following lines might be at the end of the file:

   Note: Because a generic CSM command is being used for the InfiniBand cluster, the monerrorlog command uses a different name from fabnotices_fifo in the destination and log entries. It is a pseudo-random name that looks similar to fifonfJGQsBw.
   filter f_fabnotices { facility(local6) and level(notice, alert, warn, err, crit) and not filter(f_iptables); };
   destination fabnotices_fifo { pipe("/var/log/csm/syslog.fabric.notices" group(root) perm(0644)); };
   log { source(src); filter(f_fabnotices); destination(fabnotices_fifo); };

9. Set up the condition for the previous sensor and link a response to it by performing the following steps.

   Notes:
   • The method you use depends on whether the CSM/MS is defined as a managed node.
   • The method documented here is for a CSM/MS that has not been defined as a managed node. If the CSM/MS is defined as a managed node, you do not set the scope of the condition to be local.
   a. Make a copy of the prepackaged condition AnyNodeAnyLoggedError, and set the ManagementScope to local (l for local):
      mkcondition -c AnyNodeAnyLoggedError -m l LocalNodeAnyLoggedError
   b. Link a response to the condition that can log entries by using the startcondresp LocalNodeAnyLoggedError LogNodeErrorLogEntry command.
   c. The previous condition-response link logs node error log entries to the /var/log/csm/errorlog/[CSM/MS hostname] file on the CSM management server.
   d. If you want to broadcast (wall) the events to the system console, enter the startcondresp LocalNodeAnyLoggedError BroadcastEventsAnyTime command.

      Note: Using the BroadcastEventsAnyTime response results in many events being broadcast to the console when servers are rebooted.
   e. If you want to create any other response scripts, use a similar format for the startcondresp command after creating the appropriate response script. For details, see the CSM Reference Guide and RSCT Reference Guide.

10. Continue with “Setting up a remote log for the fabric management server.”
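
Step 3 of this procedure suggests listing each switch and fabric management server address on its own udp() line instead of accepting entries from 0.0.0.0. With many switches, those lines can be generated from a list. The following Python sketch assumes only the udp(ip("...") port(514)); form shown in step 3; the helper name is hypothetical, and the address check uses the standard ipaddress module.

```python
import ipaddress


def udp_source_lines(addresses, port=514):
    """Return one syslog-ng udp() source line per sender address."""
    lines = []
    for address in addresses:
        # Raises ValueError if the string is not a valid IP address.
        ipaddress.ip_address(address)
        lines.append(f'udp(ip("{address}") port({port}));')
    return lines


if __name__ == "__main__":
    # The two addresses from the example in step 3.
    for line in udp_source_lines(["192.9.3.42", "192.9.3.50"]):
        print(line)
```

The generated lines belong inside the source stanza of the /etc/syslog-ng/syslog-ng.conf file, in place of the 0.0.0.0 entry.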

Setting up a remote log for the fabric management server:

1. L2 (M4): Point to the CSM/MS as a remote syslog server from the fabric management server by completing the following steps:

   Note: It is assumed that you are using the syslogd command on the CSM/MS. If you are using the syslog-ng command, you might need to set up the remote log differently. However, the following instructions can be useful to gain an understanding of how to set up the syslog configuration.
   a. Do not proceed until you have installed, configured, and cabled the fabric management server to the service VLAN. For details, see the “Installing the fabric management server” on page 91 topic. You also must have installed, configured, and cabled the CSM/MS as documented in the “Installing the CSM Management Server” on page 89 topic.
   b. Log on to the fabric management server.
   c. Edit the /etc/syslog.conf file by performing one of the following steps:

      Note: Some Linux levels use the /etc/syslog-ng/syslog-ng.conf file.
      • If you are using the syslog command, add the following lines to the end of the file. Remove the brackets when you enter the CSM/MS IP address.
        # send IB SM logs to CSM/MS ("CSM IP-address")
        local6.* @CSM/MS IP-address
      • If you are using the syslog-ng command, add the following text to the end of the file. Use udp as the transfer protocol. You must configure the syslog-ng command on the CSM/MS to accept one or the other, or both.


        # Fabric Info from local6 to CSM/MS ("CSM IP-address")
        filter f_fabinfo { facility(local6) and level(info, notice, alert, warn, err, crit) and not filter(f_iptables); };
        destination fabinfo_csm { udp("CSM/MS IP-address" port(514)); };
        log { source(src); filter(f_fabinfo); destination(fabinfo_csm); };

      Note: If you want to log to more than one CSM/MS, or to another server, ensure that you change the destination handle of the statement for each instance, and then refer to a different one for each log statement. For example, fabinfo_csm1 and fabinfo_csm2 would be good handles for logging to different CSM/MSs.

   d. Restart the syslog daemon by using the /etc/init.d/syslog restart command. If the syslog daemon is not running, use the /etc/init.d/syslog start command.
      You have now set up the fabric management server to remotely log to the CSM/MS.

2. Continue with “Setting up remote logging for InfiniBand switches.”
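
When logging to more than one CSM/MS, each destination needs its own handle and its own log statement, as the note in step c explains. The following Python sketch shows one way to produce those pairs with the fabinfo_csm1, fabinfo_csm2 naming suggested in the note. It assumes that the f_fabinfo filter from step c is already in the file, and the addresses in the example are hypothetical documentation addresses, not values from this procedure.

```python
def remote_log_stanzas(csm_addresses, port=514):
    """Return syslog-ng destination and log lines, one pair per CSM/MS address."""
    lines = []
    for index, address in enumerate(csm_addresses, start=1):
        # A distinct handle per destination, for example fabinfo_csm1.
        handle = f"fabinfo_csm{index}"
        lines.append(f'destination {handle} {{ udp("{address}" port({port})); }};')
        lines.append(f"log {{ source(src); filter(f_fabinfo); destination({handle}); }};")
    return lines


if __name__ == "__main__":
    # Hypothetical CSM/MS addresses.
    for line in remote_log_stanzas(["192.0.2.10", "192.0.2.11"]):
        print(line)
```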

Setting up remote logging for InfiniBand switches:

1. L3 (M4): Point the switch logs to the CSM/MS by completing the following steps:
   a. Do not proceed until you have installed, configured, and cabled the fabric management server to the service VLAN as described in “Installing and configuring vendor InfiniBand switches” on page 116. You must also have installed, configured, and cabled the CSM/MS as described in “Installing the CSM Management Server” on page 89.
   b. Use the switch documentation to point the switch to a remote syslog server by using the command line or the Chassis Viewer.
      • To use the switch command line, or the cmdall command in the Fast Fabric Toolset, run the logSyslogConfig -h csm_ip_address -f 22 -p 514 -m 1 command.
      • To use the Chassis Viewer, on the Syslog Host tab use the IP address of the CSM/MS and point to Port 514. You must do this for each switch individually.
   c. Ensure that all priority logging levels with a severity higher than Info are set to LOG by using the logShowConfig command on the switch command line or by using the Chassis Viewer to look at the log configuration. If you need to turn on Info entries, use the following methods.
      • On the switch command line, use the logConfigure command and follow the instructions on the display.
      • In Chassis Viewer, use the Log Configuration window.

      Note: The switch command line and Chassis Viewer do not necessarily list the log priorities with respect to severity. Ensure that the logShowConfig command produces a result similar to the following example, where Dump, Fatal, Error, Alarm, Warning, Partial, Config, Periodic, and Notice are enabled. The following example has Info enabled as well, but that is optional.
      Configurable presets
      index : name     : state
      ------------------------------
      1     : Dump     : Enabled
      2     : Fatal    : Enabled
      3     : Error    : Enabled
      4     : Alarm    : Enabled
      5     : Warning  : Enabled
      6     : Partial  : Enabled
      7     : Config   : Enabled
      8     : Info     : Enabled
      9     : Periodic : Enabled
      15    : Notice   : Enabled
      10    : Debug1   : Disabled
      11    : Debug2   : Disabled
      12    : Debug3   : Disabled
      13    : Debug4   : Disabled
      14    : Debug5   : Disabled


   You have now set up the switches to remotely log to the CSM/MS.
2. Continue with “Setting up remote logging on the HMC.”
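
Checking the logShowConfig output by eye on every switch is error prone, so the comparison in step c can be scripted. The following Python sketch parses text in the index : name : state layout shown in the example and reports any required preset that is not enabled. It assumes that the switch output follows that layout exactly, and the sample text is copied from the example rather than captured from a switch.

```python
# Presets that step c requires to be enabled (Info is optional).
REQUIRED = {"Dump", "Fatal", "Error", "Alarm", "Warning",
            "Partial", "Config", "Periodic", "Notice"}

SAMPLE = """Configurable presets
index : name : state
------------------------------
1 : Dump : Enabled
2 : Fatal : Enabled
3 : Error : Enabled
4 : Alarm : Enabled
5 : Warning : Enabled
6 : Partial : Enabled
7 : Config : Enabled
8 : Info : Enabled
9 : Periodic : Enabled
15 : Notice : Enabled
10 : Debug1 : Disabled
"""


def disabled_required_presets(output):
    """Return the required preset names that are missing or not Enabled."""
    states = {}
    for line in output.splitlines():
        parts = [part.strip() for part in line.split(":")]
        # Data rows have three fields and start with a numeric index.
        if len(parts) == 3 and parts[0].isdigit():
            states[parts[1]] = parts[2]
    return {name for name in REQUIRED if states.get(name) != "Enabled"}


if __name__ == "__main__":
    print(disabled_required_presets(SAMPLE))
```

An empty result means that every required preset is enabled. Any name in the result needs to be turned on with the logConfigure command or in the Chassis Viewer.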

Setting up remote logging on the HMC:

To set up remote logging on the Hardware Management Console (HMC), perform the following steps:
1. L4 (M4): Set up serviceable event monitoring on the CSM/MS and the HMCs. See the CSM Installation Guide for instructions on serviceable event monitoring.

   Note: Serviceable event monitoring is useful when a cluster has more than one HMC.
2. Continue with “Verifying the remote logging setup.”

Verifying the remote logging setup:

Before you can verify the remote logging setup, you must have completed “Setting up the CSM/MS” on page 96 and “Setting up a remote log for the fabric management server” on page 100. To verify the remote logging setup, perform the following steps:
1. L5 (M4): Verify the remote syslog command and event management path from the fabric management server through to the /var/log/csm/errorlog/[CSM/MS hostname] file on the CSM/MS by completing the following steps:
   a. Log on to the fabric management server.
   b. Create a Notice level log and an Info level log. Replace XXX with your initials.
      logger -p local6.notice XXX: This is a Notice test from the fabric management server
      logger -p local6.info XXX: This is an Info test from the fabric management server
   c. Log on to the CSM/MS to see if the log is recorded. It might take 1 - 2 minutes before the event management sensor senses the log entry in the /var/log/csm/syslog.fabric.notices file on the CSM/MS.
   d. Check the /var/log/csm/errorlog/[CSM/MS hostname] file and verify that only the Notice entry was logged in it. The Info entry should not have been recorded in the syslog.fabric.notices file and, therefore, should not have been recognized by the sensor.
      If you have waited for 5 minutes and the Notice entry was not logged in the /var/log/csm/errorlog/[CSM/MS hostname] file, check the following items:
      • Review the previous setup instructions in this topic to ensure that they were performed correctly, paying close attention to the setup of the /etc/syslog.conf file (or syslog-ng.conf file).
      • Use the procedure in “Determining problems with event management or remote syslogging” on page 189. Recall that you were using the logger command such that the fabric management server would be the source of the log entry.
   e. Check the /var/log/csm/syslog.fabric.info file and verify that both the Notice entry and the Info entry are in the file. This step only applies if you have chosen to set up the syslog.fabric.info file.
      If one or both entries are missing, check the following items:
      • Review the previous setup instructions in this topic to ensure that they were performed correctly, paying close attention to the setup of the /etc/syslog.conf file (or syslog-ng.conf file).
      • Use the procedure in “Determining problems with event management or remote syslogging” on page 189. Recall that you were using the logger command such that the fabric management server would be the source of the log entry.
2. L6 (M4): Verify the remote syslog command from the switches to the CSM/MS by completing the following steps:
   a. Ping the switches from the CSM/MS to ensure that there is connectivity across the service VLAN. If the ping command fails, use standard techniques to debug Ethernet interface problems between the CSM/MS and the switches.


   b. Use the ibtest -C reboot command to reboot all the switch management spines. The ibtest -C reboot command creates a log. For details, see the Fast Fabric Toolset Users Guide.

   c. Log on to the CSM/MS to see if the log was recognized. It might take 1 - 2 minutes before the event management sensor senses the log entry in the /var/log/csm/syslog.fabric.notices file on the CSM/MS. The reboot of the switch chassis management causes every switch to send a log entry that has a Notice priority and text such as Switch chassis management software rebooted.

   d. Check the /var/log/csm/errorlog/[CSM/MS hostname] file and verify that only the Notice entry was logged in it.
      If you have waited for 5 minutes and the Notice entry was not logged in the /var/log/csm/errorlog/[CSM/MS hostname] file, check the following items:
      • Review the previous setup instructions in this topic to ensure that they were performed correctly, paying close attention to the setup of the /etc/syslog.conf file.
      • Use the procedure in “Determining problems with event management or remote syslogging” on page 189. Recall that you were using the ibtest command such that the switches would be the source of the log entry.

   e. Check the /var/log/csm/syslog.fabric.info file and verify that the Notice entry is in the file. The Notice entry appears only if you set up the syslog.fabric.info file.
      If the entry is missing, review the previous setup instructions in this topic for setting up remote syslogging from the switches to ensure that they were performed correctly.

   f. Use the procedure in “Determining problems with event management or remote syslogging” on page 189. Recall that you were using the ibtest command such that the switches were the source of the log entry.
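
The wait-and-check pattern in steps 1d and 2d can be scripted. The following is a minimal sketch, not part of CSM: the function name wait_for_log_entry, its argument order, and the 10-second polling interval are assumptions made for illustration. The 300-second default mirrors the 5-minute wait described above.

```shell
# Poll FILE until a line containing TEXT appears or TIMEOUT seconds pass.
# Hypothetical helper: name, arguments, and defaults are illustrative only.
wait_for_log_entry() {
    wfle_file=$1
    wfle_text=$2
    wfle_timeout=${3:-300}    # 5 minutes, as in the procedure above
    wfle_interval=${4:-10}    # assumed polling interval; must be at least 1
    wfle_elapsed=0
    while [ "$wfle_elapsed" -le "$wfle_timeout" ]; do
        # -F treats TEXT as a fixed string rather than a regular expression.
        if [ -r "$wfle_file" ] && grep -qF -e "$wfle_text" "$wfle_file"; then
            return 0
        fi
        sleep "$wfle_interval"
        wfle_elapsed=$((wfle_elapsed + wfle_interval))
    done
    return 1
}
```

For example, after step 2b you might call the function with the error log path from step 2d and the text Switch chassis management software rebooted. A nonzero return status means that the entry did not arrive in time and the checks listed in step 2d apply.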

Using syslog for CSM/MS on Red Hat Linux:

Use this procedure to set up the syslog command to send log entries from the fabric management server and switches.

Note: Do not use this procedure unless you were directed here from another procedure.

You were directed to this procedure because the level of Linux you are using on the Cluster Systems Management/Management Server (CSM/MS) does not support the syslog-ng command. After completing the following procedure, return to the procedure that sent you here.

1. Set up a sensor for the syslog.fabric.notices file by using the monerrorlog command, but change the default priority filter to f_fabnotices and the monitored file to syslog.fabric.notices as shown in the following example:
      /opt/csm/csmbin/monerrorlog -f "/var/log/csm/syslog.fabric.notices" -p "local6.notice"

2. Wait approximately 2 minutes after running the monerrorlog command. The following information can then be found in the /etc/syslog.conf file:
      local6.notice /var/log/csm/syslog.fabric.notices rotate 4m files 1

3. Return to step 8 on page 99 in Remote Syslogging and Event Management for CSM on Linux.
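
The check in step 2 can be done by eye, or with a small helper such as the following sketch. The function name has_fabric_notices_rule is hypothetical, and the pattern assumes that the rule appears on a single uncommented line in the form shown in step 2.

```shell
# Return 0 if the syslog configuration routes local6.notice to the
# fabric notices file. Hypothetical helper, not part of CSM.
has_fabric_notices_rule() {
    hfnr_conf=${1:-/etc/syslog.conf}
    # Anchoring at the start of the line skips commented-out rules.
    grep -Eq '^[[:space:]]*local6\.notice[[:space:]]+/var/log/csm/syslog\.fabric\.notices' "$hfnr_conf"
}
```

Running has_fabric_notices_rule with no argument checks /etc/syslog.conf. A nonzero status about 2 minutes after the monerrorlog command suggests that the sensor setup in step 1 did not take effect.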

Setting up remote command processing

Use this procedure to set up remote command processing (dsh) commands from Cluster Systems Management (CSM) to the switches and to the fabric management server.

Setting up remote command processing to the fabric management server is a standard Linux node setup, except that the fabric management server is treated as a device.


Setting up remote command processing to the switches is a standard hardware device setup.

Figure 13 illustrates tasks R1, R2, and R3 for setting up remote command processing. It also illustrates how the setup tasks for remote command processing relate to the key tasks that are illustrated in Figure 11 on page 85.

Do not proceed with this procedure until all the following tasks have been completed:
1. The Cluster Systems Management/Management Server (CSM/MS) has been installed and cabled to the service and cluster virtual local area networks (VLANs) (CM4).
2. The fabric management server has been installed and cabled to the cluster VLAN (F8).
3. The switches have been installed and cabled to the cluster VLAN (W3).
4. The Ethernet devices on the service and cluster VLAN have been installed and cabled (E2).

To set up remote command processing, complete the following steps:
1. R1 (M4): Set up remote command processing with the fabric management server as follows:

Figure 13. Remote command processing setup


   Note: The following method is just one of several methods by which you can set up remote command processing to a fabric management server. You can use any method that meets your requirements. For example, you can set up the fabric management server as a node. By setting it up as a device rather than a node, you might find it easier to group it differently from the IBM servers.

   a. If you are only defining a single fabric management server as a device for CSM, use the following command. Otherwise, go to step 1b.
         definehwdev -d [Fabric/MS] DeviceType=FabricMS RemoteShellUser=[USERID] RemoteShell=/usr/bin/ssh RemoteCopyCmd=/usr/bin/scp

   b. To define multiple fabric management servers, use the -f flag to identify a device definition file. For details, see the CSM Command Reference Guide. The following are the key attributes of the device definition file:
      • Name: The host name of the network adapter for the Linux host.
      • DeviceType: A unique name for the type of device, for example, QLogicMS for QLogic Management Server.
      • The following attributes are for Secure Shell (SSH) and dsh:
         – RemoteShellUser equals [USERID]. The USERID has permissions to Fast Fabric or the Fabric Manager.
         – RemoteShell equals /usr/bin/ssh.
         – RemoteCopyCmd equals /usr/bin/scp.
      Note: See the CSM man pages about deviceattributes for a list of available attributes for defining a device.

   c. Define at least one hardware group to address all fabric management servers at once:
         hwdevgrp -w "DeviceType=='FabricMS'" AllFabricMS

   d. Exchange SSH keys by using the updatehwdev -k -D AllFabricMS command.
      Note: Because a mixture of devices is being defined, the -a option cannot be used.

   e. Use dsh -d or dsh -D to remotely access the fabric management server from the CSM/MS.

2. R2 (M4): Set up remote command processing with the switches as follows.

   Note: The following method is just one of several methods by which you can set up remote command processing to a QLogic switch. You can use any method that meets your requirements. The QLogic switch does not use a standard shell for its command-line interface (CLI). Therefore, it should be set up as a device and not a node. For the dsh and updatehwdev commands to work, you need the command definition file.

   a. Create a definition file for the device type command for the switch device. This file is required for the dsh and updatehwdev commands to work with the proprietary command-line interface of the switch.
      1) If the /var/opt/csm/IBSwitch/Qlogic/config file exists, you can skip the creation of this file, and go to step 2b.
      2) Create the /var/opt/csm/IBSwitch/Qlogic directory path.
      3) Edit the /var/opt/csm/IBSwitch/Qlogic/config file.
      4) Add the following lines to the file:

         # QLogic switch device configuration
         # Follow the section format to add entry/value pair similar to below
         # [main]
         # EntryName=Value
         [main]
         # SSH key add command on device (must be uppercase K)
         ssh-setup-command=sshKey add
         [dsh]
         # Special command before remote command: for example, export environment variable
         pre-command=NULL
         # The command used to show the return code of last command processed.
         # Note: The command output must be a numeric value on the last line.
         # For example: # hello world!
         #              # 0
         post-command=showLastRetcode -brief

   b. For each switch, define the switch as a device for CSM by using the following command:
         definehwdev -d [switch address] DeviceType=IBSwitch::Qlogic RemoteShellUser=admin RemoteShell=/usr/bin/ssh RemoteCopyCmd=/usr/bin/scp

   c. To define multiple switches, use the -f flag to identify a device definition file. For details, see the CSM Command Reference Guide. The following are the key attributes for the definition file:
      • Name: The host name of the switch.
      • DeviceType: A unique name for the type of device.
      • The following attributes are for SSH and dsh:
         – RemoteShellUser equals admin.
         – RemoteShell equals /usr/bin/ssh.
         – RemoteCopyCmd equals /usr/bin/scp.
      Note: See the CSM man pages about deviceattributes for a list of available attributes for defining a device.

   d. Define a device group for the switches by using the following command:
         hwdevgrp -w "DeviceType like 'IBSwitch%Qlogic'" AllIBSwitches
      Note: Because the DeviceType is IBSwitch::Qlogic, it conflicts with the mkrsrc use of the :: as a delimiter. Therefore, the % is used as a wildcard to avoid this problem.

   e. Exchange SSH keys with the AllIBSwitches group by using the following command:
         updatehwdev -k -D AllIBSwitches --devicetype IBSwitch::Qlogic
      Note: Because a mixture of devices is being defined, the -a option cannot be used.

   f. Verify remote access to the switches by using the following command. You do not have to enter a password, and each switch replies with its firmware level.
         /opt/csm/bin/dsh -D AllIBSwitches --devicetype IBSwitch::Qlogic fwVersion | more

   g. You can now use the dsh -d or dsh -D command to remotely access the switches from the CSM/MS. Do not forget to use the --devicetype option so that the dsh command uses the appropriate command sequence to the switches.

3. Optional: R3 (M4): Create device groups to send commands to groups of switches and fabric management servers. In the previous steps, you set up a group for all fabric management servers, and you set up a group for all switches. For details on setting up device groups, see the CSM Administration Guide. Some possible groupings include the following examples:
   • All the fabric management servers (AllFabricMS)
   • All primary fabric management servers
   • All the switches (AllIBSwitches)
   • A separate subnet group for all the switches on a subnet
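
Step 2b repeats one definehwdev command per switch. As a hedged sketch, the following function prints that command for each address in a list so that the result can be reviewed before anything is run. The function name print_switch_definitions and the one-address-per-line input format are assumptions for illustration; the definehwdev arguments are copied from step 2b.

```shell
# Read switch addresses on stdin, one per line, and print the step 2b
# definehwdev command for each. Dry run only: nothing is executed.
print_switch_definitions() {
    while read -r psd_addr; do
        # Skip blank lines and comment lines in the address list.
        case $psd_addr in
            ''|'#'*) continue ;;
        esac
        printf 'definehwdev -d %s DeviceType=IBSwitch::Qlogic RemoteShellUser=admin RemoteShell=/usr/bin/ssh RemoteCopyCmd=/usr/bin/scp\n' "$psd_addr"
    done
}
```

For example, print_switch_definitions < switches.txt (where switches.txt is a hypothetical file of switch addresses) lists the commands for inspection. The device definition file described in step 2c remains the documented way to define many switches at once.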

Installing and configuring servers with management consoles

Use this procedure to do the final configuration of the management consoles that work with your server, including the Hardware Management Console (HMC), Cluster Systems Management, and the operating system installation servers. This procedure is the final configuration of the management subsystem.


The references in this task are from Figure 10 on page 54.

Do not start this procedure until you have completed all the following tasks:
1. (H6): The HMCs have been installed and cabled to the service virtual local area network (VLAN).
2. (CM4): The Cluster Systems Management/Management Server (CSM/MS) has been installed and cabled to the service and cluster VLANs.
3. (E2): The Ethernet devices on the service and cluster VLANs have been installed and cabled.

To install and configure the server with management consoles, complete the following steps:
1. (M4): Perform the procedure in “Installing and configuring the cluster server hardware.” The final configuration of management consoles is performed during the steps associated with S3 and M4. The following steps are intended to provide an overview of what is done in the Installing and configuring the cluster server hardware procedure.

2. If you add servers and host channel adapters (HCAs), you need to ensure the following:
   a. The bulk power controllers (BPCs) and servers must be at power standby before proceeding. See the procedure “Installing and configuring server hardware” on page 108 up to and including major task S2.
   b. Dynamic host configuration protocol (DHCP) on the service VLAN must be operational.

3. When you perform the server installation and configuration procedure, the following tasks need to be performed:
   a. Verify that the BPCs and service processors are acquired by the DHCP server on the service VLAN.
   b. If using Cluster-Ready Hardware Server (CRHS), set up the peer domains and HMC links in CRHS on the CSM/MS. For details, see the CSM Administration Guide.
   c. If using CRHS, perform server and frame authentication with CRHS on the CSM/MS. For details, see the CSM Administration Guide.

Installing and configuring the cluster server hardware

This procedure is intended to be completed by an IBM service representative or by the customer responsible for installing cluster server hardware.

Installing and configuring the cluster server hardware encompasses major tasks S3 through S5 and the server part of M3 and M4, which are illustrated in Figure 10 on page 54. During this procedure, you install and configure the servers for your cluster.

Note: If possible, do not begin this procedure until “Installing operating system installation servers” on page 90 is completed. This helps avoid the situation where installation personnel are waiting on-site for key parts of this procedure to be completed. Depending on the arrival of units on-site, this is not always practical. Review “Order of installation” on page 52 and Figure 10 on page 54 to identify the merge points where a step in a major task that is being performed by one person is dependent on the completion of steps in another major task that is being performed by another person.

Before proceeding, obtain the following documentation:
• Server installation documentation.
• For host channel adapter (HCA) installation information, see GX adapter.

If this installation is for expanding or adding hardware to a cluster, before proceeding, review “Installing and configuring servers when expanding or adding to an existing cluster” on page 111.


Installing and configuring server hardware

Use this procedure to install and configure server hardware for use with your cluster.

Note: Installation and configuration encompass the tasks shown in Figure 10 on page 54.

1. Before you start your server hardware installation and configuration, select one of the following options:
   • If this is a new installation, go to step 2.
   • If you are adding servers to an existing cluster, go to step 2.
   • If you are adding cables to existing host channel adapters (HCAs), proceed to step 12 on page 110.
   • If you are adding host channel adapters (HCAs) to existing servers, go to “Installing or replacing an InfiniBand GX host channel adapter” on page 124 and follow the installation instructions for the HCAs. Then proceed to step 12 on page 110.

2. S3: Position the frames or racks according to the data center floor plan.

3. Choose from the following items, and then go to the appropriate step for your cluster.
   • If you have a single Hardware Management Console (HMC) in the cluster and you are not using Cluster Systems Management (CSM) and a Cluster-Ready Hardware Server (CRHS) in your cluster, go to step 4.
   • If you are using CSM and CRHS in your cluster, go to step 5.

4. If you have a single HMC and you are not using CSM and a CRHS in your cluster, complete the following steps:
   a. S1: Position the servers in frames or racks and install the HCAs. Do not connect or apply power to the servers at this time.
      Note: Do not proceed with the server installation instructions past the point where you physically install the hardware.
      Follow the installation procedures for servers found in the following resources:
      • If your server model is installed by an IBM service representative, contact your next level of support.
      • If your server is a customer installable model, refer to the IBM Power Systems Hardware Information Center. Select Power Systems information, select the model you are working on, and then select Installing and configuring the system.
   b. Verify that the HMC is configured and operational.
      After the Ethernet service virtual local area network (VLAN) and management consoles have been installed and configured, they are ready to discover and connect to frames and servers on the Ethernet service VLAN.
   c. Proceed to step 6 on page 109.

5. If you are using CSM and a CRHS in your cluster, complete the following steps:
   a. (S1): Position servers in frames or racks and install the HCAs. Do not connect or apply power to the servers at this time.
      Note: Do not proceed in the server installation instructions (customer or service representative) past the point where you physically install the hardware.
      Follow the installation procedures for servers found in the following resources:
      • If your server model is installed by an IBM service representative, contact your next level of support.
      • If your server is a customer installable model, refer to the IBM Power Systems Hardware Information Center. Select Power Systems information, select the model you are working on, and then select Installing and configuring the system.
   b. (S2): Verify that the Dynamic Host Configuration Protocol (DHCP) server is running on the CSM management server.


      After the Ethernet service VLAN and management consoles have been initially installed and configured, they are ready to discover and connect to frames and servers on the Ethernet service VLAN.
   c. Proceed to step 6.

6. To connect the resources in each rack of servers to the Ethernet service VLAN and verify that addresses have been correctly served for each frame or rack of servers, perform the following procedure. By doing this one frame or rack at a time, you can verify that addresses have been served correctly, which is critical for cluster operation.
   a. (M3): Connect the frame or server to the Ethernet service VLAN.
      Follow the installation procedures for servers found in the following resources:
      • If your server model is installed by an IBM service representative, contact your next level of support.
      • If your server is a customer installable model, refer to the IBM Power Systems Hardware Information Center. Select Power Systems information, select the model you are working on, and then select Installing and configuring the system.
      Note: Do not proceed with the server installation instructions (customer or service representative) past the point where you attach the Ethernet cables from the frames and servers to the Ethernet service VLAN.
   b. Attach power cables to the frames and servers.
      Follow the installation procedures for servers found in the following resources:
      • If your server model is installed by an IBM service representative, contact your next level of support.
      • If your server model is a customer installable model, refer to the IBM Power Systems Hardware Information Center. Select Power Systems information, select the model you are working on, and then select Installing and configuring the system.
   c. (S2): Apply power to the system racks or frames through the unit emergency power off (UEPO) switch. Allow the servers to reach the power standby state (Power Off). For servers in frames or racks without bulk power assemblies (BPAs), the server boots to the power standby state after connecting the power cable.
      Note: Do not press the power button on the control panels or apply power to the servers so that they boot to the logical partition standby state.
   d. (S3): Use the following procedure to verify that the servers are now visible on the DHCP server:
      1) Check the DHCP server to verify that each server and bulk power controller (BPC) has been given an IP address. For a frame with a BPC, the IP addresses that are assigned for each BPC and service processor connection are shown. For a frame or rack with no BPC, the IP address that is assigned for each service processor connection is shown.
      2) Record the association between each server and its assigned IP address.

7. (M4): Do one of the following steps:
   • If you are not using CRHS, skip to step 8.
   • If you are using CRHS, after each server and BPC is visible on the DHCP server, use the instructions for CRHS in the CSM installation documentation to connect the frames and servers by assigning them to their respective managing HMC. Go to step 9.

8. If you are not using CRHS, in the Server and Frame Management windows, verify that each HMC has visibility to the appropriate servers and frames that it controls.

9. (M4): Authenticate the frames and servers.

10. (S3): In the server and frame management windows on each HMC, verify that you can see all the servers and frames to be managed by the HMC.


11. (S4): If you have IBM systems with 609.6 mm (24 in.) racks, ensure that the servers and power subsystems in your cluster are all at the correct firmware levels. See the IBM clusters with the InfiniBand switch Web site for information regarding the most current release levels for the system firmware and the power subsystem firmware.
    Follow the links to the appropriate download sites and instructions in the IBM clusters with the InfiniBand switch Web site.

12. S5: Verify system operation from the HMCs by performing the following procedure at each HMC for the cluster:
    a. Bring the servers to the logical partition standby state and verify the viability of the system by waiting several minutes and checking the Manage serviceable events task in the HMC. If you cannot bring a server to the logical partition standby state, or there is a serviceable event reported in the Manage serviceable events task, perform the prescribed service procedure.
    b. To verify each server, use the following procedure to run the online diagnostics:
       1) Depending on the server and who is doing the installation, run these diagnostics from the CD-ROM, Network Installation Management (NIM) shared product object tree (SPOT), or concurrently from an installed AIX operating system. The logical partition must be configured and activated before you run online diagnostics.
       2) To resolve a problem with a server, check the diagnostic results and the Manage serviceable events task in the HMC and follow the maintenance procedures.

Note: Typically, the responsibility of the IBM service representative ends here for IBM service installed frames and servers. However, from this point forward, after the IBM service representative leaves the site, if any problem is found in a server or with an InfiniBand link, a service call must be placed.

The IBM service representative should recognize that the HCA link interface and InfiniBand cables have not been verified yet. The HCA link interface and InfiniBand cables will not be verified until the end of the procedure for InfiniBand network verification, which might be performed by either the customer or an independent vendor. When the IBM service representative leaves the site, it is possible that the procedure for InfiniBand network verification might identify a faulty link, in which case the IBM service representative might receive a service call to isolate and repair a faulty HCA or cable.

Introduction to installing the operating system and configuring the cluster servers

This procedure is for the customer who is installing the operating system and configuring the cluster servers.

Installing the operating system and configuring the cluster servers encompasses major tasks S6 and S7 and the server part of M4, which are illustrated in Figure 10 on page 54. During this procedure, you install the operating system and configure the servers in the cluster.

Note: If possible, do not begin this procedure until you complete the procedure found in “Installing and configuring the management subsystem” on page 83. This helps avoid the situation where installation personnel might be waiting on-site for key parts of this procedure to be completed. Depending on the arrival of units on-site, this is not always practical. Review “Order of installation” on page 52 and Figure 10 on page 54 to identify the merge points where a step in a major task that is being performed by one person is dependent on the completion of steps in another major task that is being performed by another person.

Before proceeding, obtain the following documentation:
• Operating system installation information.
• Host channel adapter (HCA) installation information in the IBM Power Systems Hardware Information Center.

Installing and configuring servers when expanding or adding to an existing cluster

Use this information for details about expanding or adding to an existing cluster.

If you are adding or expanding InfiniBand network capabilities to an existing cluster by adding servers to the cluster, then you need to approach the server installation and configuration differently than with a new cluster flow. The flow for server installation and configuration is based on a new cluster installation, but it indicates where there are variances for expansion scenarios.

Table 58 outlines how the new cluster installation is affected or altered by expansion scenarios.

Table 58. Effects on cluster installation when expanding existing clusters

Scenario: Adding InfiniBand hardware to an existing cluster (switches and host channel adapters (HCAs))
Effects:
   • Configure the logical partitions to use the HCAs.
   • Configure HCAs for switch partitioning.

Scenario: Adding new servers to an existing InfiniBand network
Effects:
   • Perform this procedure as if it were a new cluster installation.

Scenario: Adding HCAs to an existing InfiniBand network
Effects:
   • Perform this procedure as if it were a new cluster installation.

Scenario: Adding a subnet to an existing InfiniBand network
Effects:
   • Configure the logical partitions to use the new HCA ports.
   • Configure the newly cabled HCA ports for switch partitioning.

Scenario: Adding servers and a subnet to an existing InfiniBand network
Effects:
   • Perform this procedure as if it were a new cluster installation.

Installing the operating system and configuring the cluster servers

Use this procedure to install your operating system and to configure the cluster servers.

Note: Installing the operating system and configuring the cluster servers encompasses major tasks that are illustrated in Figure 10 on page 54.

1. S6: Customize logical partitions and host channel adapter (HCA) configuration.
   Note: When setting up the logical partition profiles, you must configure the HCAs by using the procedure found in “Installing or replacing an InfiniBand GX host channel adapter” on page 124. Ensure that you do the step that configures the globally unique identifier (GUID) index and capability for the HCA in the logical partition.
   Define logical partitions by using the following procedures. During this procedure, configure the HCAs by using the procedure found in “Installing or replacing an InfiniBand GX host channel adapter” on page 124. Ensure that you do the steps that configure the GUID index and capability for the HCA in the logical partition.
   a. See the IBM System Information Center.
      • For POWER6 server installation, select Power Systems information, select the model you are working on, and then select Installing and configuring the system.
      • For GX RIO-2/HSL-2 adapters and GX 12x channel adapters, select Installing and configuring the system → GX RIO-2/HSL-2 adapters and GX 12x channel adapters.

2. S7: After the servers are connected to the cluster VLAN, install and update the operating systems. If servers do not have removable media capabilities, use an AIX Network Installation Management (NIM) server or Linux distribution server to load and update the operating systems.


3. Depending on your operating system, perform the installation instructions found in one of the following subprocedures. When finished with the subprocedure, proceed to step 4.
   a. For AIX, use the procedure in “Installing AIX.”
   b. For Linux, use the procedure in “Installing Linux” on page 113.

4. If you are running an embedded subnet manager, to check multicast group creation, run the following command on each switch that has a master subnet manager. If you have Cluster Systems Management/Management Server (CSM/MS) set up, you can use the dsh command from the CSM/MS to the switches. For details, see “Setting up remote command processing” on page 103. Specify --devicetype IBSwitch::Qlogic when you point to the switches.
      smShowGroups
   There should be just one group with all the HCA devices on the subnet being part of the group. Note that mtu=5 indicates 4 KB and mtu=4 indicates 2 KB. The following example shows a 4 KB maximum transfer unit (MTU).
      0xff12401bffff0000:00000000ffffffff (c000)
      qKey = 0x00000000 pKey = 0xFFFF mtu = 5 rate = 3 life = 19 sl = 0
      0x00025500101a3300 F 0x00025500101a3100 F 0x00025500101a8300 F
      0x00025500101a8100 F 0x00025500101a6300 F 0x00025500101a6100 F
      0x0002550010194000 F 0x0002550010193e00 F 0x00066a00facade01 F

5. After the servers are operational and CSM is installed and can send a dsh command to the servers, map the HCAs. This helps with future fault isolation. For more details, see the procedure found in “Mapping of IBM HCA GUIDs to physical HCAs” on page 162.
   a. Log on to the CSM/MS.
   b. Create a location for storing the HCA maps, such as /home/root/HCAmaps.
      Note: If you do not have mixed AIX and Linux nodes, instead of using the -N parameter in the following commands, you can use -a and store all nodes in one file, for example NodeHCAmap.
   c. For AIX nodes, run the following command:
         dsh -v -N AIXNodes 'ibstat -n | grep GUID' > /home/root/HCAmaps/AIXNodeHCAmap
   d. For Linux nodes, run the following command:
         dsh -v -N LinuxNodes 'ibv_devinfo -v | grep "node_guid"' > /home/root/HCAmaps/LinuxNodeHCAmap
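
For the multicast group check in step 4, the mtu field in the smShowGroups output is a code rather than a byte count. The following sketch translates it. The function names ib_mtu_bytes and group_mtus are hypothetical. Codes 4 and 5 are taken from step 4; codes 1 through 3 follow the standard InfiniBand MTU encoding and are included as an assumption about what a switch might report.

```shell
# Map an InfiniBand MTU code to a size in bytes.
ib_mtu_bytes() {
    case $1 in
        1) echo 256 ;;
        2) echo 512 ;;
        3) echo 1024 ;;
        4) echo 2048 ;;    # 2 KB, as stated in step 4
        5) echo 4096 ;;    # 4 KB, as stated in step 4
        *) echo "unknown MTU code: $1" >&2; return 1 ;;
    esac
}

# Read smShowGroups output on stdin and print the MTU in bytes for
# every line that carries an "mtu = N" field.
group_mtus() {
    sed -n 's/.*mtu *= *\([0-9][0-9]*\).*/\1/p' | while read -r gm_code; do
        ib_mtu_bytes "$gm_code"
    done
}
```

Piping saved smShowGroups output into group_mtus prints one value per multicast group. With the example output in step 4, a cluster planned for a 4 KB MTU would be expected to show a single line.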

Installing AIX:

Use this procedure if you are installing the AIX operating system, and you were directed to it from another procedure.

1. Do not run the mkiba command until you have correctly set up the subnet managers for the correct maximum transfer unit (MTU) as planned by using “Planning for maximum transfer units (MTUs)” on page 34 and the “QLogic switch planning work sheets” on page 66. For host-based subnet managers, see “Installing the fabric management server” on page 91. For embedded subnet managers, see “Installing and configuring vendor InfiniBand switches” on page 116.
   The subnet managers must be running before you start to configure the interfaces in the partitions. If the commands start failing and an lsdev | grep ib command reveals that the devices are stopped, it is likely that the subnet managers are not running.

2. Run the mkdev command for the InfiniBand Communication Manager (ICM). For example:
      mkdev -c management -s infiniband -t icm

3. Run the mkiba command for the devices. For example:
      mkiba -a [ip address] -i ib0 -A iba0 -p 1 -P 1 -S up -m 255.255.255.0

4. After the HCA device driver is installed and the mkiba command is done, run the followingcommand to set the device MTU to 4 KB and enable superpackets


for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
chdev -l $i -a superpacket=on -a tcp_recvspace=524288 -a tcp_sendspace=524288 -a srq_size=16000 -a state=up
done

Note: The previous example modifies all the host channel adapter (HCA) devices in the logical partition. To modify a specific device (such as ib0), use the command chdev -l ib0 -a superpacket=on -a tcp_recvspace=524288 -a tcp_sendspace=524288 -a srq_size=16000 -a state=up.

5. To verify the configuration, perform the following steps:
a. To verify that the devices are set with superpackets on, use the following command:
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
echo $i
lsattr -El $i | egrep "super"
done

Note: To verify a single device (such as ib0), use the command lsattr -El ib0 | egrep "mtu|super".

b. To check the interfaces for the HCA devices (ibx) and ml0, use the following command:
netstat -in | grep -v link | awk '{print $1,$2}'

The results are similar to the following example, where the MTU value is in the second column:
Name Mtu
en2 1500
ib0 65532
ib1 65532
ib2 65532
ib3 65532
ib4* 65532
ib5 65532
ib6 65532
ib7 65532
ml0 65532
lo0 16896
lo0 16896

Note: If you have a problem where the MTU value is not 65532, you must follow the recoveryprocedure in “Recovering ibx interfaces” on page 199.

6. To check multicast group creation if you are running a host-based subnet manager, run a query for the specific subnet on the fabric management server. Remember that you must provide the HCA and port through which the subnet manager connects to the subnet.
/sbin/saquery -o mcmember -h [HCA] -p [HCA port]

Each interface produces an entry similar to the following example. Notice the 4 KB MTU and 20g rate. The rate should match the planned rate. For more information, see “Planning for maximum transfer units (MTUs)” on page 34 and the “QLogic switch planning work sheets” on page 66.
GID: 0xff12601bffff0000:0x0000000000000016
PortGid: 0xfe80000000000000:0002550070010f00
MLID: 0xc004 PKey: 0xffff Mtu: 4096 Rate: 20g PktLifeTime: 2147 ms
QKey: 0x00000000 SL: 0 FlowLabel: 0x00000 HopLimit: 0xff TClass: 0x00

Note: You can check for misconfigured interfaces by using a command similar to the following example, which looks for any entry with an MTU smaller than 4096 or with a rate of 10g.
/sbin/saquery -o mcmember -h [HCA] -p [port] | egrep -B 3 -A 1 'Mtu: [0-3]|Rate: 10g'

7. Return to the procedure that directed you here.
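
The multicast membership check in step 6 can also be written so that the planned values are passed in explicitly instead of being encoded in an egrep pattern. The following is a minimal sketch under stated assumptions. The function name check_mcmembers is hypothetical, and the parsing assumes the record layout shown in the example output of step 6, where a PortGid line precedes the line that carries the Mtu and Rate fields.

```shell
# Hypothetical helper: read saquery mcmember output on stdin and print
# every member whose Mtu or Rate differs from the planned values.
# Usage: ... | check_mcmembers PLANNED_MTU PLANNED_RATE
# Assumption: each record has a "PortGid:" line before its "Mtu:/Rate:" line.
check_mcmembers() {
    awk -v mtu="$1" -v rate="$2" '
    $1 == "PortGid:" { port = $2 }       # remember which port this record describes
    {
        m = ""; r = ""
        for (i = 1; i < NF; i++) {
            if ($i == "Mtu:")  m = $(i + 1)
            if ($i == "Rate:") r = $(i + 1)
        }
        # Report only lines that carry an Mtu field and deviate from the plan.
        if (m != "" && (m != mtu || r != rate))
            print port, "Mtu:", m, "Rate:", r
    }'
}

# Example, for a planned 4 KB MTU and 20g rate:
#   /sbin/saquery -o mcmember -h [HCA] -p [HCA port] | check_mcmembers 4096 20g
```

No output means that every listed member matches the planned MTU and rate.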

Installing Linux:


Use this procedure if you are installing the Linux operating system, and you were directed to it from another procedure.
1. Follow the installation instructions for the Linux operating system. For details, see the documentation provided with SUSE Linux Enterprise Server 10 (SP2), with IBM InfiniBand GX host channel adapter (HCA) driver, and Open Fabrics Enterprise Distribution (OFED) stack. Alternatively, see the instructions contained in the OFED-1.3.1 package, which is available from the Open Fabrics Alliance Web site Downloads page. For more information, see the IBM Clusters with the InfiniBand Switch Web site.

2. Confirm that the ofed, ofed-kmp-ppc64, and ofed-kmp-kdump RPM packages (rpms) are installed. Also, check for the optional ofed-doc rpm. Use the rpm command as shown in the following example:
[root on c697f1sq01][/etc/sysconfig/network] => rpm -qa | grep -i ofed
ofed-kmp-kdump-1.3_2.6.16.60_0.25-0.28
ofed-1.3-0.28
ofed-doc-1.3-0.28
ofed-kmp-ppc64-1.3_2.6.16.60_0.25-0.28

3. If the above rpms have not been installed, install them now. Use the instructions from the documentation that is provided with SUSE Linux Enterprise Server 10 (SP2), with IBM InfiniBand GX HCA driver, and Open Fabrics Enterprise Distribution (OFED) stack. Alternatively, see the instructions contained in the OFED-1.3.1 package, which is available from the Open Fabrics Alliance Web site. For more information, see the IBM Clusters with the InfiniBand Switch Web site.

4. Set up configuration files:
a. Edit the /etc/modprobe.conf.local file:

1) If all InfiniBand HCA ports are to be used consecutively, use the nr_ports=-1 option. Otherwise, go to step 4a2.
#
# add local extensions to this file
#
options ib_ehca nr_ports=-1
options ib_ipoib send_queue_size=256 recv_queue_size=512

2) If only the first port in each InfiniBand HCA is to be used, use the nr_ports=1 option:
#
# add local extensions to this file
#
options ib_ehca nr_ports=1
options ib_ipoib send_queue_size=256 recv_queue_size=512

b. Edit the /etc/sysconfig/network/ifcfg-ibX file, where X is the InfiniBand interface number. There is one file for each active interface. These files are read at boot time when the IPoIB code is loaded. The following are the important fields.
BOOTPROTO=
BROADCAST=
IPADDR= # ip address of the interface
MTU= # MTU for the interface (see “Planning for maximum transfer units (MTUs)” on page 34 and the “QLogic switch planning work sheets” on page 66. A blank value = 2KB MTU default)
NETMASK=
NETWORK=
REMOTE_IPADDR=
STARTMODE=

Example: A server with four InfiniBand interfaces could have files similar to the following example.
[root on c697f1sq01][/etc/sysconfig/network] => cat ifcfg-ib0
BOOTPROTO='static'
BROADCAST='10.0.1.255'
IPADDR='10.0.1.1'
MTU=''
NETMASK='255.255.255.0'
NETWORK='10.0.1.0'
REMOTE_IPADDR=''
STARTMODE='onboot'

[root on c697f1sq01][/etc/sysconfig/network] => cat ifcfg-ib1
BOOTPROTO='static'
BROADCAST='10.0.2.255'
IPADDR='10.0.2.1'
MTU=''
NETMASK='255.255.255.0'
NETWORK='10.0.2.0'
REMOTE_IPADDR=''
STARTMODE='onboot'

[root on c697f1sq01][/etc/sysconfig/network] => cat ifcfg-ib2
BOOTPROTO='static'
BROADCAST='10.0.3.255'
IPADDR='10.0.3.1'
MTU=''
NETMASK='255.255.255.0'
NETWORK='10.0.3.0'
REMOTE_IPADDR=''
STARTMODE='onboot'

[root on c697f1sq01][/etc/sysconfig/network] => cat ifcfg-ib3
BOOTPROTO='static'
BROADCAST='10.0.4.255'
IPADDR='10.0.4.1'
MTU=''
NETMASK='255.255.255.0'
NETWORK='10.0.4.0'
REMOTE_IPADDR=''
STARTMODE='onboot'

5. Restart the server after you set up configuration files.
6. Verify that the IPoIB process starts. Use the lsmod command.
7. Use the ifconfig ibX command to verify interface operation.

Example output (note the correct configuration based on the /etc/sysconfig/network/ifcfg-ib0 file, and that the broadcast is running):
[root on c697f1sq01][/etc/sysconfig/network] => ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-08-24-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:10.0.1.1 Bcast:10.0.1.255 Mask:255.255.255.0
inet6 addr: fe80::202:5500:1001:2900/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:895100 errors:0 dropped:0 overruns:0 frame:0
TX packets:89686 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:512
RX bytes:50136680 (47.8 Mb) TX bytes:5393192 (5.1 Mb)

8. Use the netstat -i command to verify the interface table.
Example output with 4 KB MTU configuration:
Iface MTU Met RX-OKB RX-ERR RX-DRP RX-OVR TX-OKB TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 1141647 0 0 0 122790 0 0 0 BMRU
ib0 4092 0 1028150 0 0 0 102996 0 0 0 BMRU
ib1 4092 0 1028260 0 0 0 102937 0 0 0 BMRU
ib2 4092 0 1028494 0 0 0 102901 0 0 0 BMRU
ib3 4092 0 1028293 0 0 0 102910 0 0 0 BMRU
lo 16436 0 513906 0 0 0 513906 0 0 0 LRU

9. Use the netstat -rn command to verify the routing table.
Example output:
[root on c697f1sq01][/etc/init.d] => netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
9.114.28.64 0.0.0.0 255.255.255.192 U 0 0 0 eth0
10.0.4.0 0.0.0.0 255.255.255.0 U 0 0 0 ib3
10.0.1.0 0.0.0.0 255.255.255.0 U 0 0 0 ib0
10.0.2.0 0.0.0.0 255.255.255.0 U 0 0 0 ib1
10.0.3.0 0.0.0.0 255.255.255.0 U 0 0 0 ib2
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
0.0.0.0 9.114.28.126 0.0.0.0 UG 0 0 0 eth0

10. Return to the procedure that directed you here.
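
Because the ifcfg-ibX files in step 4b differ only in the subnet number, they can be generated rather than typed by hand. The following is a minimal sketch, not an IBM-supplied tool. The function name print_ifcfg is hypothetical, and the 10.0.N.0/24 addressing (where N is the interface number plus one) simply mirrors the example files above; substitute the addresses from your own installation plan.

```shell
# Hypothetical helper: print the contents of an ifcfg-ibX file.
# Usage: print_ifcfg X HOST [MTU]
#   X    InfiniBand interface number (0 for ib0, 1 for ib1, ...)
#   HOST last octet of the IP address of this server
#   MTU  optional; left blank when omitted (2 KB default, per step 4b)
# Assumption: interface ibX sits on subnet 10.0.(X+1).0/24, as in the example.
print_ifcfg() {
    x=$1
    host=$2
    mtu=${3:-}
    net=$((x + 1))
    cat <<EOF
BOOTPROTO='static'
BROADCAST='10.0.$net.255'
IPADDR='10.0.$net.$host'
MTU='$mtu'
NETMASK='255.255.255.0'
NETWORK='10.0.$net.0'
REMOTE_IPADDR=''
STARTMODE='onboot'
EOF
}

# Example: write the four files for a server whose addresses end in .1
#   for x in 0 1 2 3; do
#       print_ifcfg $x 1 > /etc/sysconfig/network/ifcfg-ib$x
#   done
```

Printing to standard output keeps the sketch easy to review before anything is written under /etc/sysconfig/network.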

Installing and configuring vendor InfiniBand switches
Use this procedure if you are responsible for installing the vendor switches.

The InfiniBand switch installation and configuration encompasses major tasks W1 through W6 that are shown in Figure 10 on page 54.

Note: If possible, this procedure should not begin before the management subsystem installation and configuration procedure is completed. This avoids the situation where various installation personnel are waiting on-site for key parts of this procedure to be completed. Depending on the arrival of units on-site, this is not always practical. Therefore, it is important to review the “Order of installation” on page 52 and Figure 10 on page 54 to identify the merge points where a step in a major task being performed by one person is dependent on the completion of steps in another major task being performed by another person.

Before installing and configuring vendor InfiniBand switches, obtain the following documentation:
• QLogic Switch Users Guide and Quick Setup Guide
• QLogic Best Practices Guide for a Cluster

From your installation planner, obtain the “QLogic switch planning work sheets” on page 66.

Installing and configuring InfiniBand switches when expanding or adding to an existing cluster
If you are adding or expanding InfiniBand network capabilities to an existing cluster, you might need to approach the InfiniBand switch installation and configuration differently than with a new cluster flow. The flow for InfiniBand switch installation and configuration is based on a new cluster installation, but it indicates where there are variances for expansion scenarios.

The following table shows how the new cluster installation is affected by expansion scenarios.

Table 59. Effects of expansion scenarios on cluster installation

Scenario | Effects
Adding InfiniBand hardware to an existing cluster (switches and host channel adapters (HCAs)) | Perform this task as if it were a new cluster installation.
Adding new servers to an existing InfiniBand network | You should not have to perform anything outlined in this major task.
Adding HCAs to an existing InfiniBand network | You should not have to perform anything outlined in this major task.
Adding a subnet to an existing InfiniBand network | Perform this task on new switches as if it were a new cluster installation.
Adding servers and a subnet to an existing InfiniBand network | Perform this task on new switches as if it were a new cluster installation.

Installing and configuring the InfiniBand switch
Use this procedure to install and configure InfiniBand switches.


It is possible to perform some of the tasks in this procedure in a method other than that which isdescribed. If you use other methods for configuring switches, you must review a few key points in theinstallation process outlined in this procedure. These key points are related to the order and coordinationof tasks and configuration settings that are required in a cluster environment.

Key points for installing and configuring the InfiniBand switch:

Review the following list of key points before beginning the switch installation process:
1. Power on the InfiniBand switches and configure their IP addresses before attaching them to the cluster virtual local area network (VLAN). Alternatively, you must add each switch to the cluster VLAN individually and change the default IP address before adding another switch.

Note: The switch vendor documentation refers to the Ethernet connection for switch management as the service VLAN.

2. Set the static IP addresses on the switches for the cluster VLAN.

Notes:
• If a switch has multiple managed spines or management modules, each one requires its own IP address, in addition to an overall chassis IP address.
• You also need to set up the default gateway.
• If an InfiniBand switch has multiple Ethernet connections for the cluster VLAN, and the cluster has multiple cluster VLANs for redundancy, the Ethernet ports on the switch must connect to the same cluster VLAN.

3. Update the switch firmware code as required. See the IBM Clusters with the InfiniBand Switch Web site for information regarding switch code levels.
4. Set the switch name.
5. Temporarily stop the embedded subnet manager and performance manager from running. Depending on the configuration, this might be a permanent state.
6. Set up logging:
a. Enable full logging.
b. Enable the full logging format.
c. Point switch logs to the Cluster Systems Management/Management Server (CSM/MS).
7. Set the chassis maximum transfer unit (MTU) value according to the installation plan.
8. If the switch is not running an embedded subnet manager, complete the following tasks:
a. Ensure that the embedded subnet manager is disabled.
b. Disable the performance manager.
c. Disable the default broadcast group.
9. If the switch is running an embedded subnet manager, complete the following tasks:
a. Use the license key to enable the embedded subnet manager to be run on the switch.
b. Set up the priority based on the fabric management work sheet.
c. Set the global identifier (GID) prefix value according to the installation plan. See the “QLogic switch planning work sheets” on page 66 or “Planning for global identifier prefixes” on page 35.
d. If this is a high-performance computing (HPC) environment, set the LID Mask Control (LMC) value to 2.
e. Set the broadcast MTU value according to the installation plan. See the “QLogic switch planning work sheets” on page 66 or “Planning for maximum transfer units (MTUs)” on page 34.
10. Point to the Network Time Protocol (NTP) server.
11. Instruct the customer to verify that the switch is detected by the CSM/MS by using the verify detection step in the Verifying the InfiniBand network topology and operation topic.


Note: If you are expanding an existing cluster, also consider using the QLogic switch command help. Onthe command-line interface (CLI), use the help command name command. Otherwise, see the users guidesfor information about the commands and to identify the appropriate command in its proceduraldocumentation.

Installing and configuring InfiniBand switches:

Complete the following procedure to install and configure your InfiniBand switches:

Note: The tasks W1 through W6 described in the following steps are based on the major steps found in Figure 10 on page 54.
1. Review this procedure and determine whether the Fabric Management Server has the Fast Fabric Toolset installed and is on the cluster VLAN before you finish this procedure. If Fast Fabric tools are available, you can customize multiple switches simultaneously after you have them configured with unique IP addresses and they are attached to the cluster VLAN. If you do not have Fast Fabric tools ready, you need to customize each switch individually. In that case, you might want to do the customization step right after you set up the switch management IP address and give it a name.

2. W1: Physically place frames and switches on the data center floor, and complete the following steps:
a. Review the vendor documentation for each switch model that you are installing.
b. Physically install the InfiniBand switches into 19-inch frames (or racks) and attach power cables to the switches according to the instructions for the InfiniBand switch model. This automatically powers on the switches. There is no power switch for the switches.

Note: Do not connect the Ethernet connections for the cluster VLAN at this time.

3. W2: Set up the Ethernet interface for the cluster VLAN by setting the switch to a fixed IP address, which is provided by the customer. See the “QLogic switch planning work sheets” on page 66. Use the procedure in the vendor documentation for setting switch addresses.

Notes:
• You can attach a notebook to the serial port of the switch, or you can attach each switch individually to the cluster VLAN, use the default address to get into the CLI, and customize the static IP address.
• As indicated in “Planning for QLogic InfiniBand switch configurations” on page 32, QLogic switches with managed spine modules have multiple addresses. There is an address for each managed spine, as well as an overall chassis address that is used by whichever spine is master at any given time.
• If you are customizing the IP address of the switch by accessing the CLI through the serial port on the switch, you might want to leave the CLI open to perform the rest of the customization. This is not necessary if the Fast Fabric Toolset has been installed and can access the switches, because with Fast Fabric tools, you can update multiple switches simultaneously.
• For QLogic switches, the key commands are setChassisIpAddr and setDefaultRoute.
• Use an appropriate subnet mask when setting up the IP addresses.

4. Set the switch name. For QLogic switches, use the setIBNodeDesc command.
5. Disable the subnet manager functions and performance manager functions. If embedded subnet management is used, this is reversed after the network cabling is completed.
• Ensure that the embedded subnet manager is not running by using the smControl stop command.
• Ensure that the embedded subnet manager does not start at boot by using the smConfig startAtBoot no command.
• Ensure that the performance manager is not running by using the smPmBmStart disable command.

6. W3: Attach the switch to the cluster VLAN.


Note: If the switch has multiple Ethernet connections, they must all attach to the same Ethernetsubnet.

7. W4: For QLogic switches, if the Fast Fabric Toolset is installed on the fabric management server, verify that the Fast Fabric tools can access the switch. Referring to the Fast Fabric Toolset Users Guide, use a simple query command or ping test to the switch. For example, the pingall command could be used as long as you point to the switch chassis and not to the servers or nodes.

8. W5: Verify that the switch code matches the latest supported level indicated on the IBM Clusters with the InfiniBand Switch Web site. Check the switch software level by using a method described in the vendor's switch users guides. These guides also describe how to update the switch's software, which is available on the vendor's Web site. For QLogic switches, the following guides and methods are suggested:
• You can check each switch individually by using a command on its command-line interface (CLI). This command can be found in the switch users guide for the model.
• If the Fast Fabric Toolset is installed on the fabric management server, you can check the code levels of multiple switches simultaneously by using techniques found in the Fast Fabric Toolset Users Guide.
• You can use the fwVersion command. If this command is issued by using Fast Fabric tools, the cmdall command can be used to issue this command to all switches simultaneously.
• For updating multiple switches simultaneously, use the Fast Fabric Toolset.

9. W6: Finalize the configuration for each InfiniBand switch.

You are setting up the final switch and subnet manager configuration. You planned the following values in the planning phase (see “Planning InfiniBand network cabling and configuration” on page 32 and the “QLogic switch planning work sheets” on page 66).
• Subnet manager priority
• MTU
• LMC
• GID prefix
• Node appearance and disappearance log threshold
For QLogic switches, the pertinent commands, user manuals, and methods to be used by this procedure follow:
• You can work with each switch individually by using a command on its CLI.
• If the Fast Fabric Toolset is installed on the fabric management server, you can check the code levels of multiple switches simultaneously by using techniques found in the Fast Fabric Toolset Users Guide. Set the chassis maximum transfer unit (MTU) value according to the installation plan. See the “QLogic switch planning work sheets” on page 66 or “Planning for maximum transfer units (MTUs)” on page 34.
• For setting chassis MTU, use the ismChassisSetMtu value command on each switch (4 equals 2 KB; 5 equals 4 KB).
• For each embedded subnet manager, use the following commands for final configuration:
– For the priority: smPriority priority
– For LMC=2: smMasterLMC=2
– For 4 KB broadcast MTU with default pkey: smDefBcGroup 0xFFFF 5 rate (rate: 3 equals SDR; 6 equals DDR)
– For 2 KB broadcast MTU with default pkey: smDefBcGroup 0xFFFF 4 rate (rate: 3 equals SDR; 6 equals DDR)
– For GID prefix: smGidPrefix GID-prefix value
– For node appearance or disappearance threshold of 10: smAppearanceMsgThresh 10
a. If this switch has an embedded subnet manager, complete the following steps:


1) Enable the subnet manager for operation by using the license key. Do not start the embedded subnet manager; you will start it later, in the procedure “Attaching cables to the InfiniBand network.” Use the addKey key command.
2) Set the GID-prefix value according to the installation plan. See the “QLogic switch planning work sheets” on page 66 or “Planning for global identifier prefixes” on page 35.
3) If this is a high-performance computing (HPC) environment, set the LMC value to 2.
b. Set the broadcast MTU value according to the installation plan. See the “QLogic switch planning work sheets” on page 66 or “Planning for maximum transfer units (MTUs)” on page 34.
c. If applicable, point to the NTP server. For QLogic switches, this is done by using the time command. Details are in the Switch Users Guide. Typical commands from the fast fabric management server are as follows. If remote command processing is set up on the CSM/MS, you can use the dsh command instead of cmdall. Remember to use the --devicetype IBSwitch::Qlogic option to access the switches.
1) If applicable, set the time by using the Network Time Protocol (NTP) server: cmdall -C 'time -S [NTP server IP-address]'
2) If no NTP server is present, set the local time: cmdall -C 'time -T hhmmss[mmddyyyy]'
3) Set the time zone, where X is the offset of the time zone from GMT: cmdall -C 'timeZoneConf X'
4) Set the daylight saving time, where X is the offset of the time zone from GMT: cmdall -C 'timeDSTTimeout X'

If you are also responsible for cabling the InfiniBand network, proceed to “Attaching cables to theInfiniBand network.” Otherwise, you can return to the overview of the installation section to findyour next set of installation tasks.

Other installation tasks involving final configuration of switches are:
• “Setting up remote logging” on page 95
• “Setting up remote command processing” on page 103
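
The broadcast group settings in step 9 use numeric codes rather than byte counts or rate names, which makes them easy to mistype. The following is a minimal sketch that turns a planned MTU and rate into the corresponding command string. The function name bcgroup_cmd is hypothetical. It uses only the codes stated in this procedure (MTU 4 for 2 KB and 5 for 4 KB; rate 3 for SDR and 6 for DDR) and assumes the smDefBcGroup argument order shown in step 9 with the default pkey 0xFFFF. Confirm the syntax against the Switch Users Guide before using the result.

```shell
# Hypothetical helper: print the smDefBcGroup command for a planned
# broadcast MTU (in bytes) and rate (SDR or DDR), default pkey 0xFFFF.
# Usage: bcgroup_cmd MTU_BYTES RATE
# Assumption: argument order is "pkey mtu-code rate-code", as in step 9.
bcgroup_cmd() {
    case "$1" in
        2048) mtu_code=4 ;;     # 4 equals 2 KB
        4096) mtu_code=5 ;;     # 5 equals 4 KB
        *) echo "unsupported MTU: $1" >&2; return 1 ;;
    esac
    case "$2" in
        SDR|sdr) rate_code=3 ;; # 3 equals SDR
        DDR|ddr) rate_code=6 ;; # 6 equals DDR
        *) echo "unsupported rate: $2" >&2; return 1 ;;
    esac
    echo "smDefBcGroup 0xFFFF $mtu_code $rate_code"
}

# Example: build the command for a 4 KB, DDR plan and review it before
# issuing it on the switch CLI or through cmdall.
#   bcgroup_cmd 4096 DDR
```

Rejecting any value outside the documented codes is deliberate: a silent fallback would hide a mismatch between the installation plan and the switch configuration.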

Attaching cables to the InfiniBand network
Use this procedure if you are responsible for installing the cables on the InfiniBand network.

Cabling the InfiniBand network encompasses major tasks C1 through C4, which are shown in Figure 10on page 54.

Note: Do not start this procedure until InfiniBand switches have been physically installed. Wait until theservers have been configured. This avoids the situation where installation personnel are waiting on-sitefor key parts of this procedure to be completed. Depending on the arrival of units on-site, this is notalways practical. Therefore, it is important to review the “Order of installation” on page 52 and Figure 10on page 54 to identify the merge points where a step in a major task being performed by one person isdependent on the completion of steps in another major task being performed by another person.

Before attaching the cables to the InfiniBand network, obtain the following documentation:
• QLogic Switch Users Guide and Quick Setup Guide
• QLogic Best Practices Guide for a Cluster

Obtain the following information from the installation planner:
• Cable planning information
• “QLogic switch planning work sheets” on page 66


Cabling the InfiniBand network for expansion
If you are adding or expanding your InfiniBand network capabilities to an existing cluster, you might need to approach cabling the InfiniBand network differently than with a new cluster flow. The flow for cabling the InfiniBand network is based on a new cluster installation, but it indicates where there are variances for expansion scenarios.

Table 60 shows how the new cluster installation is affected or altered by expansion scenarios.

Table 60. Effects of expanding an existing cluster

Scenario | Effects
Adding InfiniBand hardware to an existing cluster (switches and host channel adapters (HCAs)) | Perform this task as if it were a new cluster installation. All InfiniBand hardware is new to the cluster.
Adding new servers to an existing InfiniBand network | Perform this task as if it were a new cluster installation for all new servers and as if HCAs were added to the existing cluster.
Adding HCAs to an existing InfiniBand network | Perform this task as if it were a new cluster installation for all new HCAs that were added to the existing cluster.
Adding a subnet to an existing InfiniBand network | Perform this task as if it were a new cluster installation for all new switches that were added to the existing cluster.
Adding servers and a subnet to an existing InfiniBand network | Perform this task as if it were a new cluster installation for all new servers, HCAs, and switches that were added to the existing cluster.

Cabling the InfiniBand network
Use this procedure to cable your InfiniBand network.

It is possible to perform some of the tasks in this procedure by using a method other than that which isdescribed. If you have other methods for cabling the InfiniBand network, you still must review a few keypoints in the installation process regarding order and coordination of tasks and configuration settings thatare required in a cluster environment.

Note: IBM is responsible for replacing faulty or damaged cables that have IBM part numbers.

To cable your switch network, complete the following steps:

Note: The switch cabling encompasses major tasks C1 and C4, which are shown in Figure 10 on page 54.
1. Obtain and review a copy of the cable plan for the InfiniBand network.
2. Label the cable ends before routing the cable.
3. Power on the switches before attaching cables to them.
4. C1 - Route the InfiniBand cables according to the cable plan and attach them to only the switch ports. Refer to the switch vendor documentation for more information about how to plug cables.
5. C4 - Connect the InfiniBand cables to the host channel adapter (HCA) ports according to the planning documentation.
6. If both servers and switches have power applied as you complete cable connections, you should check the port LEDs while you plug in the cables. See the switch vendor's Switch Users Guide to understand the correct LED states. Fabric management can now be started.

Note: Depending on assigned installation responsibilities, it is possible that someone else might need to perform these actions. Coordinate this with the appropriate people.
• For QLogic embedded subnet managers, use the smControl start command, the smPmBmStart enable command, and the smConfig startAtBoot yes command. The latter command can be issued at the switch command line, or by using the Fast Fabric cmdall command. QLogic host-based Fabric Manager uses the iview_fm start command as instructed in “Installing the fabric management server” on page 91. Contact the person installing the fabric management server and indicate that the Fabric Manager can now be started on the fabric management server.

Note: If you are responsible for verifying the InfiniBand network topology and operation, you canproceed to that procedure.

Verifying the InfiniBand network topology and operation
Use this procedure to verify the network topology and operation of your InfiniBand network. This procedure is performed by the customer.

Verifying the InfiniBand network topology and operation encompasses major tasks V1 through V3, which are shown in Figure 10 on page 54.

Note: This procedure cannot be performed until all other procedures for cluster installation have beencompleted. These procedures include the management subsystem installation and configuration, serverinstallation and configuration, InfiniBand switch installation and configuration, and attaching cables tothe InfiniBand network.

The following documents are referred to in this procedure:
• For IBM units:
– IBM host channel adapter (HCA) available from your service representative
– Troubleshooting, service, and support in the IBM Power Systems Hardware Information Center
• For QLogic units:
– Fast Fabric Toolset Users Guide
– Switch Users Guide
– Fabric Manager and Fabric Viewer Users Guide

Note: It is possible to perform some of the tasks in this procedure by using a method other than thatwhich is described. If you have other methods for verifying the operation of the InfiniBand network,you still must review a few key points in this installation process regarding order and coordination oftasks and configuration settings that are required in a cluster environment.

• This procedure cannot be performed until all other procedures in the cluster installation have been completed. These include the following procedures:
   – Management subsystem installation and configuration, including:
      - Fabric Manager
      - Fast Fabric Toolset
   – Server installation and configuration
   – InfiniBand switch installation and configuration
   – Cabling the InfiniBand network

   Note: The exceptions to this requirement are the installation of the IBM high-performance computing (HPC) software stack and of other customer-specific software above the driver level. These do not have to be installed before you perform this verification procedure.

• IBM service is responsible for replacing faulty or damaged cables that have IBM part numbers.
• Vendor service or the customer is responsible for replacing faulty or damaged cables that do not have IBM part numbers.
• The customer should check the availability of HCAs to the operating system before any application is run to verify network operation.


• If you find a problem with a link that might be caused by a faulty HCA or cable, contact your service representative for repair.
• This is the final procedure in installing an IBM cluster with an InfiniBand network.

The following procedure provides additional details that can help you perform the verification of your network.

1. To verify the network topology, complete the following steps:
   a. Check all power LEDs on all the switches and servers to ensure that they are on. See the vendor switch users guide, or see the IBM systems service documentation for information about the correct LED states.
      Note: Your service representative will also have documentation on the correct LED states.
   b. Check all LEDs for the switch ports to verify that they are correctly lit. See the vendor switch users guide, or see the IBM systems service documentation for information about the correct LED states.
   c. Check the Manage serviceable events task on the Hardware Management Console (HMC) for server and HCA problems. Perform service before proceeding. If necessary, contact IBM Service to perform service.

   d. Verify that switches have the correct connectivity and that they are correctly set up on the management subsystem. If you find any problems, check the Ethernet connectivity of the switches, management servers, and Ethernet devices in the cluster virtual local area network (VLAN). To verify the connectivity of the switches, perform the following steps:
      1) On the fabric management server, run a pingall command to all the switches by using the instructions found in the Fast Fabric Toolset Users Guide. Assuming that you have set up the default chassis file to include all switches, this is the pingall -C command.
      2) On the Cluster Systems Management/Management Server (CSM/MS), ping the switches, or issue an fwVersion command to all the switches by using the dsh command. This is made easier by using the IBSwitch::Qlogic device type or by setting up device groups as instructed in “Setting up remote command processing” on page 103. The following example shows the use of the IBSwitch::Qlogic device type and the fwVersion command:
         dsh -D IBSwitches --devicetype IBSwitch::Qlogic fwVersion
      3) If available, from a console connected to the cluster VLAN, open a browser and use each switch's IP address as a URL to verify that the Chassis Viewer is operational on each switch. The QLogic Switch Users Guide contains information about the Chassis Viewer.
      4) If you have the QLogic Fabric Viewer installed, start it and verify that all the switches are visible on all the subnets. The QLogic Fabric Manager and Fabric Viewer Users Guide contains information about the Fabric Viewer.

   e. Verify that the switches are correctly cabled by running the baseline health check as documented in the Fast Fabric Toolset Users Guide. These tools are run on the fabric management server. To verify that the switches are correctly cabled, perform the following steps:
      1) Clear all the error counters by using the cmdall -C 'ismPortStats -clear -noprompt' command.
      2) Run the all_analysis -b command.
      3) Go to the baseline directory as documented in the Fast Fabric Toolset Users Guide.
      4) Check the fabric.*.links files to ensure that everything is connected as it should be. You need a map to identify the location of the IBM HCA globally unique identifiers (GUIDs) that are attached to switch ports. See “Mapping fabric devices” on page 162 for instructions on how to do this mapping.
      5) If a cable is not connected correctly, fix it and rerun the baseline check.
2. To verify the InfiniBand fabric operation for the cluster, complete the following steps:


   a. Verify that the HCAs are available to the operating system in each logical partition. You can use the dsh command from the CSM/MS to issue commands to multiple logical partitions simultaneously.
      1) For logical partitions that run the AIX operating system, check the HCA status by running the lsdev -C | grep ib command. An example of good results for verifying a GX HCA is:
         iba0 Available InfiniBand host channel adapter
      2) For logical partitions that run the Linux operating system, see the documentation that is provided with SUSE Linux Enterprise Server 10 (SP2) with IBM InfiniBand GX HCA driver and Open Fabrics Enterprise Distribution (OFED) Stack. Alternatively, see the instructions that are contained in the OFED-1.3.1 package, which is available for download from the Open Fabrics Alliance Web site.
   b. To verify that there are no problems with the fabric, complete the following steps:
      1) On the CSM/MS, inspect the /var/log/csm/errorlog/[CSM/MS hostname] log for subnet manager and switch log entries. For details on how to read the log, see “Interpreting switch log formats from vendors” on page 172. If a problem is encountered, see “Servicing clusters” on page 149.
      2) Run the Fast Fabric health check by using the instructions found in “Health checking” on page 135. If a problem is encountered, see “Servicing clusters” on page 149.
   c. Run a fabric verification application to send data on the fabric. For the procedure to run a fabric verification application, see “Fabric verification” on page 127. This procedure also checks for faults.
   d. After running the fabric verification tool, perform the checks recommended in “Fabric verification” on page 127.
3. After fixing the problems, run the baseline health check again so that the new baseline can be used to monitor fabric health and to diagnose problems. Use the /sbin/all_analysis -b command.

4. Clear all the switch logs to start with a clean log. Before you clear them, make a copy of the logs. To copy and then clear the logs, complete the following steps:
   a. Create the following directory for storing the state at the end of installation:
      /var/opt/iba/analysis/install_capture
   b. If you have the /etc/sysconfig/iba/chassis file configured with all switch chassis listed, issue the following command:
      captureall -C -d /var/opt/iba/analysis/install_capture
   c. If you have another file configured with all switch chassis listed, enter the following command:
      captureall -C -F [file with all switch chassis listed] -d /var/opt/iba/analysis/install_capture
   d. Run the cmdall -C 'logClear' command.

The InfiniBand network is now installed and available for operation.
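The closing steps of the procedure above (refresh the baseline, save the logs, then clear them) can be sketched as a small script. This is a minimal sketch, not a tested Fast Fabric procedure: the DRY_RUN switch, the run helper, and the capture_install_state function are illustrative names that do not come from the toolset, and the command options are written with ordinary hyphens on the assumption that the dashes in the printed commands stand for hyphens. The sketch assumes that the default chassis file lists all switch chassis.

```shell
#!/bin/sh
# Sketch of the end-of-installation sequence: baseline, capture, then clear.
# DRY_RUN, run, and capture_install_state are illustrative names only.
DRY_RUN=${DRY_RUN:-1}
CAPTURE_DIR=${CAPTURE_DIR:-/var/opt/iba/analysis/install_capture}

# In dry-run mode, print the command instead of executing it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

capture_install_state() {
    # Step 3: run the baseline health check again after repairs.
    run /sbin/all_analysis -b || return 1
    # Step 4a: directory that holds the state at the end of installation.
    run mkdir -p "$CAPTURE_DIR" || return 1
    # Step 4b: copy the switch logs before anything is cleared.
    run captureall -C -d "$CAPTURE_DIR" || return 1
    # Step 4d: clear the switch logs only after the copy succeeded.
    run cmdall -C 'logClear'
}
```

With the default DRY_RUN=1, calling capture_install_state only prints the four commands in order, which makes it possible to review the sequence before running it with DRY_RUN=0 on the fabric management server.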

Installing or replacing an InfiniBand GX host channel adapter

Use this procedure to guide you through the process for installing or replacing an InfiniBand GX host channel adapter (HCA).

The process of installing or replacing an InfiniBand GX HCA consists of the following tasks:
• Physically installing or replacing the adapter hardware into your system unit.
• Configuring the logical partition profiles with a new globally unique identifier (GUID) for the new adapter in your switch environment.
• Verifying that the HCA is recognized by the operating system.

Notes:

1. If you are considering deferred maintenance of a GX HCA, review “Deferring replacement of a failing host channel adapter” on page 126.
2. If you replace an HCA, it is possible that the new HCA could be defective in a way that prevents the logical partition from being activated. In this case, a notification appears on the controlling Hardware Management Console (HMC). If this occurs, decide whether you want to replace the new, defective HCA immediately, or whether you want to defer maintenance and continue activating the logical partition. To defer maintenance and continue activating the logical partition, you must unassign the HCA in all the logical partition profiles that contain the HCA by using the procedure in Recovering from an HCA preventing a logical partition from activating.

To install or replace an InfiniBand GX HCA, complete the following steps:
1. Work with your next level of support to obtain the installation instructions.
2. If you are performing an adapter replacement, first record information about the adapter being replaced. Important information includes the logical partitions in which it is used, the GUID index used in each logical partition, and the capacity used in each logical partition. To record this information, complete the following steps from the HMC that manages the server in which the HCA is installed:
   a. Obtain the list of logical partition profiles that use the HCA. If there is no list, proceed to the next step.
   b. Obtain or record the GUID index and capability settings in the logical partition profiles that use the HCA by using the following steps:
      1) Go to the Systems Management window.
      2) Click the Servers partition > the server in which the HCA is installed > the logical partition to be configured.
      3) Expand each logical partition that uses the HCA. If you do not know which logical partition uses the HCA, complete the following steps for each logical partition profile and record which ones use the HCA, as well as the GUID index and capability settings:
         a) Click each logical partition profile that uses the HCA.
         b) From the menu, click Selected → Properties.
         c) On the Properties page, click the HCA tab.
         d) Using its physical location, find the HCA of interest.
         e) Record the GUID index and capability settings.

3. Install or replace the adapter in the system unit. For instructions on installing an InfiniBand GX HCA in your system unit, see the RIO/HSL or InfiniBand adapter information in the IBM Power Systems Hardware Information Center.

   Note: When an HCA is added to a logical partition, the HCA becomes a required resource for the logical partition. If the HCA fails in such a way that the GARD function for the system prevents it from being used, the logical partition cannot be reactivated. If this occurs, a message is displayed on the controlling HMC that indicates that you need to unassign the HCA from the logical partition to continue activation. The GARD function is called for serious adapter or bus failures that could impair system operation, such as ECC errors or state machine errors. InfiniBand link errors should not call the GARD function.

4. Update the logical partition profiles (for all logical partitions that use the new GX HCA) with the new GUID for the new InfiniBand GX HCA.

   Note: Each InfiniBand GX HCA has a GUID that is assigned by the manufacturer. If any of these adapters are replaced or moved, the logical partition profiles for all logical partitions that use the new GX HCA must be updated with the new GUID. You can do this from the HMC that is used to manage the server in which the HCA is installed.

   To update the logical partition profiles, complete the following steps:
   a. Go to the Server and Partition window.
   b. Click the Server Management partition.
   c. Expand the server in which the HCA is populated > the partitions under the server > each partition that uses the HCA.
   d. Complete the following steps for each partition profile that uses the HCA:


      1) From the menu, click Selected → Properties.
      2) On the Properties page, click the HCA tab.
      3) Using its physical location, find and click the HCA of interest.
      4) Click Configure.
      5) Enter the GUID index and Capability settings. If this is a new installation, obtain these settings from the installation plan information. If this is a repair action, use the settings that you previously recorded in step 2 on page 125.
      6) If the replacement HCA is in a different location than the original HCA, clear the original HCA information from the partition profile by selecting the original HCA by its physical location and clicking Clear.

      Note: If the following message occurs when you attempt to assign a new GUID, you might be able to recover from this error without the help of a service representative.

         A hardware error has been detected for the adapter U787B.001.DNW45FD-P1-Cx. You cannot configure the device at this time. Contact your service provider

      The Manage serviceable events task can be accessed on your HMC. See the Start of call procedure in the service information for the server, and perform the indicated procedures. Check the serviceable events for reports that are related to this error. Perform any recovery actions that are indicated. If you cannot recover from this error, contact your service representative.

5. After the server is started, verify that the HCA is recognized by the operating system. For more information, see “Verifying the installed InfiniBand network fabric in AIX or Linux” on page 127.

You have finished installing and configuring the adapter. If you were directed here from another procedure, return to that procedure.

Deferring replacement of a failing host channel adapter

If you plan to defer maintenance of a failing host channel adapter (HCA), there is a risk of the HCA failing in such a way that it could prevent future logical partition reactivation.

To assess the risk, determine if there is a possibility of the HCA preventing the reactivation of the logical partition. If this is possible, you must consider the probability of rebooting the partition while maintenance is deferred.

To determine the risk, complete the following steps on the Hardware Management Console (HMC):
1. Go to the Server and Partition window.
2. Click the Server Management partition.
3. Expand the server in which the HCA is installed > the partitions under the server > each partition that uses the HCA. If you do not know which logical partition uses the HCA, you must expand the following menus for each logical partition profile, and record which logical partitions use the HCA.
4. Expand each partition that uses the HCA. If you do not know which logical partition uses the HCA, complete the following steps for each logical partition profile, and record which logical partitions use the HCA.
   a. Select each logical partition profile that uses the HCA.
   b. From the menu, click Selected → Properties.
   c. On the Properties page, click the HCA tab.
   d. Using its physical location, locate the HCA of interest.
   e. Verify that the HCA is managed by the HMC.

5. To determine whether to defer maintenance, consider the following two possibilities:
   • If you find that the HCA is not managed by the HMC, it has failed in such a way that the GARD function will prevent it from being used after the next IPL. Therefore, until maintenance is performed, any of the logical partitions that use the failed HCA might not activate correctly until the HCA is unassigned. This affects future IPLs that you perform during the deferred maintenance period. Any other failure that requires rebooting the partition also results in the partition not being activated correctly. To unassign an HCA, see Recovering from an HCA preventing a logical partition from activating. If you unassign the adapter while the logical partition is active, the HCA is unassigned the next time you reboot the partition.
   • If the HCA is managed by the HMC, the HCA failure does not cause the GARD function to be called for the HCA. Therefore, deferring maintenance does not risk preventing logical partition activation because of the GARD function.

Verifying the installed InfiniBand network fabric in AIX or Linux

Use this procedure to verify the installed InfiniBand network fabric in the AIX or Linux operating systems after the InfiniBand network is installed. The GX adapters and the network fabric must be verified through the operating system.

To verify the installed InfiniBand network fabric in the AIX or Linux operating system, see the following information:
• For the AIX operating system, see “Verifying the GX HCA connectivity by using AIX.”
• For the Linux operating system, see “Verifying the GX HCA to InfiniBand fabric connectivity by using Linux.”

Verifying the GX HCA connectivity by using AIX

Use this procedure to check the status of a GX host channel adapter (HCA) by using the AIX operating system.

To verify the GX HCA connectivity in AIX, check the HCA status by running the lsdev -C | grep ib command.

Good results for verifying a GX HCA are similar to the following example:

iba0 Available Infiniband host channel adapter.
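When many logical partitions have to be checked, the comparison against the expected Available state can be scripted. The following is a minimal sketch that filters lsdev -C style output for InfiniBand adapter devices that are not in the Available state. The function name hca_not_available is illustrative, and the sketch assumes that the device name is in the first column and the state in the second, as in the example line above.

```shell
#!/bin/sh
# Read `lsdev -C` style output on stdin and print the names of iba devices
# whose state (second column) is anything other than "Available".
# hca_not_available is an illustrative name, not an AIX command.
hca_not_available() {
    awk '$1 ~ /^iba[0-9]+$/ && $2 != "Available" { print $1 }'
}
```

On an AIX logical partition this could be used as lsdev -C | hca_not_available, where empty output means that every listed iba device is Available.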

Verifying the GX HCA to InfiniBand fabric connectivity by using Linux

Use this procedure to check the status of a GX host channel adapter (HCA) by using the Linux operating system.

See the documentation that is provided with SUSE Linux Enterprise Server 10 (SP2), with the IBM InfiniBand GX HCA driver, and with the OpenIB Gen2 Stack. Also, refer to the instructions in the eHCAD installation file that is contained within the download files from the SourceForge Web site.

Fabric verification

This information describes how to run a fabric verification application and how to check for faults to verify fabric operation.

Recommendations for fabric verification applications are found on the IBM Clusters with the InfiniBand switch Web site. You can also choose to run your own application.

You need to consider how much of the application environment you must start before running your chosen application. The recommendations on the IBM Clusters with the InfiniBand switch Web site require a minimal application environment, and therefore allow for verifying the fabric as early as possible in the installation process.

If you choose to run your own application, you can still use the verification steps in “Verifying the fabric operation” on page 128 as part of your fabric verification procedure.


Fabric verification responsibilities

Unless otherwise agreed on, running the Fabric Verification Tool is the responsibility of the customer.

IBM service is responsible for replacing faulty or damaged cables that have IBM part numbers and that are attached to IBM serviceable servers. Otherwise, either vendor service or the customer is responsible for replacing faulty or damaged cables that either do not have IBM part numbers or are attached to customer serviceable units.

Reference documentation for the fabric verification procedures

See the reference documentation for the fabric verification procedures.

To perform fabric verification procedures, obtain the following documentation:
• As applicable, the fabric verification application documentation and readme file
• Fast Fabric Toolset Users Guide
• QLogic Troubleshooting Guide
• QLogic Switch Users Guide

Fabric verification tasks

Use this procedure to learn how to verify the fabric operation.

To verify fabric operation, complete the following steps:
1. Install the fabric verification application.
2. Set up the fabric verification application.
3. Clear error counters in the fabric to have a clean reference point for subsequent health checks.
4. Perform verification by completing the following steps:
   a. Run the fabric verification application.
   b. Look for events that reveal fabric problems.
   c. Run a health check.
5. Repeat steps 3 and 4 until no additional problems are found in the fabric.

Verifying the fabric operation

Use this procedure for fabric verification.

To verify fabric operation, complete the following steps:
1. Install the fabric verification application by using instructions that accompany the application.
2. Clear the error counters in the fabric by using the /sbin/iba_report -C -o none command.
3. Run the fabric verification application by using instructions that accompany the application. If there are multiple passes, return to step 2 for each pass.
4. Check for problems by using the following steps:
   a. Check serviceable events on all Hardware Management Consoles (HMCs). If there is a serviceable event reported, contact IBM Service. If you set up service event monitoring as in “Setting up remote logging” on page 95, you can check for events on the Cluster Systems Management/Management Server (CSM/MS) first by using the procedures for monitoring in the CSM Administration Guide.
   b. Check the switch and subnet manager logs:
      1) On the CSM/MS, check the /var/log/csm/errorlog/[CSM/MS hostname] log.
      2) If any messages are found, diagnose them by using the “Symptoms of problems” on page 154, and the QLogic Troubleshooting Guide.
   c. Run the Fast Fabric Toolset health check.
      1) On the fabric management server, run the /sbin/all_analysis command.


      2) Check results in the /var/opt/iba/analysis/latest log. To interpret the results, use “Health checking” on page 135 and the Fast Fabric Toolset Users Guide.
5. If a problem is found, return to step 2 on page 128.
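One pass of the clear, run, and check sequence above can be wrapped in a function so that it is rerun in the same way after each repair. This is a sketch under stated assumptions: verification_pass, CLEAR_CMD, VERIFY_CMD, CHECK_CMD, and the ./fabric_verify_app placeholder are illustrative names, and the sketch assumes that the health check command returns a nonzero exit status when it finds problems, which is not stated in this document. The results directory should still be reviewed as described in step 4.

```shell
#!/bin/sh
# One verification pass: clear counters, run the application, run the
# health check. All variable and function names here are illustrative.
CLEAR_CMD=${CLEAR_CMD:-"/sbin/iba_report -C -o none"}
VERIFY_CMD=${VERIFY_CMD:-"./fabric_verify_app"}   # hypothetical application
CHECK_CMD=${CHECK_CMD:-"/sbin/all_analysis"}
RESULTS_DIR=${RESULTS_DIR:-/var/opt/iba/analysis/latest}

verification_pass() {
    # Step 2: start from a clean reference point.
    $CLEAR_CMD || { echo "could not clear error counters"; return 2; }
    # Step 3: send data over the fabric.
    $VERIFY_CMD || echo "verification application reported a failure"
    # Step 4c: ASSUMPTION: a nonzero exit status means problems were found.
    if $CHECK_CMD; then
        echo "health check clean; still review $RESULTS_DIR"
        return 0
    fi
    echo "health check reported problems; review $RESULTS_DIR, repair, rerun"
    return 1
}
```

The function deliberately does not loop on its own, because step 5 expects a repair between passes. The HMC serviceable events and the CSM/MS log from steps 4a and 4b are not covered by this sketch.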

Runtime errors

Use this information to gain a high-level overview of runtime errors.

In an IBM high-performance computing (HPC) cluster, there are several methods for reporting runtime errors. For more details, see “Cluster fabric management flow” and “Servicing clusters” on page 149.

Key runtime problems include the following items:
• IBM system runtime errors are reported to the Manage serviceable events task in the Hardware Management Console (HMC) and include the appropriate field replaceable unit (FRU) lists.
• Vendor switch runtime errors are first reported to the Subnet Manager and switch logs. If remote logging and Cluster Systems Management (CSM) event management are set up, the errors are also reported on the CSM Management Server in /var/log/csm/errorlog/[CSM/MS hostname]. If remote logging and CSM event management are not set up, you must query the fabric management server logs and the switch logs.
• If Fast Fabric health check is used, the output of the health check can also be used to report problems. You must either launch the health check manually, or use a script to launch the health check through a service, such as the cron tool.
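Launching the health check through cron, as mentioned in the last item, could look like the following wrapper. This is only a sketch: run_health_check, HEALTH_CMD, HEALTH_LOG, and the script path in the crontab comment are hypothetical names chosen for illustration, and the meaning of the exit status that is recorded is not defined in this document.

```shell
#!/bin/sh
# Wrapper intended to be started by cron on the fabric management server.
# All names below are illustrative; only /sbin/all_analysis and the results
# directory come from this document.
HEALTH_CMD=${HEALTH_CMD:-/sbin/all_analysis}
HEALTH_LOG=${HEALTH_LOG:-/var/tmp/fabric_health_cron.log}
RESULTS_DIR=${RESULTS_DIR:-/var/opt/iba/analysis/latest}

run_health_check() {
    rc=0
    $HEALTH_CMD >/dev/null 2>&1 || rc=$?
    # Record when the check ran and where to look for its results.
    echo "$(date '+%Y-%m-%d %H:%M:%S') ran $HEALTH_CMD, exit status $rc, results in $RESULTS_DIR" >> "$HEALTH_LOG"
}

# Example crontab entry (hypothetical script path), once per hour:
# 0 * * * * /usr/local/sbin/fabric_health_cron.sh
```

A script installed at the path named in the crontab comment would define this function and then call run_health_check. The log line only records that a run took place, so the health check results themselves still have to be reviewed.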

Managing the cluster fabric

Use this information to learn best practices and theory for the activities, applications, and tasks required for cluster fabric management.

Cluster fabric management flow

Use this information to gain an understanding of the tasks involved in managing the flow of the cluster fabric.

The following figure shows a typical flow of cluster fabric management activities from the point of a successful installation onward. As you work through the “Cluster fabric management tasks” on page 133, refer to this figure.


Figure 14. Cluster fabric management flow


Cluster fabric management components and their use

Learn about the applications to use for cluster fabric management and how you will typically use them.

Use this information to understand the main cluster management subsystem components and the tools that you can use to manage the cluster in a scalable manner.

The Chassis Viewer and switch command line are not described. They are used mainly to manage and work with one switch at a time. The QLogic documentation can help you understand their use. For more information, see the Switch Users Guide, and the Best Practices for Clusters Guide.

The components of cluster fabric management include the following:

Cluster Systems Management
   Cluster Systems Management (CSM) is used to loosely integrate the QLogic management subsystem with the IBM management subsystem.

QLogic subnet manager
   The QLogic subnet manager configures and maintains the fabric.

QLogic Fast Fabric Toolset
   The Fast Fabric Toolset is a suite of management tools from QLogic.

QLogic performance manager
   You typically access the performance manager indirectly. The Fabric Viewer is one tool to access the performance manager. The iba_report command on the Fast Fabric Toolset does not access the performance manager to get link statistics.

Related concepts

“Management subsystem function overview” on page 14
This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function.

Cluster Systems Management

Cluster Systems Management (CSM) is used to loosely integrate the QLogic management subsystem with the IBM management subsystem.

CSM provides two major functions that can be used to manage the fabric:
• Remote logging and event management
• Remote command processing

Use remote logging and event management to consolidate logs and serviceable events from the many components in a cluster in one location, the Cluster Systems Management/Management Server (CSM/MS).

You can use remote command processing (the dsh command) to issue commands to the switches and the fabric management server (which runs the host-based subnet manager and Fast Fabric Toolset). With this capability, you can issue commands to these entities from the CSM/MS just as you can do to the nodes in the cluster. You can do this interactively, or you can use the capability by writing scripts that enable the dsh command to access the switches and Fast Fabric Toolset. With remote command processing, you can run monitoring or management scripts from the central location of the CSM/MS.
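A script that uses remote command processing from the CSM/MS might wrap the dsh invocation in a small function, as in the following sketch. The function name switch_cmd and the DSH and SWITCH_GROUP variables are illustrative. The sketch assumes that a device group named IBSwitches has been set up as described in “Setting up remote command processing”, and that the options are written with ordinary hyphens (-D and --devicetype).

```shell
#!/bin/sh
# Send one switch CLI command to every switch in a CSM device group.
# switch_cmd, DSH, and SWITCH_GROUP are illustrative names.
DSH=${DSH:-dsh}
SWITCH_GROUP=${SWITCH_GROUP:-IBSwitches}   # assumed device group name

switch_cmd() {
    $DSH -D "$SWITCH_GROUP" --devicetype IBSwitch::Qlogic "$@"
}
```

For example, switch_cmd fwVersion would query the firmware level of every switch in the group. Setting DSH=echo prints the command line instead of running it, which is a convenient way to check a monitoring script before pointing it at the switches.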


Related concepts

“Monitoring fabric logs from CSM/MS” on page 134
You can set up the Cluster Systems Management/Management Server (CSM/MS) to automatically monitor for problems.

“Vendor log flow to CSM event management” on page 26
The integration of vendor and IBM log flows is a critical factor in event management.

“Remotely accessing QLogic management tools and commands from CSM/MS” on page 143
The remote processing of QLogic management tools from Cluster Systems Management (CSM) can be an important addition to the management infrastructure. It effectively integrates the QLogic management environment with the IBM management environment.

“Remotely accessing QLogic switches from CSM/MS” on page 144
Remotely accessing switch commands from Cluster Systems Management (CSM) can be an important addition to the management infrastructure. It effectively integrates the QLogic management environment with the IBM management environment.

Related tasks

“Setting up remote logging” on page 95
Remote logging to the Cluster Systems Management/Management Server (CSM/MS) helps you monitor clusters by consolidating logs to a central location.

“Setting up remote command processing” on page 103
Use this procedure to set up remote command processing (dsh) commands from Cluster Systems Management (CSM) to the switches and to the fabric management server.

QLogic Fast Fabric Toolset

The Fast Fabric Toolset is a suite of management tools from QLogic.

Fast Fabric commands and tools are listed in the following table.

Table 61. Recommended Fast Fabric tools and commands

| Tool or command | Comments |
|---|---|
| cmdall | Use this command to issue command-line interface (CLI) commands to all switches simultaneously. |
| Health check tools (for example, all_analysis, fabric_analysis) | Use health check tools to check for problems during installation, problem determination, and repair. You can also run them periodically to proactively check for problems or unexpected changes to the network by comparing current state and configuration with a baseline. |
| captureall | Use this command to capture data for problem determination. |
| pingall | Use this command to ping all the switch chassis on the network to determine if they are accessible from the fabric management server. |
| ibtest | Use this command primarily to update firmware and to reboot the management firmware of the switches (switch chassis management and embedded subnet manager). |
| iba_report | Use this command to generate many different reports on all facets of fabric configuration and operation. |
| Fast Fabric Toolset menu | Use the Fast Fabric Toolset menu, which is a TTY menu, to access Fast Fabric functions. This can be especially helpful in learning the power of Fast Fabric. |

Important information about the Fast Fabric Toolset follows:


• Do not use Fast Fabric tools to manage the IBM servers and IBM host channel adapters (HCAs). Cluster Systems Management (CSM) is the appropriate tool for systems management in an IBM high-performance computing (HPC) cluster. In fact, many of the Fast Fabric tools for node management are not useful in an IBM HPC cluster.
• The Fast Fabric Toolset runs on the fabric management server.
• The Fast Fabric Toolset can only query host-based subnet managers that are on the same fabric management server.
• The Fast Fabric Toolset can only query subnets to which the fabric management server on which it is running is connected. If you have more than four subnets, you will need to work with at least two different fabric management servers to access all subnets.
• You must update the chassis configuration file with the list of switch chassis in the cluster. See “Installing the fabric management server” on page 91.
• You must update the ports configuration file with the list of HCA ports on the fabric management server. See “Installing the fabric management server” on page 91.
• Fast Fabric tools use the performance manager and other performance manager agents to collect link statistics for health checks and to collect the iba_report command results for fabric error checking. Therefore, performance manager must be enabled for such checks to be successful.

Use the QLogic Fast Fabric Toolset for managing and monitoring a cluster fabric. Refer to the Fast Fabric Toolset Users Guide for details about the commands to use. For more information about using Fast Fabric tools, see the QLogic Best Practices for Clusters Guide.

Cluster fabric management tasks

Use this information to learn how to monitor critical cluster fabric components and how to maintain them.

Table 62 lists the tasks you might want to perform and a reference to the appropriate procedure.

Table 62. Cluster fabric management tasks

| Task | Reference |
|---|---|
| To minimize IBM systems management effect on fabric | |
| Restart the entire cluster | “Restarting the cluster” on page 208 |
| Restart one or a few servers | “Restarting or powering off an IBM system” on page 209 |
| To monitor the fabric | |
| Monitor for general problems | “Monitoring the fabric for problems” on page 134 |
| Monitor for fabric-specific problems | “Monitoring fabric logs from CSM/MS” on page 134 |
| Manually query status of the fabric | “Querying status” on page 143 |
| Scripting to QLogic management tools and switches | “Remotely accessing QLogic management tools and commands from CSM/MS” on page 143 |
| Run or update the baseline health check | “Health checking” on page 135 |
| Diagnose symptoms found during monitoring | “Symptoms of problems” on page 154 |
| Map IBM host channel adapter (HCA) device locations | “Mapping of IBM HCA GUIDs to physical HCAs” on page 162 |
| To maintain and change the fabric | |
| Code maintenance | “Updating code” on page 145 |
| Find and interpret configuration changes | “Finding and interpreting configuration changes” on page 147 |
| Verify that new configuration changes are successful | “Verifying repairs and configuration changes” on page 207 |
| Run or update baseline health check | “Health checking” on page 135 |
| Set up Cluster Systems Management (CSM) Event Management for the fabric again | “Reconfiguring the CSM event management” on page 196 |

Monitoring the fabric for problems

Learn how to monitor for problems in the fabric.

To monitor the fabric for problems, you can query logs on the Cluster Systems Management/Management Server (CSM/MS) or use health checks on the fabric management server. Both of these methods can be accomplished on the CSM/MS.

Note: There are also other error indicators that are used less frequently as backups. For details, see “Fault reporting mechanisms” on page 150.

Monitoring fabric logs from CSM/MS

You can set up the Cluster Systems Management/Management Server (CSM/MS) to automatically monitor for problems.

You can use the CSM and the Reliable Scalable Cluster Technology (RSCT) infrastructure to automate the monitoring of problems. However, this monitoring requires you to customize your environment. To accomplish this setup, see CSM Administration Guide and RSCT Administration Guide.

To set up the monitoring of fabric logs from the CSM/MS, you must have set up remote logging.

To check the fabric logs on CSM/MS, go to the /var/log/csm/errorlog/CSM hostname file. This file contains log entries from switches and subnet managers that might point to serviceable events in the fabric. If this log contains entries, see the “Symptoms of problems” on page 154.

Check the audit log on the CSM/MS to see if information has been logged recently to the /var/log/csm/errorlog/CSM hostname file. The advantage of querying the audit log is that it gives you other information about the cluster. Use the lsevent command to query the audit log. More details about the audit log are available in CSM and RSCT documentation.

Table 63. Common uses for the general event listing (lsevent) command

Command | Description
lsevent | General event listing
lsevent | grep "string" | Search for a specific string, or set of strings
lsevent -n [list of nodes] | Search for records from a specific node
lsevent -B MMddhhmmyyyy | Search based on a begin date
lsevent -E MMddhhmmyyyy | Search based on an end date
lsevent -O x | Get last "x" entries
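The begin and end dates in Table 63 use the MMddhhmmyyyy layout. As a minimal sketch, assuming that layout means month, day, hour on a 24-hour clock, minute, and four-digit year, the following helper builds the current time in that form. The function name lsevent_stamp is hypothetical and is not part of CSM.

```shell
# Hypothetical helper: print the current time as MMddhhmmyyyy
# (month, day, hour, minute, year), the layout assumed here for
# the lsevent -B and -E options.
lsevent_stamp() {
    date +%m%d%H%M%Y
}

# Example use on the CSM/MS (shown as a comment, not run here):
#   lsevent -E "$(lsevent_stamp)" | grep "string"
```

The helper only formats a timestamp; verify the expected date layout against the CSM documentation before relying on it.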

Serviceable events that are created on the Hardware Management Console (HMC) and normally viewed by using the manage serviceable events task on the HMC can also be monitored on the CSM/MS. To view the events on the CSM/MS, you must set up event monitoring as described in the CSM Administration Guide. The CSM Administration Guide also describes how to interpret the log information on the CSM/MS. This log gives you the basic information about the hardware serviceable events, including which HMC to query for more details.


Other fabric logs for engineering use might be stored in the /var/log/messages file. These logs are stored there if you set up the switches and fabric management servers to send INFO messages and above to the CSM/MS when you set up remote logging.

Related tasks

“Setting up remote logging” on page 95
Remote logging to the Cluster Systems Management/Management Server (CSM/MS) helps you monitor clusters by consolidating logs to a central location.

Health checking

Health checking provides methods to check for errors and the overall health of the fabric.

Before setting up health checking, obtain the Fast Fabric Toolset Users Guide for reference.

Health checking is done at several different times, and the method for interpreting results varies depending on what you are trying to accomplish. The most generic health checking command available is the all_analysis command. There are some underlying health-checking tools beneath the all_analysis command that are described in the Fast Fabric Toolset Users Guide. You can also target specific devices and ports with these commands. This information is also documented in the Fast Fabric Toolset Users Guide.

Note: These commands must be run on each fabric management server that has a master subnet manager running on it.

The health checking commands should be run at various times as described in the following list:

v During installation or reconfiguration, to verify that there are no errors in the fabric and that the configuration is as expected. Repeatedly run the /sbin/all_analysis -b command until the configuration is correct.

v After an installation or repair is verified, to save a baseline health check for future comparisons. Repairs that lead to serial number changes on field replaceable units (FRUs), movement of cables, or switch firmware and software updates constitute configuration changes. After such changes, run the /sbin/all_analysis -b command again.

v Periodically, to monitor the fabric. For details, see “Setting up periodic fabric health checking” on page 136. To periodically monitor the fabric, run the /sbin/all_analysis command.

Note: The LinkDown counter in the IBM GX/GX+ host channel adapters (HCAs) is reset as soon as the link shuts down. This action is part of the recovery procedure. While this action is not optimal, the LinkDown counter for the connected switch port provides an accurate count of the number of LinkDown actions for the link.

v To check link error counters without comparing against the baseline for configuration changes, use the /sbin/all_analysis -e command.

v During debug, to query the fabric. This query can be helpful for performance problem debugging. To save the history during debugging, use the /sbin/all_analysis -s command.

v During repair verification, to identify errors or inadvertent changes by comparing the latest health check results to the baseline health check results.
  – To save history during queries: /sbin/all_analysis -s
  – If the configuration is changed (with new part serial numbers), a new baseline is required. Use the /sbin/all_analysis -b command.
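The choice of all_analysis option in the preceding list depends on the purpose of the run. The following sketch summarizes that mapping. The function name health_check_cmd and the purpose keywords are hypothetical; only the command lines come from the list. The function prints the command instead of running it.

```shell
# Hypothetical wrapper: map a health-check purpose to the all_analysis
# invocation described in the list above. It only prints the command line.
health_check_cmd() {
    case "$1" in
        baseline) echo "/sbin/all_analysis -b" ;;  # installation, reconfiguration, new baseline
        monitor)  echo "/sbin/all_analysis" ;;     # periodic comparison against the baseline
        errors)   echo "/sbin/all_analysis -e" ;;  # link error counters only
        debug)    echo "/sbin/all_analysis -s" ;;  # save history while debugging
        *)        echo "unknown purpose: $1" >&2; return 1 ;;
    esac
}
```

For example, health_check_cmd baseline prints the command to use after a verified repair that changed a part serial number.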

The following files are important setup files for Fast Fabric Health Check. Details about how to set them up are found in the Fast Fabric Toolset Users Guide.

Note: These files must be changed on each fabric management server.

v To see the basic setup file: /etc/sysconfig/fastfabric.conf
v For a list of switch chassis: /etc/sysconfig/iba/chassis
v For a list of switch chassis running embedded SM: /etc/sysconfig/iba/esm_chassis
v For a list of ports on Fabric/MS: /etc/sysconfig/iba/ports (format equals “hca:port” and space delimited)

Related tasks

“Installing the fabric management server” on page 91
The installation of the fabric management server is performed by the customer.

Setting up periodic fabric health checking:

Periodic fabric health checking should be set up to ensure that nothing has changed in the fabric that might affect performance.

The following information is based on setting up health checking that is performed at least once each day.

When setting up regular health checking periods, set the error thresholds appropriately in relation to the frequency at which health checking is performed. Most default thresholds will not be affected by this procedure. The key threshold to be changed is the symbol error threshold.

Setting the threshold is difficult at this level, because the symbol error threshold applies to every symbol error that is detected on the link rather than to some time-based threshold in the hardware. Therefore, if you want to set your own thresholds, read the rest of this topic so you can avoid calling out healthy links as faulty.

There are two rules to be used with respect to an allowable number of symbol errors on a given link and also throughout the entire fabric. Because the number of allowable errors on a given link is a worst-case probability, it is not acceptable to allow all links in a fabric to experience the maximum allowable number of symbol errors, nor is it probable that a cluster will typically experience such a condition. Therefore, the number of allowable symbol errors in the entire fabric is based on a combination of factors involving recovery time for errors and the potential impact to the variability of fabric performance. The limit rules are shown in the following list:

1. A link should not experience more than 10 symbol errors in a given 24-hour time period.
2. For any size cluster, there should be 432 or fewer symbol errors in a given 24-hour time period.

With the above rules in mind, two different query intervals (4 hours and 1 hour) will be addressed with guidance on how to set the thresholds.

At regular intervals, you need to clear the error counters, because the thresholds are not time-based, but simple count-based thresholds. In the following procedures, the error counters are cleared every 24 hours.
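The two limit rules can be checked mechanically once per-link counts for a 24-hour period are available. The following is a minimal sketch under stated assumptions: the input format (one "link-name count" pair per line) and the function name check_symbol_limits are hypothetical and are not produced by any Fast Fabric tool.

```shell
# Reads "<link-name> <symbol-error-count>" pairs for one 24-hour period on
# standard input (a hypothetical format) and reports violations of the two
# limit rules: more than 10 errors on a link, or more than 432 in the fabric.
# Exit status is 0 when both rules hold and 1 otherwise.
check_symbol_limits() {
    awk '
        BEGIN { bad = 0; total = 0 }
        $2 > 10 { printf "link %s exceeds 10 symbol errors: %d\n", $1, $2; bad = 1 }
        { total += $2 }
        END {
            if (total > 432) {
                printf "fabric total exceeds 432 symbol errors: %d\n", total
                bad = 1
            }
            exit bad
        }
    '
}
```

A link with exactly 10 errors passes the first rule, because the rule allows up to 10 errors per link in a 24-hour period.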

Monitoring for 24-hour cycles at 4-hour intervals:

To set up a 24-hour monitoring cycle at 4-hour intervals, complete the following steps.

1. Save the original file by using the following command:

    cp -p /etc/sysconfig/iba/iba_mon.conf /etc/sysconfig/iba/iba_mon.conf.original

2. Create a new file for each time period throughout a 24-hour cycle. Use this file to point to a specific threshold for that time period. This will help reduce false callouts of suspected faulty links. Because you need to refer to these files with the all_analysis script, name them based on the time period in which they will be used, such as iba_mon.conf.time period.

3. Edit the file to update the symbol error threshold to the value in Table 64 on page 137. The default is 100. Leave all other thresholds at their default values as shown in the following example:

    Number of Error Counters
    SymbolErrorCounter 100

   For example, using Table 64 for hour 12, you would have a file named iba_mon.conf.12, with the following symbol error threshold setting:

    Number of Error Counters
    SymbolErrorCounter 5

4. Set up cron jobs to run the all_analysis command with different threshold files. For example, if you start the 24-hour interval at 6 a.m., the crontab would look like the following example, which assumes that the switch names begin with SilverStorm*, and shows that at 6 a.m., -C is used to reset the counters:

    0 6 * * * FF_FABRIC_HEALTH=" -s -C -o errors -o slowlinks -Fnodepat:SilverStorm*" /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.0
    0 10 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.4
    0 14 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.8
    0 18 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.12
    0 22 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.16
    0 2 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.20
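The crontab in step 4 follows a regular pattern: each entry runs at the start hour plus an offset, and uses the threshold file named for that offset. The following sketch generates such entries for a chosen start hour and interval. The function name print_health_crontab is hypothetical, and the FF_FABRIC_HEALTH value is copied from the example, including its assumption that switch names begin with SilverStorm.

```shell
# Hypothetical generator: print crontab lines for a 24-hour cycle that starts
# at clock hour $1 and runs a check every $2 hours. The first entry clears the
# counters (-C); each entry uses the threshold file named for its offset.
print_health_crontab() {
    start=$1
    step=$2
    offset=0
    while [ "$offset" -lt 24 ]; do
        hour=$(( (start + offset) % 24 ))
        if [ "$offset" -eq 0 ]; then
            prefix='FF_FABRIC_HEALTH=" -s -C -o errors -o slowlinks -Fnodepat:SilverStorm*" '
        else
            prefix=''
        fi
        printf '0 %d * * * %s/sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.%d\n' \
            "$hour" "$prefix" "$offset"
        offset=$(( offset + step ))
    done
}
```

Running print_health_crontab 6 4 produces six entries for a cycle that starts at 6 a.m. Review the output before installing it with crontab.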

Table 64. Symbol error thresholds (24-hour cycle with 4-hour intervals)

      | 0 hour | 4 hours | 8 hours | 12 hours | 16 hours | 20 hours
Count | 10     | 3       | 4       | 5        | 7        | 9
Clear | Yes    | No      | No      | No       | No       | No
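Steps 2 and 3 can be scripted. The following sketch copies a base threshold file once per time period and rewrites the symbol error threshold with the Table 64 values. It assumes that the file contains a line of the form "SymbolErrorCounter number", as in the step 3 example. The function names are hypothetical.

```shell
# Hypothetical helper: copy a base iba_mon.conf to <base>.<hour> and set the
# SymbolErrorCounter threshold in the copy. The base file is not modified.
make_threshold_file() {
    base=$1
    hour=$2
    count=$3
    sed "s/^SymbolErrorCounter[[:space:]].*/SymbolErrorCounter $count/" \
        "$base" > "$base.$hour"
}

# Create one file per period by using the Table 64 values, written here as
# "hour:count" pairs (hour 0 is the start of the 24-hour cycle).
make_table64_files() {
    for pair in 0:10 4:3 8:4 12:5 16:7 20:9; do
        make_threshold_file "$1" "${pair%%:*}" "${pair##*:}"
    done
}
```

For example, make_table64_files /etc/sysconfig/iba/iba_mon.conf would create iba_mon.conf.0 through iba_mon.conf.20 next to the original file. Try it on a copy first.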

Monitoring for 24-hour cycles at 1-hour intervals:

To set up a 24-hour monitoring cycle at 1-hour intervals, complete the following steps:

1. Save the original file by using the following command:

    cp -p /etc/sysconfig/iba/iba_mon.conf /etc/sysconfig/iba/iba_mon.conf.original

2. Create a new file for each time period throughout a 24-hour cycle. Use this file to point to a specific threshold for that time period. This helps to reduce false callouts of suspected faulty links. Because you need to refer to these files with the all_analysis script, name them based on the time period in which they will be used, for example iba_mon.conf.time period.

3. Edit the file to update the symbol error threshold to the value in Table 65 on page 138. The default is 100. Leave all other thresholds at their default values as shown in the following example:

    Number of Error Counters
    SymbolErrorCounter 100

   For example, using Table 65 on page 138 for hour 12, you would have a file named iba_mon.conf.12, with the following symbol error threshold setting:

    Number of Error Counters
    SymbolErrorCounter 5

4. Set up cron jobs to run the all_analysis command with different threshold files. For example, if you start the 24-hour interval at 6 a.m., the crontab would look like the following, which assumes that the switch names begin with SilverStorm*, and shows that at 6 a.m., -C is used to reset the counters:

    0 6 * * * FF_FABRIC_HEALTH=" -s -C -o errors -o slowlinks -Fnodepat:SilverStorm*" /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.0
    0 7-11 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.1-4
    0 12-15 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.8-11
    0 18 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.12
    0 19-20 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.13-14
    0 21-22 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.15-16
    0 23 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.17-19
    0 0-1 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.17-19
    0 2-3 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.20-21
    0 4-5 * * * /sbin/all_analysis -c /etc/sysconfig/iba/iba_mon.conf.22-23


Table 65. Symbol error thresholds (24-hour cycle with 1-hour intervals)

      | 0   | 1 - 4 | 8 - 11 | 12 | 13 - 14 | 15 - 16 | 17 - 19 | 20 - 21 | 22 - 23
Count | 10  | 3     | 4      | 5  | 6       | 7       | 8       | 9       | 10
Clear | Yes | No    | No     | No | No      | No      | No      | No      | No

If you want to use different intervals, use the following formula:

    Roundup((10 errs/24 hours)*(number of hours from 0))

Example for the 12th hour in a day:

    Roundup((10 errs/24 hours)*12) = 5

The minimum recommended error threshold is 3, because it is possible to get a burst of two errors within a short time and still have a healthy link.
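The formula and the minimum threshold can be combined in a small function. The following sketch uses integer arithmetic for the round-up step. The function name symbol_threshold is hypothetical.

```shell
# Symbol error threshold for a check made $1 hours after the counters were
# cleared: Roundup((10 errors / 24 hours) * hours), never lower than 3.
symbol_threshold() {
    hours=$1
    t=$(( (10 * hours + 23) / 24 ))   # integer ceiling of 10*hours/24
    if [ "$t" -lt 3 ]; then
        t=3                           # recommended minimum threshold
    fi
    echo "$t"
}
```

For the 4-hour schedule, symbol_threshold gives the same values as Table 64 for hours 4, 8, 12, 16, and 20.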

Output files for health check:

Learn about the output files for the Fast Fabric health check.

The Fast Fabric health check output files are documented in the Fast Fabric Toolset Users Guide. The following information provides some of the key aspects of the output files:

v The location of the output files is configurable in the /etc/sysconfig/fastfabric.conf file.

v The default location of output files is /var/opt/iba/analysis/[baseline | latest | timestamp]. The $FF_ANALYSIS_DIR variable defines the output directory with the default of /var/opt/iba/analysis.

v The file name equals [type of health check].[fast fabric command].[suffix]

  The types of health check for the output files are shown in the following list:
  – fabric: Basically subnet manager queries about fabric status
  – chassis: Switch chassis firmware queries
  – hostsm: Queries about subnet manager configuration
  – esm: Queries about embedded subnet manager configuration

v The Fast Fabric commands used by health check are detailed in the Fast Fabric Toolset Users Guide.

v The suffixes for the output files are shown in the following list:
  – .errors: Errors exist in fabric.
    Note: LinkDown errors are only reported by the switch side of an IBM GX+ HCA to the switch link.
  – .diff: Change from baseline; see “Interpreting .diff files” on page 141.
  – .stderr: Error in operation of health check; contact your next level of support.

v All output files must be queried before taking a new baseline health check to ensure that the saved configuration information is correct.

v The all_analysis command is a wrapper for fabric_analysis, chassis_analysis, hostsm_analysis, and esm_analysis.

v The analysis routines use the iba_report command to gather information.

v Key output files to check for problems follow:
  – fabric*.links
  – fabric*.errors: Record the location of the problem. See “Diagnosing link errors” on page 175.
  – chassis*.errors: Record the location of the problem. See “Symptoms of problems” on page 154.
  – *.diff: There is a difference from the baseline to the latest health check run. See “Interpreting .diff files” on page 141.
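As a quick first pass over the results, you can list the files with the suffixes described above that actually contain data. The following is a minimal sketch. It assumes that an empty file means that there is nothing to report, and it falls back to the default output location when FF_ANALYSIS_DIR is not set. The function name list_problem_files is hypothetical.

```shell
# Hypothetical helper: list non-empty .stderr, .errors, and .diff files in a
# results directory. With no argument, the latest directory under
# $FF_ANALYSIS_DIR (default /var/opt/iba/analysis) is used.
list_problem_files() {
    dir=${1:-${FF_ANALYSIS_DIR:-/var/opt/iba/analysis}/latest}
    for f in "$dir"/*.stderr "$dir"/*.errors "$dir"/*.diff; do
        # -s is true only for files that exist and are not empty
        [ -s "$f" ] && echo "$f"
    done
    return 0
}
```

The listing is only a starting point. It does not replace the health check exit status or the tool output that names the files to review.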

While the following information is intended to be comprehensive in describing how to interpret the health check results, for the most recent information about health check, see the Fast Fabric Users Guide.

When any of the health check tools are run, the overall success or failure is indicated in the output of the tool and its exit status. The tool indicates which areas had problems and which files must be reviewed. The results from the latest run can be found in the $FF_ANALYSIS_DIR/latest/ directory. Many files can be found in this directory that indicate both the latest configuration of the fabric and errors or differences found during the health check. If the health check fails, the following paragraphs discuss an order in which to review these files.

If the -s option (save history) was used when running the health check, a directory whose name is the date and time of the failing run is created under the FF_ANALYSIS_DIR directory. In that case, that directory can be consulted instead of the latest directory shown in the following examples.

Review the results for any esm (if using embedded subnet managers) or hostsm (if using host-based subnet managers) health check failures first. If the subnet manager is incorrectly configured or not running, it can cause other health checks to fail. In that case, the subnet manager problems must be corrected first, the health check must then be rerun, and other problems must then be reviewed and corrected as needed.

For a hostsm analysis, review the files in the following order:

latest/hostsm.smstatus
    Ensure that this file indicates that the subnet manager is running. If no subnet managers are running on the fabric, that problem must be corrected before proceeding further. After being corrected, the health checks must be rerun to look for more errors.

latest/hostsm.smver.diff
    This file indicates that the subnet manager version has changed. If this change is not expected, the subnet manager must be corrected before proceeding further. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun after all other health check errors have been corrected.

latest/hostsm.smconfig.diff
    This file indicates that the subnet manager configuration has changed. This file must be reviewed, and, as necessary, the latest/hostsm.smconfig file must be compared to the baseline/hostsm.smconfig file. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

For an esm analysis, the FF_ESM_CMDS configuration setting selects which ESM commands are used for the analysis. When using the default setting for this parameter, the files must be reviewed in the following order:

latest/esm.smstatus
    Ensure that this file indicates that the subnet manager is running. If no subnet managers are running on the fabric, that problem must be corrected before proceeding further. After being corrected, the health checks must be rerun to look for more errors.

latest/esm.smShowSMParms.diff
    This file indicates that the subnet manager configuration has changed. This file must be reviewed, and, as necessary, the latest/esm.smShowSMParms file must be compared to the baseline/esm.smShowSMParms file. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

latest/esm.smShowDefBcGroup.diff
    This file indicates that the subnet manager broadcast group for IPoIB configuration has changed. This file must be reviewed, and, as necessary, the latest/esm.smShowDefBcGroup file must be compared to the baseline/esm.smShowDefBcGroup file. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

latest/esm.*.diff
    If the FF_ESM_CMDS setting has been changed, the changes in results for those additional commands must be reviewed. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

Next, review the results of the fabric analysis for each configured fabric. If nodes or links are missing, the fabric analysis detects them. Missing links or nodes can cause other health checks to fail. If such failures are expected (for example, a node or switch is offline), further review of result files can be performed. You must be aware that the loss of the node or link can cause other analyses to also fail.

The following information presents the analysis order for the fabric.0:0 files. If other or additional fabrics are configured for analysis, you must review the files in the order shown in the following list for each fabric. There is no specific order for which fabric to review first.

latest/fabric.0:0.errors.stderr
    If this file is not empty, it can indicate problems with the iba_report command (such as the inability to access a subnet manager), which can result in unexpected problems or inaccuracies in the related errors file. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.

latest/fabric.0:0.errors
    If any links with excessive error rates or incorrect link speeds are reported, they must be corrected. If there are links with errors, be aware that the same links might also be detected in other reports such as the links and comps files.

latest/fabric.0:0.snapshot.stderr
    If this file is not empty, it can indicate problems with the iba_report command (such as the inability to access a subnet manager), which can result in unexpected problems or inaccuracies in the related links and comps files. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.

latest/fabric.0:0.links.stderr
    If this file is not empty, it can indicate problems with the iba_report command, which can result in unexpected problems or inaccuracies in the related links file. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.

latest/fabric.0:0.links.diff
    This file indicates that the links between components in the fabric have been changed, removed, or added, or that components in the fabric have disappeared. This file must be reviewed and, as necessary, the latest/fabric.0:0.links file must be compared to the baseline/fabric.0:0.links file. If components have disappeared, review of the latest/fabric.0:0.comps.diff file might be easier for such components. If necessary, correct missing nodes and links. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

latest/fabric.0:0.comps.stderr
    If this file is not empty, it can indicate problems with the iba_report command, which can result in unexpected problems or inaccuracies in the related comps file. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.

latest/fabric.0:0.comps.diff
    This file indicates that the components in the fabric or their Subnet Management Agent (SMA) configuration has changed. This file must be reviewed, and, as necessary, the latest/fabric.0:0.comps file must be compared to the baseline/fabric.0:0.comps file. If necessary, correct missing nodes, ports that are down, and incorrect port configurations. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

Finally, review the results of the chassis analysis. If the chassis configuration has changed, the chassis analysis detects the change. The FF_CHASSIS_CMDS and FF_CHASSIS_HEALTH configuration settings select which chassis commands are used for the analysis. When using the default settings for these parameters, the files must be reviewed in the following order:

latest/chassis.hwCheck
    Ensure that this file indicates all chassis are operating appropriately with the wanted power and cooling redundancy. If there are problems, they must be corrected, but other analysis files can be analyzed first. Once any problems are corrected, the health checks must be rerun to verify the correction.

latest/chassis.fwVersion.diff
    This file indicates the chassis firmware version has changed. If this change was not an expected change, the chassis firmware must be corrected before proceeding further. After correcting the firmware version, rerun the health checks to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

latest/chassis.*.diff
    These files reflect other changes to chassis configuration based on checks selected through the FF_CHASSIS_CMDS setting. The changes in results for these remaining commands must be reviewed. If necessary, correct the chassis. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.

If any health checks fail, after correcting the related problems, another health check must be run to verify that all the problems are corrected. If the failures are due to expected and permanent changes, once all other errors have been corrected, a baseline must be rerun.

Interpreting .diff files:

Use this information to help you interpret the difference between the baseline health check and the current health check.

If the results files of a Fast Fabric health check include a file named *.diff, then there is a difference between the baseline and the current health check. This file is generated by the health check comparison algorithm by using the diff command. The first file (file1) is the baseline file, and the second file (file2) is the most recent file.

The default diff format shows one line of context before and after the altered data. This format is the same as the output of diff -C 1. You can change it by setting your preferred diff command and options in the FF_DIFF_CMD variable in the fastfabric.conf file. For more details, see the Fast Fabric Toolset Users Guide. The following information assumes that the default context is being used.

Entries similar to the following example are repeated throughout the *.diff file. These lines indicate how the baseline (file1) differs from the latest (file2) health check.

    *** [line 1], [line 2] ****
    lines from the baseline file
    --- [line 1], [line 2] ----
    lines from the most recent file

The first set of lines, introduced by a line enclosed in asterisks (*), shows the line numbers and the lines from the baseline file that have been altered. The associated line numbers and data from the latest file follow.

Use the man diff command to get more details about the diff format.
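To become familiar with the format, you can produce the same kind of output by hand from any two text files. The following sketch wraps diff with one line of context, which is the default format described above. The function name context_diff is hypothetical; the health check itself runs the comparison for you.

```shell
# Compare a baseline file ($1) with the latest file ($2) by using one line of
# context. diff exits with status 1 when the files differ, which is not an
# error here; only status 2 or higher (trouble) makes this function fail.
context_diff() {
    status=0
    diff -C 1 "$1" "$2" || status=$?
    [ "$status" -le 1 ]
}
```

In the output, lines marked with an exclamation point (!) changed, lines marked with a minus sign (-) exist only in the baseline, and lines marked with a plus sign (+) exist only in the latest file.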

Several scenarios are shown in the following examples:

The following is an example of what might be seen when swapping two ports on the same host channel adapter (HCA).

    ***************
    *** 25,29 ****
      10g 0x00025500000da080 1 SW IBM logical switch 1
    - <-> 0x00066a0007000ced 8 SW SilverStorm 9120 GUID=0x00066a00020001d9 Leaf 1,Chip A
    - 10g 0x00025500000da081 1 SW IBM logical switch 2
      <-> 0x00066a00d90003d6 14 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6
      30g 0x00025500000da100 1 CA IBM logical HCA 0
    --- 25,29 ----
      10g 0x00025500000da080 1 SW IBM logical switch 1
      <-> 0x00066a00d90003d6 14 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6
    + 10g 0x00025500000da081 1 SW IBM logical switch 2
    + <-> 0x00066a0007000ced 8 SW SilverStorm 9120 GUID=0x00066a00020001d9 Leaf 1,Chip A
      30g 0x00025500000da100 1 CA IBM logical HCA 0

Use Table 66 to view detailed information about the differences of the baseline and the latest health check information in the previous example.

Table 66. Differences between the baseline and current health check information after swapping two ports on the same host channel adapter (HCA)

HCA Port | Connected to switch port in baseline | Connected to switch port in latest
0x00025500000da080 1 SW IBM logical switch 1 | 0x00066a0007000ced 8 SW SilverStorm 9120 GUID=0x00066a00020001d9 Leaf 1, | 0x00066a00d90003d6 14 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6
0x00025500000da081 1 SW IBM logical switch 2 | 0x00066a00d90003d6 14 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6 | 0x00066a0007000ced 8 SW SilverStorm 9120 GUID=0x00066a00020001d9 Leaf 1,

The following example shows what might be seen after swapping two ports on the same switch:

    ***************
    *** 17,19 ****
      10g 0x00025500000d8b80 1 SW IBM logical switch 1
    ! <-> 0x00066a00d90003d6 15 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6
      10g 0x00025500000d8b81 1 SW IBM logical switch 2
    --- 17,19 ----
      10g 0x00025500000d8b80 1 SW IBM logical switch 1
    ! <-> 0x00066a00d90003d6 14 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6
      10g 0x00025500000d8b81 1 SW IBM logical switch 2
    ***************
    *** 25,27 ****
      10g 0x00025500000da080 1 SW IBM logical switch 1
    ! <-> 0x00066a00d90003d6 14 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6
      10g 0x00025500000da081 1 SW IBM logical switch 2
    --- 25,27 ----
      10g 0x00025500000da080 1 SW IBM logical switch 1
    ! <-> 0x00066a00d90003d6 15 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d6
      10g 0x00025500000da081 1 SW IBM logical switch 2


Table 67. Differences between the baseline and current health check information after swapping two ports on the same switch

| Switch port | Connected to HCA port in baseline | Connected to HCA port in latest |
|---|---|---|
| 0x00066a00d90003d6 15 SW SilverStorm 9024 DDR | 0x00025500000d8b80 1 SW IBM logical switch 1 | 0x00025500000da080 1 SW IBM logical switch 1 |
| 0x00066a00d90003d6 14 SW SilverStorm 9024 DDR | 0x00025500000da080 1 SW IBM logical switch 1 | 0x00025500000d8b80 1 SW IBM logical switch 1 |

Querying status:

You can use several methods to query fabric status.

The following methods can be used to query the status of the fabric:
• Check logs on the Cluster Systems Management/Management Server (CSM/MS) as described in “Monitoring fabric logs from CSM/MS” on page 134.
• Perform a Fast Fabric Toolset health check as described in “Health checking” on page 135.
• Check the Manage Serviceable Events task on the Hardware Management Console (HMC) to see if there are logs regarding the host channel adapters (HCAs) in field replaceable units (FRU) lists. You can also use monitoring as described in the CSM Administration Guide.
• Use the Fast Fabric Toolset iba_report command. See the Fast Fabric Toolset Users Guide for details about the iba_report command. Many of the typical checks that you would do with the iba_report command are done in the health check. However, you can do many more targeted queries by using the iba_report command. For more information, see “Tips: Using iba_report” on page 147.
• Use the Fast Fabric Toolset saquery command to complement the iba_report command. For more information about the saquery command, see the Fast Fabric Toolset Users Guide.
• Use the Chassis Viewer to query one switch at a time. For more information, see the Switch Users Guide.
• Use Fabric Viewer to obtain a graphical interface representation of the fabric. For more information, see the Fabric Viewer Users Guide.

These methods might not include all the possible methods for querying status. Further information is available in the Switch Users Guide, the Fabric Manager and Fabric Viewer Users Guide, and the Fast Fabric Toolset Users Guide.

Remotely accessing QLogic management tools and commands from CSM/MS

The remote processing of QLogic management tools from Cluster Systems Management (CSM) can be an important addition to the management infrastructure. It effectively integrates the QLogic management environment with the IBM management environment.

With remote command processing, you can do manual queries from the Cluster Systems Management/Management Server (CSM/MS) console without having to log in to the fabric management server. You can also write management and monitoring scripts that run from the CSM/MS, which can improve the productivity for administration of the cluster fabric. You can write scripts to act on nodes based on fabric activity or to act on the fabric based on node activity.

After you have set up remote command processing from the CSM/MS to the fabric management server, you can access any command that does not require user interaction by issuing the following dsh command from the CSM/MS:

dsh -d [fabric management server IP] [command list]


Any typical dsh command string can be used. The fabric management server can be set up as a device in the CSM/MS device database rather than as a node. If you set it up as a node, you must use a parameter other than -d to point to it.

Related tasks

“Setting up remote command processing” on page 103
Use this procedure to set up remote command processing (dsh) commands from Cluster Systems Management (CSM) to the switches and to the fabric management server.
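The dsh invocation for the fabric management server can be wrapped in a small shell function for use in monitoring scripts on the CSM/MS. The following is a minimal sketch and is not part of CSM or the Fast Fabric Toolset: the function name fm_run and the address 192.0.2.10 are hypothetical, and the sketch assumes that the fabric management server is defined as a device so that the -d flag applies.

```shell
# Hypothetical helper (not a CSM command): run a non-interactive command on
# the fabric management server through dsh.
# $1 is the fabric management server device name or IP address;
# the remaining arguments form the command list.
fm_run() {
    if [ "$#" -lt 2 ]; then
        echo "usage: fm_run <fabric-mgmt-server> <command...>" >&2
        return 2
    fi
    fm_server=$1
    shift
    # -d points dsh at a device; use a different flag if the server is a node.
    dsh -d "$fm_server" "$@"
}

# Example with an assumed address: report links that exceed the error threshold.
# fm_run 192.0.2.10 'iba_report -o errors'
```

Keeping the server address as the first argument lets the same function be reused for each primary fabric management server.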

Remotely accessing QLogic switches from CSM/MS

Remotely accessing switch commands from Cluster Systems Management (CSM) can be an important addition to the management infrastructure. It effectively integrates the QLogic management environment with the IBM management environment.

With remote command processing, you can run manual queries from the Cluster Systems Management/Management Server (CSM/MS) console without logging in to the switch. You can also write management and monitoring scripts that run from the CSM/MS, which can improve the productivity for the administration of the cluster fabric. You can write scripts to act on nodes based on switch activity or to act on switches based on node activity.

The CSM remote command capabilities that support QLogic switches follow:
• The dsh command processing is supported for QLogic switches.
• The updatehwdev command is supported for QLogic switches to transfer Secure Shell (SSH) keys from the Management Server to the QLogic switch.
• The dshbak command is supported for QLogic switches.
• The dsh -z flag is supported for QLogic switches. It displays the exit status of the last remotely processed command.
• The device group is supported for QLogic switches.

The switches use a proprietary command-line interface (CLI). For CSM to work with the switch CLI, certain profiles need to be set up, and a dsh command parameter must be used to refer to the switch command profile. The following items highlight the important aspects in this setup:
• Create the /var/opt/csm/IBSwitch/QLogic/config command definition file with the following attributes:
  – SSH key exchange command for CLI: ssh-setup-command=sshKey add
  – The dsh does not try to set environment: pre-command=NULL
  – Last command for the return code: post-command=showLastRetcode -brief
• Set up the switches as devices with the following attributes:
  – DeviceType=IBSwitch::Qlogic
  – RemoteShellUser=admin
  – RemoteShell=/usr/bin/ssh
• Define at least one device group for all switches by using the hwdefgrp command with the following attributes:
  – DeviceType=='IBSwitch::Qlogic'
  – Suggested Group name = IBSwitches
• Make sure that keys are exchanged by using the updatehwdev command.

After you set up remote command processing from the CSM/MS to the switches, you can access any command to the switch that does not require user interaction by issuing the following dsh command from the CSM/MS:

dsh -d [switch device list] --devicetype IBSwitch::Qlogic [switch command]


You could also use a device grouping, which is a standard CSM technique:

dsh -D IBSwitches --devicetype IBSwitch::Qlogic [switch command]
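For scripts that issue several switch commands, the device-group form of the dsh command can be captured once in a shell function. This is a sketch under stated assumptions: the function name switch_run and the SWITCH_GROUP variable are hypothetical, and the default group name IBSwitches is only the suggested name, so it must match the group that you defined with hwdefgrp.

```shell
# Hypothetical helper: send one non-interactive switch CLI command to every
# switch in a CSM device group.
# SWITCH_GROUP (assumed variable) overrides the suggested group name.
switch_run() {
    group=${SWITCH_GROUP:-IBSwitches}
    if [ "$#" -eq 0 ]; then
        echo "usage: switch_run <switch command...>" >&2
        return 2
    fi
    # --devicetype selects the QLogic switch command definition file.
    dsh -D "$group" --devicetype IBSwitch::Qlogic "$@"
}

# Example: switch_run '[switch command]'
```

Because interactive commands cannot be processed this way, the function is suitable only for switch commands that return without prompting.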

If you want to access switch commands that require user responses, the standard technique is to write an Expect script to communicate with the switch CLI.

You might want to remotely access switches to gather data or issue commands. You cannot work with interactive commands through remote command processing.

You might choose to use some of the Fast Fabric Toolset command scripts that perform operations that otherwise require user interaction on the switch CLI. In that case, you can still do remote command processing from CSM; however, you must issue the command to the Fast Fabric Toolset on the fabric management server.

The following methods are not supported in CSM with QLogic switches:
• Processing the dsh command in interactive mode is not supported for QLogic switches.
• Processing the dsh command in DSH context is not supported for non-node devices.
• The rsh shell is not supported. Only the SSH remote shell is supported.
• Only one device type is supported in one dsh/updatehwdev command. (The updatehwdev -a/dsh -A command is not supported when there is a mixture of user-defined devices (QLogic) and common devices defined.)
• The dcp command is not supported with the QLogic switches.

Related tasks

“Setting up remote command processing” on page 103
Use this procedure to set up remote command processing (dsh) commands from Cluster Systems Management (CSM) to the switches and to the fabric management server.

Updating code

Learn where to find documentation for each of the key code areas that affects the fabric.

The following table provides references and describes impacts for key code updates. In some cases, errors might be logged because links go down as a result of a server being restarted or powered on.

Table 68. Updating Code: References and Impacts

| Code areas | Reference | Impact |
|---|---|---|
| Cluster Systems Management (CSM) | CSM Administration Guide | CSM event management is interrupted. A reboot operation interrupts remote logging. |
| IBM GX/GX+ host channel adapter (HCA) device driver | Code Release notes and operating system manuals | The fabric is impacted. A reboot operation causes links to go down and errors to be logged. |
| IBM system firmware and power firmware | Updates | Concurrent updates have no impact. Nonconcurrent updates cause links to go down and errors to be logged. |
| Fabric Manager (including SM) | Fabric Manager Users Guide; Fast Fabric Toolset Users Guide; see “Updating Fabric Manager code.” | Subnet recovery capabilities are lost during the update. If a hardware error occurs at this time, application performance might suffer. |
| Switch chassis management | Switch Users Guide; Fast Fabric Toolset Users Guide; see “Updating switch chassis code.” | The fabric is not impacted. If a hardware error occurs during this time, it is not reported unless the error still exists when the new code comes up. |

Updating Fabric Manager code

Use this information for guidance on updating Fabric Manager code.

The Fabric Manager code updates are documented in the Fabric Manager Users Guide, but the following items must be considered:
• For the host-based Fabric Manager, use the instructions for updating the code found in the Fabric Manager Users Guide.
  – Save the iview_fm.config file to a safe location so that if anything goes wrong during the installation process you can recover this key file.
  – Use the remote command processing from Cluster Systems Management/Management Server (CSM/MS) as in the following example, where c171opsm3 is the fabric management server address. The cd that precedes the installation command is required so that the command is run from the correct path.

    dsh -c -d c171opsm3 'cd /root/infiniserv_software/InfiniServMgmt.4.1.1.0.15; ./INSTALL -i mpi'

• For embedded subnet managers, the Fast Fabric Toolset can be used to update the code across all switches simultaneously by using the ibtest command. For more information, see the Fast Fabric Toolset Users Guide. If you only need to update code on one switch, you can use the Chassis Viewer. For more information, see the Switch Users Manual.
  – You need to place the new embedded subnet manager code on the fabric management server.
  – If you have multiple primary fabric management servers, you can issue the ibtest command from CSM/MS by using the dsh command to all of the primary fabric management servers simultaneously. This capability must be set up using remote command processing.

Related tasks

“Setting up remote command processing” on page 103
Use this procedure to set up remote command processing (dsh) commands from Cluster Systems Management (CSM) to the switches and to the fabric management server.

Updating switch chassis code

Use this information for guidance on updating the switch chassis code.

The switch chassis management code is embedded firmware that runs in the switch chassis. The switch chassis code updates are documented in the Switch Users Guide, but the following items must be considered:
• For the switch chassis management code, the Fast Fabric Toolset can update the code across all switches simultaneously by using the ibtest command. For more information, see the Fast Fabric Toolset Users Guide.
• If you only need to update code on one switch, you can do this by using the Chassis Viewer. For details, see the Switch Users Manual.
• You need to place the switch chassis management code on the fabric management server.
• If you have multiple primary fabric management servers, you can issue the ibtest command from Cluster Systems Management/Management Server (CSM/MS) by using the dsh command to all the primary fabric management servers simultaneously. This capability must be set up by using remote command processing.

Related tasks

“Setting up remote command processing” on page 103
Use this procedure to set up remote command processing (dsh) commands from Cluster Systems Management (CSM) to the switches and to the fabric management server.

Finding and interpreting configuration changes

This information can be used to find and interpret configuration changes by using the Fast Fabric Health Check tool.

Configuration changes are best found by using the Fast Fabric Health Check tool. For more information, see “Health checking” on page 135.

Note: If you have multiple primary fabric management servers, you must run the health check on each primary server, because Fast Fabric can only access subnets to which its server is attached. You might consider using the Cluster Systems Management/Management Server (CSM/MS) to remotely process this function on all primary fabric management servers.

At the end of the installation process, take a baseline health check so that the current configuration can later be compared with a known good configuration. For more information, see “Reestablishing a health check baseline” on page 207. Comparison results reveal configuration changes.

After performing a current health check by using the all_analysis command, go to the analysis directory at /var/opt/iba/analysis/latest, and look for files ending in .diff. To determine what information is contained within each file, use the Fast Fabric Toolset Users Guide. This information helps you to determine what has changed.
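Looking for the .diff files can be scripted on the fabric management server. The following sketch only lists the files; it does not interpret them. The function name list_health_diffs is hypothetical, and the default directory is the analysis directory named above.

```shell
# Hypothetical helper: list the .diff files left by the latest health check.
# $1 optionally overrides the analysis directory.
list_health_diffs() {
    dir=${1:-/var/opt/iba/analysis/latest}
    for f in "$dir"/*.diff; do
        # The glob stays literal when nothing matches, so test each entry.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: run after all_analysis completes, then review each listed file.
# list_health_diffs
```

An empty result means that no .diff files were found in the directory. From the CSM/MS, the same check could be run on each primary fabric management server through remote command processing.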

If nothing should have changed, change the configuration back to the original configuration.

If the configuration changes are valid, take a new baseline by reestablishing a health check baseline.

Related concepts

“Remotely accessing QLogic management tools and commands from CSM/MS” on page 143
The remote processing of QLogic management tools from Cluster Systems Management (CSM) can be an important addition to the management infrastructure. It effectively integrates the QLogic management environment with the IBM management environment.

Related tasks

“Reestablishing a health check baseline” on page 207
After changing the fabric configuration, use this procedure to reestablish a health check baseline.

Tips: Using iba_report

The iba_report function helps you to monitor the cluster fabric resources.

Under most monitoring circumstances, you can rely on health checks as described in “Health checking” on page 135. However, you might want to do some advanced monitoring by using the iba_report command on the fabric management server.


Some suggested parameters are in the following table. You can use these parameters with the iba_report command to get detailed information. Some examples of uses for the iba_report command follow the table. This information is not meant to provide complete coverage of the iba_report command. Instead, it provides a few examples to illustrate how the iba_report command might be used for detailed monitoring of cluster fabric resources. For more information, see the QLogic Fast Fabric Users Guide.

Table 69. Suggested iba_report parameters

| Parameter | Description |
|---|---|
| -d 10 | This parameter provides extra detail that you would not see at the default detail level of 2. You may find it useful to experiment with the detail level when you develop a query. Quite often -d 5 is the most detail that you can extract from a given command. |
| -s | This parameter includes statistical counters in the report. |
| -i [seconds] | This parameter causes a query to statistical counters after waiting the number of seconds specified in the parameter. Quite often this is used along with -C to clear the counters. This parameter implies the -s parameter. |
| -F [focus info] | You can focus the iba_report command on a single resource or group of resources that match the filter described in the focus info value. See the Fast Fabric Users Guide for details on the many different filters that you can use, such as the following filters: portguid; nodeguid; nodepat (for patterns to search for). |
| -h [hca] and -p [port] | Used with each other, these parameters point the tool to do the query on a specific subnet connected to the indicated HCA and port on the fabric management server. The default is the first port on the first host channel adapter (HCA). |
| -o slowlinks | This parameter looks for links that are slower than expected. |
| -o errors | This parameter looks for links that exceed the allowed error threshold. See the Fast Fabric Users Guide for details about error thresholds. Note: The LinkDown counter in the IBM GX/GX+ HCAs is reset as soon as the link goes down. This process is part of the recovery procedure. While this is not optimal, the connected switch port's LinkDown counter provides an accurate count of the number of LinkDown occurrences for the link. |
| -o misconnlinks | This parameter shows a summary of links connected with mismatched speed. |
| -o links | This parameter shows a summary of links, including to what they are connected. |

Note: The iba_report command is run on a subnet basis. If you want to gather data from all subnetsthat are attached to a fabric management server, a typical technique is to use nested loops to address thesubnets through the appropriate HCAs and ports to reach all subnets. For example:


for h in 1 2; do for p in 1 2; do iba_report -h $h -p $p -o errors -F "nodepat:SilverStorm*"; done; done
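The nested-loop technique can be generalized so that the HCA and port counts are arguments rather than fixed lists. The following is a sketch only: the function name report_all_subnets is hypothetical, it assumes that every HCA in the fabric management server has the same number of ports, and it relies on the -h and -p parameters described in Table 69 to select the subnet.

```shell
# Hypothetical helper: run one iba_report query on every subnet that is
# reachable from this fabric management server.
# $1 = number of HCAs, $2 = ports per HCA (assumed equal on all HCAs),
# remaining arguments = iba_report options to apply on each subnet.
report_all_subnets() {
    hcas=$1
    ports=$2
    shift 2
    h=1
    while [ "$h" -le "$hcas" ]; do
        p=1
        while [ "$p" -le "$ports" ]; do
            # Label each section so the output can be traced to a subnet.
            echo "== HCA $h port $p =="
            # -h and -p select the local HCA and port, and therefore the subnet.
            iba_report -h "$h" -p "$p" "$@"
            p=$((p + 1))
        done
        h=$((h + 1))
    done
}

# Example: report_all_subnets 2 2 -o errors -F "nodepat:SilverStorm*"
```

The section labels are a design choice for readability; remove the echo line if the output is parsed by another script.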

Examples:

All the following examples simply query over the first port of the first HCA in the fabric management server. You need to use -p and -h to direct the commands over a particular HCA port to reach the appropriate subnet.

iba_report -o comps -d 10 -i 10 -F portguid:0x0002550070011a00

The command gets the comps report 10 seconds after clearing the counters for the portguid 0x0002550070011a00. The -d parameter being set to 10 gives enough detail to include the port traffic counter statistics. You might use this command to watch the traffic out of a particular HCA. In this case, the portguid address is an IBM GX++ HCA. See “Mapping of IBM HCA GUIDs to physical HCAs” on page 162 for commands that can help you determine HCA GUIDs. In this case, the GUID of concern is associated with a specific port of the HCA. While the HCA tracks most of the prescribed counters, it does not have counters for transmit packets or receive packets.

iba_report -o route -D nodeguid:<destination NodeGUID> -S nodeguid:<source NodeGUID>

The previous command queries the state of the routes from one node on the fabric to another.

Note: Node is used in the sense of a node on the fabric, not in the sense of a logical partition or a server. To find the node GUIDs, see “Mapping of IBM HCA GUIDs to physical HCAs” on page 162. Instead of doing as instructed and grepping for only the first 7 bytes of a node GUID, consider recording all 8 bytes. You can use the iba_stat -n command for HCAs in AIX logical partitions and the ibv_devinfo -v command for HCAs in Linux logical partitions.

If you have a particular logical partition for which you want to determine routes, you could use the port GUID instead:

iba_report -o route -D portguid:<destination portGUID> -S nodeguid:<port NodeGUID>

iba_report -d 5 -s -o nodes -F 'nodepat:IBM*Switch*'

The previous query gets node information with enough details to also get the port counters. The focus is on any IBM logical switch, which is the basis for the IBM GX HCAs. This query matches any generation of IBM GX HCA that happens to be in the cluster.

Note: While the HCA tracks most of the prescribed counters, it does not have counters for transmit packets or receive packets.

To clear all the port statistics on all switch chassis, use the following command:

iba_report -C -o none

The query returns nothing, but all the port statistics on all switch chassis are cleared.

Servicing clusters

Servicing a cluster requires an understanding of how problems are reported, who is responsible for addressing service problems, and the procedures used to fix the problems.

Additional information about servicing your cluster environment can be found in the “Cluster information resources” on page 3.

Getting started with servicing clusters

There are many things to consider when servicing clusters.


Responsibilities for servicing clusters

Servicing the cluster requires the coordinated efforts of IBM service representatives, customers, and the switch vendor.

The responsibility for servicing a cluster is dependent on the parts being serviced. The following information shows the general responsibilities for servicing the cluster:
• IBM service representatives are responsible for servicing IBM parts that are not customer replaceable units (CRUs).
• The customer is responsible for servicing IBM CRUs.
• The customer or the vendor is responsible for servicing vendor switches and cables, unless otherwise contracted.

Fault reporting mechanisms

Problems with the cluster can be identified through several mechanisms that are part of the management subsystem.

Faults (problems) can be surfaced through the fault reporting mechanisms found in the following table.

Table 70. Fault reporting mechanisms

| Reporting mechanism | Description |
|---|---|
| Cluster Systems Management (CSM) event management fabric log | The CSM event management fabric log is used to monitor and consolidate Fabric Manager and switch error logs in one location. This log is located on the Cluster Systems Management/Management Server (CSM/MS) in the following file: /var/log/csm/errorlog/[CSM/MS hostname] |
| CSM audit log | This log is part of the standard event management function. It is accessed by using the lsevent command. It is a summary point for Reliable Scalable Cluster Technology (RSCT) and CSM event management. It can help point to activity in the /var/log/csm/errorlog file and serviceable events on the Hardware Management Console (HMC). |
| Hardware light emitting diodes (LEDs) | The switches and host channel adapters (HCAs) have LEDs. |
| Manage serviceable events task | This task is the standard reporting mechanism for IBM Power Systems servers that are managed by HMCs. |
| Chassis viewer LED | This user interface runs on the switch and is accessible from a Web browser. It provides virtual LEDs that represent the switch hardware LEDs. |
| Fast Fabric Toolset | The Fast Fabric Toolset reports fabric problems in two ways. The first is from a report output. The other is in a health check output. |
| Customer reported problem | This is any problem that the customer reports without using any of the reporting mechanisms. |
| Fabric viewer | This user interface provides a view into current fabric status. |

The following logs typically do not have to be accessed when remote logging and CSM Event Management are enabled. However, sometimes they must be captured for debugging purposes.

| Reporting mechanism | Description |
|---|---|
| Fabric notices log on CSM/MS | This intermediate log is where notice or higher severity log entries from switches and subnet managers are received through syslogd on the CSM/MS. This log is located on the CSM/MS in the following file: /var/log/csm/errorlog/syslogd.fabric.notices. This log is a pipe on a Linux CSM/MS and cannot be viewed normally. Reading from the pipe causes event management to lose events. |
| Information log on CSM/MS | This log is an optional intermediate log where info or higher severity log entries from switches and subnet managers are received through syslogd on the CSM/MS. This log is located on the CSM/MS in the following file: /var/log/csm/errorlog/syslogd.fabric.info |
| Switch log | This log includes any errors reported by the chassis manager (for example, internal switch chassis problems such as power and cooling, or logic errors). This log is accessed through the switch command-line interface (CLI) or Fast Fabric tools. |
| /var/log/messages on fabric management server | This log is the syslog on the fabric management server where host-based subnet manager logs reside. This is the log for the entire fabric management server; therefore, there might be entries in it from components other than the subnet manager. |

Related concepts

“Management subsystem function overview” on page 14
This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function.

“Vendor log flow to CSM event management” on page 26
The integration of vendor and IBM log flows is a critical factor in event management.

“Monitoring fabric logs from CSM/MS” on page 134
You can set up the Cluster Systems Management/Management Server (CSM/MS) to automatically monitor for problems.

Fault diagnosis approach

Diagnosing problems can be accomplished in multiple ways.

Several methods can be used for fault diagnosis on your cluster environment. The following fault diagnosis methods are intended to supplement the information in the “Symptoms of problems” on page 154.

Types of fabric events:

Use this information to find out about the most common events that affect the fabric and how these events might be reported and interpreted.

Fabric problems can be categorized in the following types:


Link problemsLink problems can be reported through remote logging to the Cluster SystemsManagement/Management Server (CSM/MS) in the /var/log/errorlog/CSM/MS hostname file bythe subnet manager. Without remote logging, you must query the subnet manager log directly.v If a single link is failing, this method isolates the problem to a switch port, the other side (host

channel adapter (HCA) or another switch port), and a cable.v If multiple links are failing, a pattern might be discernible that directs you to a common field

replaceable unit (FRU), such as an HCA, a switch leaf card, or a switch spine.

Internal failureThe internal failure of a switch spine or leaf card manifests as either multiple link failures, or lossof communication between the device and the management module. Internal failures are reportedthrough remote logging to the CSM/MS in the /var/log/errorlog/CSM/MS hostname file. Withoutremote logging, you must query the switch log.

Redundant switch FRURedundant switch field replaceable unit (FRU) failures are reported through the syslog and intothe CSM event management subsystem. The syslog indicates the failing FRU. For switches, thisincludes power supplies, fans, and management modules. Redundant switch FRU failures arereported through remote logging to the CSM/MS in the /var/log/errorlog/CSM/MS hostname file.Without remote logging, you must query the switch log.

User inducedUser-induced link failure events are caused by a person unplugging a cable for a repair, poweringoff a switch or server, or restarting a server. Any link event must first be correlated to any useractions that might be the root cause. The user-induced event might not be reported anywhere. Ifa cable is unplugged, it is not reported. If a server is restarted or powered off, the server logsrecord the event. The link failure caused by the user is reported through remote logging to theCSM/MS in the /var/log/errorlog/CSM/MS hostname file. Without remote logging, you mustquery the subnet manager log.

HCA failuresHost channel adapter (HCA) failures are reported to the Manage serviceable events task on theHardware Management Console (HMC) that is managing the system and is forwarded to CSMservice monitoring. Any link event must first be correlated to any existing HCA failures thatmight be the root cause. The link event caused by the user is reported through remote logging tothe CSM/MS in the /var/log/errorlog/CSM/MS hostname file. Without remote logging, you mustquery the subnet manager log.

Server failuresServer failures are reported to the Manage serviceable events task on the HMC that is managingthe system and is forwarded to CSM service monitoring. Any link event must first be correlatedto any existing server failures that might be the root cause.

Performance problemsPerformance problems are typically reported by users. Unless one of the previous failurescenarios is identified as the root cause, you need to check the health of the fabric to eitheridentify an unreported problem, or to positively verify that the fabric is in good health. Althoughperformance problems can be complex and require remote support, some initial diagnosis can beperformed by using the procedure in “Diagnosing performance problems” on page 187.

Application crashesApplication crashes are typically reported by users. There are many causes for application crashesthat are outside the scope of this information. However, some initial diagnosis can be performedby using the procedure in “Diagnosing application crashes” on page 188.

Configuration changesConfiguration changes are typically reported by Fast Fabric Health Check. Configuration changes

152 Power Systems: Clustering with high-performance computing by using InfiniBand hardware

Page 163: Power Systems: Clustering with high-performance computing by … · Power Systems: Clustering with high-performance computing by using InfiniBand hardware

can be caused by many things; some are benign and some indicate a real problem. For moredetails, see “Diagnosing configuration changes” on page 178.

Examples of configuration changes are shown in the following list:
• Inadvertently moving a cable or swapping components
• Replacing a part with one that has a different serial number
• Leaving a device powered off
• Causing a device to be unreachable because of a link failure
• Changing the firmware level

Isolating link problems:

Use this information to isolate InfiniBand fabric problems.

When you are isolating InfiniBand fabric problems, you need to check log entries that are a few minutes before and after the event you are diagnosing to determine whether these events are associated with it and which of the entries might be the root cause.

The general InfiniBand isolation flow follows. For a detailed procedure, see “Diagnosing link errors” on page 175.
1. Within a few minutes before or after an event, see how many other events are reported.
2. If there are multiple link errors, first check for a common source. This can be complex if nonassociated errors are reported at about the same time. For example, if a host channel adapter (HCA) fails, and a switch link fails that is not connected to the HCA, you must be careful to not associate the two events.
   a. Map all link errors so that you can determine which switch devices and which HCAs are involved. You might have to map HCA GUIDs to physical HCAs and to the servers in which they are populated so you can check for adapter errors. To check for the errors, look for serviceable events on the Hardware Management Console (HMC) that might have caused link errors. For mapping of HCAs, see “Mapping of IBM HCA GUIDs to physical HCAs” on page 162.
   b. Look for a switch internal error in the /var/log/csm/errorlog/[CSM/MS hostname] file. This file contains possible serviceable events from all the Fabric Manager and switch logs in the cluster.
   c. Look for an internal error on an HCA on the HMC. This error might bring a link down.
   d. Look for a server check stop on the HMC. This check stop might bring a link down.
   e. Map all internal errors to associated links by completing the following steps:
      1) If there is a switch internal error, determine the association based on whether the error is isolated to a particular port or leaf card, or to the spine.
      2) If there is an adapter error or server check stop, determine the switch links to which they are associated.
   f. If there are no HCA or server events reported on the HMC, and you know there was nothing restarted that could have caused the event, and the link errors span more than one HCA, then the problem is likely to be in the switch.
   g. If neighboring links on the same HCA are failing, it could be the HCA that is faulty. Links on IBM HCAs are in pairs. If the HCA card has four links, then T1 and T2 are a pair and T3 and T4 are a pair.
3. If there are link problems, you might have to isolate the problem by using cable swapping techniques to see how errors follow the cables. This might affect another link that is good. If you swap cables, you might see errors reported against the links on which you are operating.
4. After making repairs, complete the procedure in “Verifying link FRU replacements” on page 207.
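The first steps of the isolation flow (collect the events reported within a few minutes of the event being diagnosed, then look for a common source) can be sketched in Python. This is only an illustration under stated assumptions: the event tuples, the five-minute window, and the function names events_near, group_by_device, and paired_link are hypothetical and are not part of any IBM or QLogic tool.

```python
from datetime import datetime, timedelta


def events_near(events, target_time, window_minutes=5):
    """Return the events whose timestamp is within +/- window_minutes of target_time.

    Each event is assumed to be a (timestamp, device, message) tuple.
    """
    window = timedelta(minutes=window_minutes)
    return [event for event in events if abs(event[0] - target_time) <= window]


def group_by_device(events):
    """Group events by the reporting device to help spot a common source."""
    groups = {}
    for timestamp, device, message in events:
        groups.setdefault(device, []).append((timestamp, message))
    return groups


def paired_link(port):
    """Return the partner of an HCA link, assuming pairs T1/T2 and T3/T4."""
    number = int(port[1:])
    partner = number + 1 if number % 2 == 1 else number - 1
    return f"T{partner}"


# Hypothetical sample data: two events close together and one an hour later.
sample_events = [
    (datetime(2008, 3, 19, 10, 0), "switch1", "link down"),
    (datetime(2008, 3, 19, 10, 3), "node1-hca", "link down"),
    (datetime(2008, 3, 19, 11, 0), "switch2", "link down"),
]
candidates = events_near(sample_events, datetime(2008, 3, 19, 10, 1))
by_device = group_by_device(candidates)
```

Grouping by device is only a first pass. As noted in the flow, events that are close in time are not necessarily associated, so each group still has to be checked against the physical topology.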


Related concepts

“Managing serviceable events on the HMC” on page 23

Scenarios: Restarting or powering on:

Use this information to learn about the impact of rebooting the server and power-on scenarios on the fabric. These scenarios pose a potential problem in masking a real failure.

When you restart your server, it might ignore all link errors around the time of the restart action. Any unassociated link failures must occur again before the problem is recognized. To avoid this problem, use the procedure in “Restarting the cluster” on page 208 or in “Restarting or powering off an IBM system” on page 209.

Network Time Protocol:

Use this information to learn about the importance of configuring Network Time Protocol (NTP) on the service and cluster virtual local area networks (VLANs).

Fabric diagnosis is dependent on Network Time Protocol (NTP) service for all devices in the cluster. NTP provides the correct correlation of events based on time. Without NTP, timestamps can vary significantly and cause difficulty in associating events.

Symptoms of problems
Use the symptom tables to diagnose problems that are reported against the fabric.

A separate table is shown for each reporting mechanism, in which the symptom is cross-referenced to an isolation procedure.

The following table is used for events reported in the Cluster Systems Management/Management Server (CSM/MS) Fabric Event Management Log (/var/log/csm/errorlog/[CSM/MS hostname] on the CSM/MS). The CSM audit log might point to that file. Furthermore, it is a reflection of switch logs and subnet manager logs, so this table could be used for switch logs and subnet manager logs as well.

For details on how to interpret the logs, see “Interpreting switch log formats from vendors” on page 172.

Before performing procedures in any of these tables, familiarize yourself with the information provided in “Servicing clusters” on page 149, which provides general information about diagnosing problems as well as about the service subsystem.

Table 71. CSM/MS Fabric Event Management log symptoms

Symptom | Procedure or reference

Switch chassis management logs (the log has the CHASSIS: string in the entry)
Switch chassis log entry | See the Switch Users Manual and contact QLogic.

Subnet manager logs (the log has the SM: string in the entry)
Link down | See “Diagnosing link errors” on page 175.
Link integrity or symbol errors on host channel adapter (HCA) or switch ports | See “Diagnosing link errors” on page 175.
Switch disappears | See the Switch Users Guide and contact switch service provider.
Switch port disappears | See “Diagnosing link errors” on page 175.
Logical switch disappears | See “Diagnosing link errors” on page 175.
Logical HCA disappears | See “Diagnosing link errors” on page 175.
Fabric initialization errors on an HCA or switch port | See “Diagnosing link errors” on page 175.
Fabric initialization errors on a switch | See the Switch Users Manual and contact the switch service provider. Then use “Diagnosing and repairing switch component problems” on page 178.
Security errors on switch or HCA ports | Contact your next level of support. If anything is done to change the hardware or software configuration for the fabric, use “Reestablishing a health check baseline” on page 207.
Other exceptions on switch or HCA ports | Contact your next level of support. If anything is done to change the hardware or software configuration for the fabric, use “Reestablishing a health check baseline” on page 207.
Events where the subnet manager (SM) is the node responsible for the problem | First check for problems on the switch or the server on which the subnet manager is running. If there are no problems there, contact QLogic. If anything is done to change the hardware or software configuration for the fabric, use “Reestablishing a health check baseline” on page 207.
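Table 71 distinguishes the two log sources by a marker string in each entry. The following Python sketch shows that triage step only. The function name classify_entry and the sample log lines are hypothetical; real entries follow the vendor formats described in “Interpreting switch log formats from vendors”.

```python
def classify_entry(line):
    """Classify a Fabric Event Management log entry by its marker string.

    Only the CHASSIS: and SM: markers named in Table 71 are recognized.
    """
    if "CHASSIS:" in line:
        return "switch chassis management log"
    if "SM:" in line:
        return "subnet manager log"
    return "unclassified"


# Hypothetical entries, for illustration only.
sample_lines = [
    "Mar 19 10:00:01 switch1 CHASSIS: fan tray event",
    "Mar 19 10:00:05 fm01 SM: link down on port 3",
    "Mar 19 10:00:09 fm01 unrelated message",
]
kinds = [classify_entry(line) for line in sample_lines]
```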

The following table is used for any symptoms observed by using hardware light emitting diodes (LEDs) on HCAs and switches. These include switch LEDs that are virtualized in the Chassis Viewer.

Table 72. Hardware or Chassis Viewer LEDs symptoms

Symptom | Procedure or reference
LED is not lit on switch port | See “Diagnosing link errors” on page 175.
LED is not lit on HCA port | See “Diagnosing link errors” on page 175.
Red LED that is not on a switch port or HCA | See the Switch Users Guide and the QLogic Troubleshooting Guide. Then use “Diagnosing and repairing switch component problems” on page 178.
Other switch LED conditions on nonport LEDs | See the Switch Users Guide and the QLogic Troubleshooting Guide. Then use “Diagnosing and repairing switch component problems” on page 178.
Other HCA LED conditions | See the IBM systems service information. Then use “Diagnosing and repairing IBM system problems” on page 178.

The following is a table of symptoms for problems that are reported by Fast Fabric tools. Health check files are found by default on the fabric management server in the /var/opt/iba/analysis/[baseline|latest|<savedate>] directories. See the Fast Fabric Toolset Users Guide for details.


Table 73. Fast Fabric Tools symptoms

Symptom | Procedure or reference
Health check file: fabric*link.errors | Record the location of the errors and see “Diagnosing link errors” on page 175.
Health check file: fabric*comps.errors | 1. Record the location of the errors. 2. See the Fast Fabric Toolset Users Guide for details. 3. If this file refers to a port, see “Diagnosing link errors” on page 175. Otherwise, see “Diagnosing and repairing switch component problems” on page 178.
Health check file: chassis*.errors | 1. Record the location of the errors. 2. See the Fast Fabric Toolset Users Guide for details. 3. If a switch component is repaired, see “Diagnosing and repairing switch component problems” on page 178.
Health check file: fabric.*.links.diff (speed or width change indicated) | Record the location of the change and see “Diagnosing link errors” on page 175.
Health check file indicates configuration change: fabric.*.diff, chassis*.diff, esm*.diff, hostsm*.diff file | 1. Record the location of the changes. 2. See the Fast Fabric Toolset Users Guide for details. 3. If the change is expected, perform “Reestablishing a health check baseline” on page 207. 4. If the change is not expected, perform “Diagnosing configuration changes” on page 178.
Health check file indicates firmware change: chassis*.diff, esm*.diff, hostsm*.diff file | 1. Record the location of the changes. 2. See the Fast Fabric Toolset Users Guide for details. 3. If the change is expected, perform “Reestablishing a health check baseline” on page 207. 4. If the change is not expected, perform “Updating code” on page 145.
Health check *.stderr file | This file indicates a problem with health checking. Check the link to the subnet. Check the cluster virtual local area network (VLAN) for problems. Use “Capturing problem data for Fabric Manager and Fast Fabric software” on page 161. Contact your next level of support for QLogic software problems.
Error reported on a link from health check or iba_report | See “Diagnosing link errors” on page 175.
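The file-name patterns in Table 73 can be matched mechanically to decide which procedure to start with. The following Python sketch assumes that the patterns behave like shell wildcards; the function name procedure_for, the rule order, and the sample file names are hypothetical. A *.diff file still has to be read to tell a configuration change from a firmware change.

```python
from fnmatch import fnmatchcase

# Patterns are taken from Table 73. Order matters: the more specific
# fabric.*.links.diff pattern must be tested before fabric.*.diff.
RULES = [
    ("fabric.*.links.diff", "Diagnosing link errors"),
    ("fabric*link.errors", "Diagnosing link errors"),
    ("fabric*comps.errors", "Diagnosing link errors or switch component problems"),
    ("chassis*.errors", "Diagnosing and repairing switch component problems"),
    ("fabric.*.diff", "Configuration change"),
    ("chassis*.diff", "Configuration or firmware change"),
    ("esm*.diff", "Configuration or firmware change"),
    ("hostsm*.diff", "Configuration or firmware change"),
    ("*.stderr", "Problem with health checking"),
]


def procedure_for(filename):
    """Return the first matching Table 73 category, or None if nothing matches."""
    for pattern, procedure in RULES:
        if fnmatchcase(filename, pattern):
            return procedure
    return None
```

For example, a hypothetical file named fabric.a.links.diff is routed to the link error procedure rather than to the generic configuration change row.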

The following table is used for symptoms found in the Manage serviceable events task on the HMC, or found by CSM monitoring.

Table 74. Manage serviceable events

Symptom | Procedure reference
Any code | Use the IBM system service information. Then use “Diagnosing and repairing IBM system problems” on page 178.

The following table is used for any symptoms reported outside of the previous reporting mechanisms.

Table 75. Other symptoms

Symptom | Procedure or reference
Performance problem reported | See “Diagnosing performance problems” on page 187.
Application crashes (relative to the fabric) | See “Diagnosing application crashes” on page 188.
Management Subsystem problems (including unreported errors) | See “Diagnosing management subsystem problems” on page 189.
HCA preventing a logical partition from activating | See Recovering from an HCA preventing a logical partition from activating.
Ping problems | See “Diagnosing and recovering ping problems” on page 188.
Not running at wanted 4 KB maximum transfer unit (MTU) | See “Recovering to 4 KB maximum transfer units in the AIX operating system” on page 201.
Bad return codes or software failure indicators for Fabric Manager or Fast Fabric Software | Check the link to the switch. See “Capturing problem data for Fabric Manager and Fast Fabric software” on page 161. Contact your next level of support for QLogic software problems.

Finding the appropriate service procedure
Use this information to find service procedures to address specific tasks.

The following table lists the common service procedures. Use this table if you have a particular type of service task in mind. These service procedures refer to service procedures and information in other documents; however, considerations that are unique to clusters are highlighted in these procedures.

If you are trying to diagnose a symptom, begin with the “Symptoms of problems” on page 154 before proceeding with this table.

Table 76. Service procedures

Task | Procedure

Special procedures
Restarting the cluster | “Restarting the cluster” on page 208
Restarting or powering off an IBM system | “Restarting or powering off an IBM system” on page 209
Getting debug data from switches and subnet managers | “Capturing data for fabric diagnosis” on page 159
Using the script command while collecting switch information | “Capturing switch CLI output” on page 161
Mapping fabric devices to physical locations | “Mapping fabric devices” on page 162
Setting an already installed cluster to run at 4 KB maximum transfer unit (MTU) |
Counting the number of fabric devices | “Counting devices” on page 210
Preparing for smoother handling of emergency power off (EPO) situations | “Handling emergency power off situations” on page 213
Setting up Cluster Systems Management (CSM) Event Management for the fabric again | “Reconfiguring the CSM event management” on page 196

Monitoring procedures
Best practice for monitoring the fabric | “Monitoring fabric logs from CSM/MS” on page 134
General monitoring for problems | “Monitoring the fabric for problems” on page 134

Diagnostic procedures
How faults are reported | “Fault reporting mechanisms” on page 150
Diagnosing symptoms | “Symptoms of problems” on page 154
Capturing data for fabric diagnosis | “Capturing data for fabric diagnosis” on page 159
Capturing data for Fabric Manager or Fast Fabric software problem | “Capturing problem data for Fabric Manager and Fast Fabric software” on page 161
Mapping devices from reports to physical devices | “Mapping fabric devices” on page 162
Interpreting the switch vendor log formats | “Interpreting switch log formats from vendors” on page 172
Diagnosing link errors | “Diagnosing link errors” on page 175
Diagnosing switch internal problems | “Diagnosing and repairing switch component problems” on page 178
Diagnosing IBM system problems | “Diagnosing and repairing IBM system problems” on page 178
Diagnosing configuration changes from health check | “Diagnosing configuration changes” on page 178
Diagnosing performance problems | “Diagnosing performance problems” on page 187
Diagnosing application crashes | “Diagnosing application crashes” on page 188
Look for swapped host channel adapter (HCA) ports | “Diagnosing swapped HCA ports” on page 186
Look for swapped ports on switches | “Diagnosing swapped switch ports” on page 187
Diagnosing management subsystem problems | “Diagnosing management subsystem problems” on page 189
Ping problems | “Diagnosing and recovering ping problems” on page 188

Repair procedures
Recovering from an HCA preventing a logical partition from activating | Recovering from an HCA preventing a logical partition from activating
Repairing IBM systems | “Diagnosing and repairing IBM system problems” on page 178
Ping problems | “Diagnosing and recovering ping problems” on page 188
Recovering ibX interfaces | “Recovering ibx interfaces” on page 199
Not running at wanted 4 KB MTU | “Recovering to 4 KB maximum transfer units in the AIX operating system” on page 201
Reestablishing a health check baseline | “Reestablishing a health check baseline” on page 207

Verify procedures
Verifying link field replaceable unit (FRU) replacements | “Verifying link FRU replacements” on page 207
Verifying other repairs | “Verifying repairs and configuration changes” on page 207
Verifying configuration changes | “Verifying repairs and configuration changes” on page 207

Capturing data for fabric diagnosis
Use this procedure to collect data that the support team might require to diagnose fabric problems.

The information that you collect can result in a large amount of data. If you want to collect a more targeted set of data, see the various unit and application user's guides and service guides for information about how to do that.

This procedure captures data from:
• Vendor fabric management applications and vendor switches
• Information from IBM systems to reflect the state of the HCAs

A key application for capturing data for most fabric diagnosis activities is the Fast Fabric Toolset. For more information, see the Fast Fabric Toolset Users Guide.

Use the Fast Fabric captureall command to gather:
• Subnet manager data
• Switch chassis data

Pay close attention to how the command-line parameters change the devices from which data is collected.

Because all the fabric management servers and switches are connected to the same service VLAN, it is possible to collect all the pertinent data from a single fabric management server, which was designated while planning the fabric management servers. See “Planning for the fabric management server” on page 47 and “QLogic Fabric Management work sheets” on page 72.

The previous references also explain and record the configuration files that are required to access the fabric management servers and switches that have the wanted data. In particular, you need to understand the role of hosts and chassis files that list various groupings of fabric management servers and switches.

If you are performing data collection while logged on to the Cluster Systems Management/Management Server (CSM/MS), perform the following procedure:
1. You must first have passwordless ssh set up between the fabric management server and all the other fabric management servers and also between the fabric management server and the switches. Otherwise, you are prompted for passwords, and the dsh command does not work.
2. Log on to the CSM/MS.
3. Get data from the fabric management servers by using: captureall -f <hosts file with fabric management servers>
   dsh -d <primary fabric management server> captureall -f <hosts file with fabric management servers>
   • Various hosts files must have been configured, which can help you target subsets of fabric management servers if you do not require information from all the fabric management servers.
   • By default, the results go into the ./uploads directory, which is below the current working directory. For remote command processing, this directory will be the root directory for the user, which is most often root. This directory could be something similar to /uploads or /home/root/uploads; it depends on the user setup on the fabric management server. This directory is referenced as captureall_dir.
4. Get data from the switches by using: captureall -F <chassis file with switches listed>
   dsh -d <primary fabric management server> captureall -F <chassis file with switches>
   • Various chassis files can be configured, which can help you target subsets of switches if you do not require information from all the switches.
   • By default, the results will be copied to the ./uploads directory, which is below the current working directory. For a remote command, this can be the root directory for the user, which is most often root. This directory could be something similar to /uploads or /home/root/uploads, depending on the user setup on the fabric management server. This directory is referenced as captureall_dir.
5. Copy data from the primary data collection on the Fabric Management Server to the CSM/MS:
   a. Make a directory on the CSM/MS to store the data. This directory is used for IBM systems data. For the remainder of this procedure, the directory is referred to as captureDir_onCSM.
   b. Run the following command:
      dcp -d c captureall_dir captureDir_onCSM
6. Copy health check results from the Primary Fabric Management Server to the CSM/MS. You can copy over the baseline health check and the latest. It is also advisable to copy over any recent health check results that contain failures.
   a. Make a baseline directory on the CSM/MS: mkdir captureDir_onCSM/baseline
   b. Run the command:
      dcp -d captureDir_onCSM/baseline /var/opt/iba/analysis/baseline captureDir_onCSM/baseline
   c. Make the latest directory on the CSM/MS: mkdir captureDir_onCSM/latest
   d. Run the command:
      dcp -d captureDir_onCSM/baseline /var/opt/iba/analysis/latest captureDir_onCSM/latest
   e. Make a directory for the failed health check runs: mkdir captureDir_onCSM/hc_fails
   f. To get all failed directories, use the following dcp command. If you want to be more targeted, simply copy the directories that have the wanted failure data. The *-*:* will pick up the directories with timestamps for names. If you have a date in mind, you could use something similar to: 2008-03-19* for March 19, 2008.
      dcp -d <captureDir_onCSM>/baseline /var/opt/iba/analysis/*-*:* <captureDir_onCSM>/hc_fails
7. Get HCA information from the IBM systems for the applicable operating system:
   a. For AIX, use the following commands:
      1) lsdev | grep ib
      2) lscfg | grep ib
      3) netstat -i | egrep "ib|ml0"
      4) ifconfig -a
      5) ibstat -v
   b. For Linux, use the following commands:
      1) lspci | grep ib
      2) netstat -i | egrep "ib|ml0"
      3) ibv_devinfo -v
8. Gather all the files and directories by using the following variable: captureDir_onCSM
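Step 6f relies on the wildcard *-*:* matching only the health check directories that are named with timestamps, and not the baseline and latest directories. The following Python sketch illustrates that selection with shell-style matching. The directory names in the sample list and the function name select_failed_runs are hypothetical; the actual timestamp layout is whatever the health check tool writes.

```python
from fnmatch import fnmatchcase


def select_failed_runs(names, pattern="*-*:*"):
    """Return the directory names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical contents of /var/opt/iba/analysis, for illustration only.
analysis_dirs = [
    "baseline",
    "latest",
    "2008-03-19-10:15:30",
    "2008-03-20-08:00:00",
]

all_failed = select_failed_runs(analysis_dirs)
march_19_only = select_failed_runs(analysis_dirs, "2008-03-19*")
```

Because baseline and latest contain neither a hyphen nor a colon, the default pattern leaves them out, which is why they are copied in separate steps.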

Collect subnet manager and switch chassis data:

If you want to collect subnet manager and switch chassis data while logged on to the fabric management server, you can issue the captureall command directly on that server:

1. Log on to the fabric management server.
2. Obtain data from fabric management servers by using the following command:
   captureall -f <hosts file with fabric management servers>
   Various hosts files can be configured, which can help you target subsets of fabric management servers.
3. Obtain data from the switches by using the following command:
   captureall -F <chassis file with switches listed>
   Various chassis files should have been configured, which can help you target subsets of switches.
   By default, data is captured to files in the ./uploads directory under the current directory when you run the command.
4. Obtain health check data from:
   • Baseline health check: /var/opt/iba/analysis/baseline
   • Latest health check: /var/opt/iba/analysis/latest
   • Failed health check runs: /var/opt/iba/analysis/timestamp

Capturing switch CLI output:

You can collect data directly from a switch command-line interface (CLI) by using a script command.

If you are directed to collect data directly from a switch command-line interface, typically you would capture the output by using the script command, which is available on both the Linux and AIX operating systems. The script command captures the standard output (stdout) from the Telnet or ssh session with the switch and places it into a file.

Note: You can use some terminal emulation utilities to capture the terminal session into a log file. This log file might be an acceptable alternative to using the script command.

To use the script command, perform the following steps:
1. On the host from which you will log in to the switch, enter the following command:
   script /<dir>/switchname.capture.timestamp
   • Select a directory into which to store the data.
   • It is helpful to have the name of the switch in the output filename.
   • It is helpful to put a timestamp into the output filename to differentiate it from other data collected from the same switch. If you use the following format, you are able to sort the files easily:
     <4-digit year><2-digit month><2-digit day>_<2-digit hour><2-digit minute>
2. Use Telnet or ssh to access the command-line interface on the switch. For details, see the Switch Users Guide.
3. Run the command to get the data that is being requested.
4. Exit from the switch.
5. To stop the script command from collecting more data, press CTRL+D.
6. Send the output file to the appropriate support team.
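The sortable timestamp format recommended in step 1 can be generated instead of typed by hand. The following Python sketch builds an output filename of the form used with the script command. The function name capture_filename and the sample directory and switch name are hypothetical.

```python
from datetime import datetime


def capture_filename(directory, switch_name, when):
    """Build <dir>/<switchname>.capture.<YYYYMMDD_HHMM> for the script command."""
    stamp = when.strftime("%Y%m%d_%H%M")
    return f"{directory.rstrip('/')}/{switch_name}.capture.{stamp}"


# Hypothetical directory and switch name, for illustration only.
example = capture_filename("/tmp/captures", "switch1", datetime(2008, 3, 19, 14, 5))
```

Because the year, month, day, hour, and minute appear in that order with fixed widths, an alphabetical listing of the files for one switch is also a chronological listing.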

Capturing problem data for Fabric Manager and Fast Fabric software
If there is a suspected problem with the Fabric Manager or Fast Fabric software, you can use the iba_capture command to capture data for debugging purposes.

The iba_capture command is documented in the Fast Fabric Toolset Users Guide.

Indications of possible software problems include:
• Bad return codes
• Commands that hang
• An output file from health check similar to: *.stderr

Note: Always check the switch link between the fabric management server and the subnet before concluding that you have a software problem. Some commands do not check that the interface is available.

Mapping fabric devices
Use this information to learn how to map from a description or device name, or other logical naming convention to a physical location of an HCA or a switch.

The mapping of switch devices is largely done by how they are named at installation and configuration time. The switch chassis parameter for mapping is the InfiniBand device name. A good practice is to create names that are relative to the frame and rack in which they are populated so that it is easy to cross-reference globally unique IDs (GUIDs) to physical locations. If this mapping is not done correctly, it can be difficult to isolate root causes when there are associated events being reported at the same time. For more information, see “Planning for QLogic InfiniBand switch configurations” on page 32 and “Installing and configuring vendor InfiniBand switches” on page 116.

Note: If it is possible to name an OEM GX/GX+ HCA by using the IBNodeDescriptor, it is advisable to do so in a manner that allows you to easily determine the server and slot in which the HCA is populated.

Naming of IBM GX/GX+ HCA devices by using the IBNodeDescriptor is not possible. Therefore, the user must manually map the globally unique ID (GUID) for the HCA to a physical HCA. To do this mapping, you must understand the way GUIDs are formatted in the operating system and how they are formatted by vendor logs. While they all indicate 8 bytes of GUID, they have different formats, as illustrated in the following table:

Table 77. GUID Formats

Format | Example | Where used
Dotted decimal | 00.02.55.00.00.0f.13.00 | AIX
Hexadecimal string | 0x00066A0007000BBE | QLogic logs
2 byte, colon delimited | 0002:5500:000f:3500 | Linux
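Because the three formats in Table 77 carry the same 8 bytes, a GUID copied from one source can be rewritten in another format before searching logs or command output. The following Python sketch shows one way to do that conversion. The function names are hypothetical, and the sketch assumes that the digits in all three formats are hexadecimal, as the examples in the table suggest.

```python
import string


def normalize_guid(text):
    """Reduce any of the three Table 77 formats to 16 lowercase hex digits."""
    value = text.strip().lower()
    if value.startswith("0x"):
        value = value[2:]
    value = value.replace(".", "").replace(":", "")
    if len(value) != 16 or any(ch not in string.hexdigits for ch in value):
        raise ValueError(f"not an 8-byte GUID: {text!r}")
    return value


def to_aix(text):
    """Format as byte pairs separated by dots (AIX style)."""
    digits = normalize_guid(text)
    return ".".join(digits[i:i + 2] for i in range(0, 16, 2))


def to_qlogic(text):
    """Format as a 0x-prefixed uppercase string (QLogic log style)."""
    return "0x" + normalize_guid(text).upper()


def to_linux(text):
    """Format as 2-byte groups separated by colons (Linux style)."""
    digits = normalize_guid(text)
    return ":".join(digits[i:i + 4] for i in range(0, 16, 4))
```

For example, to_linux("00.02.55.00.00.0f.13.00") gives 0002:5500:000f:1300, which can then be searched for in Linux command output.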

If you need to isolate both sides of a link by using a known device from a log or health check result, use one of the procedures listed in Table 78.

Table 78. Isolating link ports based on known information

Known information | Procedure
Logical switch is known | “Finding devices based on a known logical switch” on page 165
Logical HCA is known | “Finding devices based on a known logical HCA” on page 166
Physical switch port is known | “Finding devices based on a known physical switch port” on page 168
The ibx interface is known | “Finding devices based on a known ib interface (ibx/ehcax)” on page 170
General mapping from HCA GUIDs to physical HCAs | “Mapping of IBM HCA GUIDs to physical HCAs”

Mapping of IBM HCA GUIDs to physical HCAs:

To map IBM host channel adapter (HCA) globally unique IDs (GUIDs) to physical HCAs, you must first understand the GUID assignments based on the design of the IBM GX and GX+ HCA.

For information about the structure of an IBM HCA, see “IBM GX or GX+ host channel adapters” on page 9.

Notice that with the HCA structure, IBM HCA node GUIDs are relative to the entire HCA. These node GUIDs always end in 00, for example, 00.02.55.00.00.0f.13.00. The final 00 changes for each port on the HCA.

Note: If at all possible, during installation, it is advisable to issue a query to all servers to gather the HCA GUIDs ahead of time. If the query has been done, you might then simply query a file for the wanted HCA GUID. A method to do this query is documented in “Installing the fabric management server” on page 91.

There is an HCA port for each physical port, which maps to one of the logical switch ports. There is also an HCA port for each logical HCA assigned to a logical partition. Therefore, IBM HCA port GUIDs are shown as:
[7 bytes of node GUID][1 byte of port ID]

Examples of Port GUIDs are:
• 00.02.55.00.00.0f.13.01
• 00.02.55.00.00.0f.13.81
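The port GUID layout, 7 bytes of node GUID followed by 1 byte of port ID, can be expressed as a pair of small helper functions. The following Python sketch works on the AIX dotted format; the function names port_guid and split_port_guid are hypothetical.

```python
def port_guid(node_guid, port_id):
    """Combine the first 7 bytes of a node GUID with a 1-byte port ID."""
    parts = node_guid.split(".")
    if len(parts) != 8:
        raise ValueError(f"expected 8 dotted bytes: {node_guid!r}")
    if not 0 <= port_id <= 0xFF:
        raise ValueError(f"port ID must fit in 1 byte: {port_id!r}")
    return ".".join(parts[:7] + [f"{port_id:02x}"])


def split_port_guid(guid):
    """Split a port GUID into (7-byte node prefix, port ID as an integer)."""
    parts = guid.split(".")
    if len(parts) != 8:
        raise ValueError(f"expected 8 dotted bytes: {guid!r}")
    return ".".join(parts[:7]), int(parts[7], 16)
```

With the node GUID 00.02.55.00.00.0f.13.00, port IDs 0x01 and 0x81 reproduce the two example port GUIDs shown above.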

Because there are so many HCAs in a cluster, it is best to get a map of the HCA GUIDs to the physical HCAs and store it in a file or print it. If you do not store the map, you must look it up each time you need it, by using the AIX or Linux operating system.

The best way to map the HCA GUIDs to the physical HCAs is by using operating system commands to gather HCA information. You can do this by using the dsh command to all servers simultaneously. The commands used depend on the operating system in the logical partition.

Gathering HCA information by using AIX: In the AIX operating system, the following commands are used to query for port and node GUIDs from an AIX logical partition:
• ibstat -n returns overall node information.
  – ibstat -n | grep GUID returns the base GUID for the HCA. You can use this command to map the other GUID information, because the last byte is the one that varies based on ports and logical HCAs. The first 7 bytes are common across ports and logical HCAs.
• ibstat -p returns port information.
  – ibstat -p | egrep "GUID|PORT" returns the port number and the GUIDs associated with that port.

Note: It can take up to a minute for the commands to return information.

To use CSM to get all HCA GUIDs in AIX logical partitions, use one of the following command strings, depending on your situation:

To receive mapping information if all your servers are running the AIX operating system, issue the following command:
dsh -av 'ibstat -n | grep GUID'

node1: Globally Unique ID (GUID): 00.02.55.00.00.0f.13.00
node2: Globally Unique ID (GUID): 00.02.55.00.00.0b.f8.00

To receive mapping information for just the AIX logical partitions in a mixed operating system environment, issue the following command:
dsh -v -N AIXNodes 'ibstat -n | grep GUID'


The output provides enough information to map any HCA GUID to a node or system. For example, the logical switch port 1 of an HCA might have a final byte of 01. So, the node1, port 1 GUID would be 00.02.55.00.00.0f.13.01.
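As a rough sketch of how the dsh output could be turned into a stored map, the following Python fragment parses lines in the format shown in the example output. The function names guid_map_from_dsh and find_node are hypothetical, and the line layout is assumed to match the sample above exactly.

```python
import re
from typing import Dict, Optional

# Matches lines such as:
#   node1: Globally Unique ID (GUID): 00.02.55.00.00.0f.13.00
_GUID_LINE = re.compile(r"^(?P<node>[^:\s]+):.*\(GUID\):\s*(?P<guid>[0-9a-fA-F.]+)\s*$")


def guid_map_from_dsh(output: str) -> Dict[str, str]:
    """Build a {base GUID: node name} map from 'ibstat -n | grep GUID' dsh output."""
    mapping = {}
    for line in output.splitlines():
        match = _GUID_LINE.match(line.strip())
        if match:
            mapping[match.group("guid").lower()] = match.group("node")
    return mapping


def find_node(mapping: Dict[str, str], port_guid: str) -> Optional[str]:
    """Look up a node by comparing the first 7 bytes of a port GUID."""
    prefix = ".".join(port_guid.lower().split(".")[:7])
    for base_guid, node in mapping.items():
        if base_guid.startswith(prefix):
            return node
    return None


if __name__ == "__main__":
    sample = (
        "node1: Globally Unique ID (GUID): 00.02.55.00.00.0f.13.00\n"
        "node2: Globally Unique ID (GUID): 00.02.55.00.00.0b.f8.00\n"
    )
    print(find_node(guid_map_from_dsh(sample), "00.02.55.00.00.0f.13.01"))
```

The map could then be written to a file so that later lookups do not require another query to the servers.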

If you do not have a stored map of the HCA GUIDs, but you have a GUID for which you want to search, use the following command for AIX logical partitions. Using the first 7 bytes of the GUID allows for a match to be made when you do not have the port GUID information available from the ibstat -n command.

    dsh -av 'ibstat -n | grep GUID | grep "[1st seven bytes of GUID]"'

The output provides enough information to identify the physical HCA and port with which you are working.

After you know the server in which the HCA is populated, you can issue the ibstat -p command to the server to find out which HCA matches the GUID that you have.

Gathering HCA information by using Linux:

The following commands are used to query port and node GUIDs from a Linux logical partition:
• ibv_devinfo -v returns attributes of the HCAs and their ports.
  – ibv_devinfo -v | grep "node_guid" returns the node GUID.
  – ibv_devinfo -v | egrep "GID|port:" returns GIDs for ports. The first 8 bytes are a GID mask, and the second 8 bytes are the port GUID.
• ibv_devinfo -l returns the list of HCA resources for the logical partition.
• ibv_devinfo -d [HCA resource] returns the attributes of the HCA given in [HCA resource]. The HCA resource names are returned by ibv_devinfo -l.
• ibv_devinfo -i [port number] returns attributes for a specific port.
• man ibv_devinfo provides more details on ibv_devinfo.

To use CSM to get all HCA GUIDs in Linux logical partitions, use one of the following command strings, depending on your situation:

To receive mapping information if all of your servers are running the Linux operating system, issue the following command:

    > dsh -av '/usr/bin/ibv_devinfo -n | grep "node_guid"'

    node1: node_guid: 0002:5500:1002:5800
    node2: node_guid: 0002:5500:100b:f800

To receive mapping information for just Linux logical partitions when you are running mixed operating systems, issue the following command:

    dsh -v -N LinuxNodes '/usr/bin/ibv_devinfo -n | grep "node_guid"'

If you do not have a stored map of the HCA GUIDs, but you have a GUID for which you want to search, use the following command for Linux logical partitions. Using the first 7 bytes of the GUID allows for a match to be made when you do not have the port GUID information available from the ibv_devinfo -v command.

    > dsh -av '/usr/bin/ibv_devinfo -n | grep "node_guid" | grep "[1st seven bytes of GUID]"'

The output provides enough information to identify the physical HCA and port with which you are working.

After you know the server in which the HCA is populated, you can issue an ibv_devinfo -i [port number] command to the server to determine which HCA matches the GUID that you have.


Finding devices based on a known logical switch:

Use this procedure if the logical switch in a host channel adapter (HCA) is known, and the attached switch and physical HCA port must be determined.

This procedure applies to IBM GX HCAs.

Note: Some of the steps in this procedure to query an HCA device are specific to the AIX or Linux operating system. For AIX, the adapter is called ibax, where x is 0 - 3. For Linux, the adapter is called ehcax, where x is 0 - 3.

For example, a log entry similar to the following example is reported with the logical switch port being reported. Here, the logical switch information is in the NODE field. Note the node type. To the InfiniBand fabric, the logical switch on the HCA appears as a switch.

    Apr 15 09:25:23 c924hsm.ppd.pok.ibm.com local6:notice c924hsm iview_sm[26012]: c924hsm; MSG:NOTICE|SM:c924hsm:port 1|COND:#4 Disappearance from fabric|NODE:IBM G2 Logical Switch 1:port 0:0x00025500103a7202|DETAIL:Node type: switch

To find the physical switch connection and node, and the HCA port and location, complete the following steps. The previous log is used as an example, and example results from any queries also are provided.
1. Obtain the logical switch GUID and record which logical switch it is in the HCA. The GUID is 0x00025500103a7202, logical switch number 1.
2. Log on to the fabric management server.
3. Find the logical switch GUID. This query returns the logical switch side of a link as the first port of the link and the physical switch port as the second port in the link.
   a. If the baseline health check has been run, use the following command. If it has not been run, go to step 3b.
          grep -A 1 "0g *[GUID]" /var/opt/iba/analysis/baseline/fabric*links
   b. If the baseline health check has not been run, query the live fabric by using the following command:
          iba_report -o links | grep -A 1 "0g *[GUID]"
   Example results:
          > grep -A 1 "0g *0x00025500103a7202" /var/opt/iba/analysis/baseline/fabric*links

          20g 0x00025500103a7202 1 SW IBM G2 Logical Switch 1
          <-> 0x00066a00d90003d3 3 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d3
   The physical switch port is in the last line of the results of the query.
4. Obtain the name and port for the switch. The name indicates where the switch is physically located.
          <-> [switch GUID] [port] SW [switch name/IBnodeDescription]
   Example results: Port 3 on switch SilverStorm 9024 DDR has GUID of 0x00066a00d90003d3
   This switch has not been renamed and is using the default naming convention, which includes the switch model and GUID.
5. Log on to the CSM Management Server.
6. Find the server and HCA port location:
   Note: If you have a map of HCA GUIDs to server locations, use that to find in which server the HCA is located, and skip step 6a.
   a. Convert the logical switch GUID to operating system format, which drops the 0x, and uses a dot or colon to delimit bytes:
      • For AIX, a dot delimits each byte: 0x00025500103a7202 becomes 00.02.55.00.10.3a.72.02
      • For Linux, a colon delimits 2 bytes: 0x00025500103a7202 becomes 0002:5500:103a:7202


   b. Drop the last byte from the GUID (00.02.55.00.10.3a.72 for the AIX operating system or 0002:5500:103a:72 for the Linux operating system).
   c. Run the following command to find the server and adapter number for the HCA.
      • For the AIX operating system, use the following command:
            dsh -v -N AIXNodes 'ibstat -p | grep -p "[1st seven bytes of GUID]" | grep iba'
        Example results:
            > dsh -v -N AIXNodes 'ibstat -p | grep -p "00.02.55.00.10.3a.72" | grep iba'

            c924f1ec10.ppd.pok.ibm.com: IB PORT 1 INFORMATION (iba0)
            c924f1ec10.ppd.pok.ibm.com: IB PORT 2 INFORMATION (iba0)
      • For Linux, use the following command:
            dsh -v -N LinuxNodes 'ibv_devinfo | grep -B1 "[1st seven bytes of GUID]" | grep ehca'
        Example results:
            > dsh -v -N LinuxNodes 'ibv_devinfo | grep -B1 "0002:5500:103a:72" | grep ehca'

            hca_id: ehca0
   d. The server is in the first field and the adapter number is in the last field (c924f1ec10.ppd.pok.ibm.com and iba0 in AIX, or ehca0 in Linux).
   e. To find the physical location of the logical switch port, use the logical switch number and the iba device found previously with the information in the following procedure: “IBM GX HCA physical port mapping based on device number” on page 172.
      Example results: iba0/ehca0 and logical switch 1 map to C65-T1

Therefore, for c924f1ec10, C65-T1 is attached to port 3 of SilverStorm 9024 DDR with a GUID of 0x00066a00d90003d3.
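The GUID conversion in step 6 can be sketched in Python as follows. This is a minimal illustration, assuming the subnet manager reports a 16-digit hexadecimal GUID with an optional 0x prefix; the function names are hypothetical.

```python
def _hex_digits(sm_guid: str) -> str:
    """Return the 16 hex digits of a subnet manager GUID, without the 0x prefix."""
    digits = sm_guid.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("not a 64-bit hexadecimal GUID: %r" % sm_guid)
    return digits


def to_aix(sm_guid: str) -> str:
    """AIX format: a dot delimits each byte."""
    d = _hex_digits(sm_guid)
    return ".".join(d[i:i + 2] for i in range(0, 16, 2))


def to_linux(sm_guid: str) -> str:
    """Linux format: a colon delimits every 2 bytes."""
    d = _hex_digits(sm_guid)
    return ":".join(d[i:i + 4] for i in range(0, 16, 4))


def first_seven_bytes(os_guid: str) -> str:
    """Drop the last byte (2 hex digits) and any trailing dot delimiter."""
    return os_guid[:-2].rstrip(".")


if __name__ == "__main__":
    guid = "0x00025500103a7202"
    print(first_seven_bytes(to_aix(guid)))
    print(first_seven_bytes(to_linux(guid)))
```

The two printed strings are the search patterns used with grep in the AIX and Linux commands of step 6c.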

Related concepts

“IBM GX or GX+ host channel adapters” on page 9
The IBM GX or GX+ host channel adapter (HCA) provides server connectivity to InfiniBand fabrics.

Finding devices based on a known logical HCA:

Use this procedure if the logical host channel adapter (HCA) is known, and the attached switch and physical HCA port must be determined.

This procedure applies to IBM GX HCAs.

Note: Some of the steps in this procedure regarding querying an HCA device are specific to the AIX or Linux operating system. For the AIX operating system, the adapter is called ibax, where x is 0 - 3. For the Linux operating system, the adapter is called ehcax, where x is 0 - 3.

For example, a log entry similar to the following example is reported with the logical HCA being reported. Here, the logical HCA information is in the NODE field. The node type is an HCA.

    Apr 15 09:25:23 c924hsm.ppd.pok.ibm.com local6:notice c924hsm iview_sm[26012]: c924hsm; MSG:NOTICE|SM:c924hsm:port 1|COND:#4 Disappearance from fabric|NODE:IBM G2 Logical HCA :port 1:0x00025500103a7200|DETAIL:Node type: hca

To find the physical switch connection and node, and the HCA port and location, complete the following steps. The previous log is used as an example, and additional example results from queries are provided.
1. Obtain the logical HCA GUID and record which logical HCA it is in the HCA. Also note the port. In this example, the GUID is 0x00025500103a7200 and the port is 1.
2. Log on to the fabric management server.


3. Obtain the logical HCA GUID and port. This query returns the logical HCA side of a link as the first port of the link and the logical switch port as the second port in the link.
   a. If the baseline health check has been run, use the following command. If it has not been run, go to step 3b.
          grep -A 1 "0g *[GUID] *[port]" /var/opt/iba/analysis/baseline/fabric*links
   b. If the baseline health check has not been run, query the live fabric by using the following command:
          iba_report -o links | grep -A 1 "0g *[GUID] *[port]"
   Example results:
          > grep -A 1 "0g *0x00025500103a7200 *1" /var/opt/iba/analysis/baseline/fabric*links

          60g 0x00025500103a7200 1 CA IBM G2 Logical HCA
          <-> 0x00025500103a7202 2 SW IBM G2 Logical Switch 1
4. Obtain the name for the logical switch. The logical switch port is in the last line of the results of the query. This information can be used to find out which logical switch attaches to the physical switch port.
          <-> [logical switch GUID] [port] SW [logical switch name/IBnodeDescription]
   Example results: Logical Switch 1
5. Find the logical switch GUID. This query returns the logical switch side of a link as the first port of the link and the physical switch port as the second port in the link.
   a. If the baseline health check has been run, use the following command. If it has not been run, go to step 5b.
          grep -A 1 "0g *[GUID]" /var/opt/iba/analysis/baseline/fabric*links
   b. If the baseline health check has not been run, query the live fabric by using the following command:
          iba_report -o links | grep -A 1 "0g *[GUID]"
   Example results:
          20g 0x00025500103a7202 1 SW IBM G2 Logical Switch 1
          <-> 0x00066a00d90003d3 3 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d3
6. Obtain the name and port for the switch. The physical switch port is in the last line of the results of the query. The name indicates where the switch is physically located.
          <-> [switch GUID] [port] SW [switch name/IBnodeDescription]
7. Find the physical switch connection. In this example, the connection is port 3 on switch SilverStorm 9024 DDR, which has a GUID of 0x00066a00d90003d3. This switch has not been renamed and is using the default naming convention, which includes the switch model and GUID.
8. Log on to the CSM Management Server.
9. Find the server and HCA port location.
   Note: If you have a map of HCA GUIDs to server locations, use that to find in which server the HCA is located, and skip step 9a.
   a. Convert the logical switch GUID to operating system format, which drops the 0x and uses a dot or colon to delimit bytes:
      • For AIX, a dot delimits each byte: 0x00025500103a7202 becomes 00.02.55.00.10.3a.72.02
      • For Linux, a colon delimits 2 bytes: 0x00025500103a7202 becomes 0002:5500:103a:7202
   b. Drop the last byte from the GUID (00.02.55.00.10.3a.72 for AIX or 0002:5500:103a:72 for Linux).
   c. Run the following command to find the server and adapter number for the HCA.
      • For AIX, use the following command:
            dsh -v -N AIXNodes 'ibstat -p | grep -p "[1st seven bytes of GUID]" | grep iba'


        Example results:
            > dsh -v -N AIXNodes 'ibstat -p | grep -p "00.02.55.00.10.3a.72" | grep iba'

            c924f1ec10.ppd.pok.ibm.com: IB PORT 1 INFORMATION (iba0)
            c924f1ec10.ppd.pok.ibm.com: IB PORT 2 INFORMATION (iba0)
      • For Linux, use the following command:
            dsh -v -N LinuxNodes 'ibv_devinfo | grep -B1 "[1st seven bytes of GUID]" | grep ehca'
        Example results:
            > dsh -v -N LinuxNodes 'ibv_devinfo | grep -B1 "0002:5500:103a:72" | grep ehca'

            hca_id: ehca0
      The server is in the first field and the adapter number is in the last field (c924f1ec10.ppd.pok.ibm.com and iba0 in AIX, or ehca0 in Linux).
   d. To find the physical location of the logical switch port, use the logical switch number and iba device found previously by using the following procedure: “IBM GX HCA physical port mapping based on device number” on page 172.
      Example results: iba0/ehca0 and logical switch 1 map to C65-T1

Therefore, c924f1ec10: C65-T1 is attached to port 3 of SilverStorm 9024 DDR with a GUID of 0x00066a00d90003d3.
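The two-line link records returned by the grep queries in this procedure can be picked apart programmatically. The following Python sketch assumes the column layout seen in the example results (rate or "<->", GUID, port, node type, then the node description); real iba_report output might carry additional columns, so treat the pattern as an assumption. The function name parse_link is hypothetical.

```python
import re
from typing import Dict, List

# Layout assumed from the example results in this section:
#   60g 0x00025500103a7200 1 CA IBM G2 Logical HCA
#   <-> 0x00025500103a7202 2 SW IBM G2 Logical Switch 1
_PORT_LINE = re.compile(r"^\s*(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\S+)\s+(.+?)\s*$")


def parse_link(record: str) -> List[Dict[str, object]]:
    """Return the two ends of a link as dictionaries (first port, second port)."""
    ports = []
    for line in record.splitlines():
        match = _PORT_LINE.match(line)
        if match:
            ports.append({
                "guid": match.group(2).lower(),
                "port": int(match.group(3)),
                "type": match.group(4),
                "name": match.group(5),
            })
    if len(ports) != 2:
        raise ValueError("expected exactly two port lines, found %d" % len(ports))
    return ports


if __name__ == "__main__":
    sample = (
        "20g 0x00025500103a7202 1 SW IBM G2 Logical Switch 1\n"
        "<-> 0x00066a00d90003d3 3 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d3\n"
    )
    first, second = parse_link(sample)
    print("port", second["port"], "on", second["name"])
```

The second dictionary corresponds to the "<->" line, which is where the procedure reads the switch port and name.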

Related concepts

“IBM GX or GX+ host channel adapters” on page 9
The IBM GX or GX+ host channel adapter (HCA) provides server connectivity to InfiniBand fabrics.

Finding devices based on a known physical switch port:

Use this procedure if the physical switch port is known, and the attached physical HCA port must be determined.

This procedure applies to IBM GX HCAs.

Note: Some of the steps in this procedure are specific to the AIX or Linux operating system. These steps query the HCA device from the operating system. For AIX, the adapter is called ibax, where x is 0 - 3. For Linux, the adapter is called ehcax, where x is 0 - 3.

For example, a log entry similar to the following example is reported with the physical switch port being reported. Here, the physical switch information is in the NODE field. The node type is a switch.

    Apr 15 09:25:23 c924hsm.ppd.pok.ibm.com local6:notice c924hsm iview_sm[26012]: c924hsm; MSG:NOTICE|SM:c924hsm:port 1|COND:#4 Disappearance from fabric|NODE:SWSilverStorm 9024 DDR GUID=0x00066a00d90003d3 :port 11:0x00066a00d90003d3|DETAIL:Node type: switch

The format of the switch “node” is [name]:[port]:[GUID].
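As an illustration of the [name]:[port]:[GUID] layout, the following Python sketch extracts the NODE field from a log entry and splits it. The function names are hypothetical, and the sketch assumes that fields are separated by the | character, as in the example entries in this section.

```python
from typing import Optional, Tuple


def node_field(entry: str) -> Optional[str]:
    """Return the text after 'NODE:' in a pipe-delimited log entry, if present."""
    for part in entry.split("|"):
        part = part.strip()
        if part.startswith("NODE:"):
            return part[len("NODE:"):]
    return None


def split_node_field(node: str) -> Tuple[str, int, str]:
    """Split '[name]:[port]:[GUID]' from the right, so colons in the name survive."""
    name, port, guid = node.rsplit(":", 2)
    port_number = int(port.replace("port", "").strip())
    return name.strip(), port_number, guid.strip()


if __name__ == "__main__":
    entry = ("MSG:NOTICE|SM:c924hsm:port 1|COND:#4 Disappearance from fabric|"
             "NODE:IBM G2 Logical HCA :port 1:0x00025500103a7200|DETAIL:Node type: hca")
    print(split_node_field(node_field(entry)))
```

Splitting from the right is a design choice for this sketch: the port and GUID are always the last two components, while the node description is free-form text.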

The following procedure finds the physical switch connection and node and the HCA port and location. The previous log is used as an example, and example results from any queries are also provided.
1. Get the switch GUID and port. For example, a GUID of 0x00066a00d90003d3 and port of 11.
2. Log on to the fabric management server.


3. Find the logical switch name. This query returns the switch side of a link as the second port of the link and the logical switch port as the first port in the link. Because the switch GUID and port are on the second line of the link, the query uses grep -B 1 to also return the line before the match.
   a. If the baseline health check has been run, use the following command. If it has not been run, go to step 3b.
          grep -B 1 "> *[switch GUID] *[switch port]" /var/opt/iba/analysis/baseline/fabric*links
   b. If the baseline health check has not been run, query the live fabric by using the following command:
          iba_report -o links | grep -B 1 "> *[switch GUID] *[switch port]"
   Example results:
          > grep -B 1 "> *0x00066a00d90003d3 *11" /var/opt/iba/analysis/baseline/fabric*links

          20g 0x00025500103a6602 1 SW IBM G2 Logical Switch 1
          <-> 0x00066a00d90003d3 11 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d3
4. Obtain the name for the logical switch. The logical switch is in the second to last line of the results of the query. This name tells you which logical switch attaches to the physical switch port.
          <-> [logical switch GUID] [port] SW [logical switch name/IBnodeDescription]
   Example results: Logical Switch 1
5. Log on to the CSM Management Server.
6. Find the server and HCA port location:
   Note: If you have a map of HCA GUIDs to server locations, use that to find in which server the HCA is located, and skip step 6a.
   a. Convert the logical switch GUID to operating system format, which drops the 0x and uses a dot or colon to delimit bytes:
      • For the AIX operating system, a dot delimits each byte: 0x00025500103a7202 becomes 00.02.55.00.10.3a.72.02
      • For the Linux operating system, a colon delimits 2 bytes: 0x00025500103a7202 becomes 0002:5500:103a:7202
   b. Drop the last byte from the GUID (00.02.55.00.10.3a.72 for AIX or 0002:5500:103a:72 for Linux).
   c. Run the following command to find the server and adapter number for the HCA.
      • For AIX, use the following command:
            dsh -v -N AIXNodes 'ibstat -p | grep -p "[1st seven bytes of GUID]" | grep iba'
        Example results:
            > dsh -v -N AIXNodes 'ibstat -p | grep -p "00.02.55.00.10.3a.72" | grep iba'

            c924f1ec10.ppd.pok.ibm.com: IB PORT 1 INFORMATION (iba0)
            c924f1ec10.ppd.pok.ibm.com: IB PORT 2 INFORMATION (iba0)
      • For the Linux operating system, use the following command:
            dsh -v -N LinuxNodes 'ibv_devinfo | grep -B1 "[1st seven bytes of GUID]" | grep ehca'
        Example results:
            > dsh -v -N LinuxNodes 'ibv_devinfo | grep -B1 "0002:5500:103a:72" | grep ehca'

            hca_id: ehca0
      The server is in the first field and the adapter number is in the last field (c924f1ec10.ppd.pok.ibm.com and iba0 in AIX, or ehca0 in Linux).


   d. To find the physical location of the logical switch port, use the logical switch number and the iba device found previously by using the following procedure: “IBM GX HCA physical port mapping based on device number” on page 172.
      Example results: iba0/ehca0 and logical switch 1 map to C65-T1

Therefore, c924f1ec10: C65-T1 is attached to port 11 of SilverStorm 9024 DDR with a GUID of 0x00066a00d90003d3.

Related concepts

“IBM GX or GX+ host channel adapters” on page 9
The IBM GX or GX+ host channel adapter (HCA) provides server connectivity to InfiniBand fabrics.

Finding devices based on a known ib interface (ibx/ehcax):

Use this procedure if the InfiniBand (ib) interface number is known, and the physical HCA port and attached physical switch port must be determined.

This procedure applies to IBM GX HCAs.

Note: Some of the steps in this procedure are specific to the operating system type (AIX or Linux). These steps query the HCA device from the operating system. For AIX, the adapter is called ibax, where x is 0 - 3. For Linux, the adapter is called ehcax, where x is 0 - 3.

For example, if there is a problem with ib0, use the following procedure to determine the physical HCA port and physical switch port associated with the problem.
1. Record the ib interface number and server, for example, ib1 on c924f1ec09.
2. Log on to the server with the ib interface that is of interest.
3. From netstat, obtain the logical HCA GUID associated with the ib interface:
   • For the AIX operating system, use netstat -I [ib interface]. You need to add leading zeros to bytes that are returned with single digits. You need the last 8 bytes of the address.
     Example results:
         > netstat -I ib1

         Name Mtu   Network   Address                                       Ipkts Ierrs Opkts Oerrs Coll
         ib1  65532 link#3    0.0.0.b.fe.80.0.0.0.0.0.1.0.2.55.0.10.24.d9.1 65    0     7     0     0
         ib1  65532 192.168.9 192.168.9.65                                  65    0     7     0     0
     Therefore, a GUID of 0.2.55.0.10.24.d9.1 equals 00.02.55.00.10.24.d9.01.
   • For the Linux operating system, use ifconfig [ib interface].
     Example results:
         > ifconfig ib0 | grep inet6

         inet6 addr: fe80::202:5500:1024:d900/64 Scope:Link
     Therefore, a GUID of 02:5500:1024:d900 equals 0002:5500:1024:d900 after you have added the leading zeros.
4. Obtain the adapter device by performing the following actions:
   • For the AIX operating system, use the following command:
         ibstat -p | grep -p "[1st seven bytes of GUID]" | grep iba
     Example results:
         > ibstat -p | grep -p "00.02.55.00.10.24.d9" | grep iba

         IB PORT 1 INFORMATION (iba0)
         IB PORT 2 INFORMATION (iba0)
     The device is iba0.
   • For the Linux operating system, use the following command:
         ibv_devinfo | grep -B1 "[1st seven bytes of GUID]" | grep ehca
     Example results:
         > ibv_devinfo | grep -B1 "02:5500:1024:d9" | grep ehca

         hca_id: ehca0
     The device is ehca0.
5. Find the logical switch that is associated with the logical HCA for the interface.
6. Log on to the fabric management server.
7. Translate the operating system representation of the logical HCA GUID to the subnet manager representation of the GUID.
   a. For GUIDs reported by AIX, delete the dots: 00.02.55.00.10.24.d9.00 becomes 000255001024d900
   b. For GUIDs reported by Linux, delete the colons: 0002:5500:1024:d900 becomes 000255001024d900
8. Find the logical HCA GUID connection to the logical switch:
   a. If the baseline health check has been run, use the following command. If it has not been run, go to step 8b.
          grep -A 1 "0g *[GUID] *[port]" /var/opt/iba/analysis/baseline/fabric*links
   b. If the baseline health check has not been run, query the live fabric by using the following command:
          iba_report -o links | grep -A 1 "0g *[GUID] *[port]"
   Example results:
          > grep -A 1 "0g *0x000255001024d900 *1" /var/opt/iba/analysis/baseline/fabric*links

          60g 0x000255001024d900 1 CA IBM G2 Logical HCA
          <-> 0x000255001024d902 2 SW IBM G2 Logical Switch 1
   The logical switch port is in the last line of the results of the query. This line tells you which logical switch attaches to the physical switch port.
9. Record the logical switch GUID and name. The last line of the query results has the following format:
          <-> [logical switch GUID] [port] SW [logical switch name/IBnodeDescription]
   Example results: Logical switch 1; logical switch GUID = 0x000255001024d902
10. To find the physical location of the logical switch port, use the logical switch number and iba device found previously with the procedure: “IBM GX HCA physical port mapping based on device number” on page 172.
    Example results: iba0/ehca0 and logical switch 1 map to C65-T1
11. Find the physical switch connection to the logical switch:
    a. If the baseline health check has been run, use the following command. If it has not been run, go to step 11b.
           grep -A 1 "0g *[GUID]" /var/opt/iba/analysis/baseline/fabric*links
    b. If the baseline health check has not been run, query the live fabric by using the following command:
           iba_report -o links | grep -A 1 "0g *[GUID]"


    Example results:
           > grep -A 1 "0g *0x000255001024d902" /var/opt/iba/analysis/baseline/fabric*links

           20g 0x000255001024d902 1 SW IBM G2 Logical Switch 1
           <-> 0x00066a00d90003d3 3 SW SilverStorm 9024 DDR GUID=0x00066a00d90003d3
12. Obtain the name and port for the switch. The physical switch port is in the last line of the results of the query. The name might have been chosen to indicate where the switch is physically located.
           <-> [switch GUID] [port] SW [switch name/IBnodeDescription]
    Example results: Port 3 on switch SilverStorm 9024 DDR with a GUID of 0x00066a00d90003d3
    This switch has not been renamed and is using the default naming convention, which includes the switch model and GUID.

Therefore, for ib0 in the server, the C65-T1 HCA port is attached to port 3 of SilverStorm 9024 DDR with a GUID of 0x00066a00d90003d3.
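The zero padding in step 3 and the delimiter removal in step 7 are easy to get wrong by hand. The following Python sketch shows both transformations; the function names are hypothetical, and the input is assumed to be the dotted link-level address from the AIX netstat -I output.

```python
def guid_from_aix_link_address(address: str) -> str:
    """Take the last 8 bytes of the netstat link address and pad each to 2 digits."""
    octets = address.strip().split(".")
    if len(octets) < 8:
        raise ValueError("address has fewer than 8 bytes: %r" % address)
    return ".".join(octet.zfill(2) for octet in octets[-8:]).lower()


def to_sm_digits(os_guid: str) -> str:
    """Delete the dots (AIX) or colons (Linux) from an operating system GUID."""
    return os_guid.replace(".", "").replace(":", "").lower()


if __name__ == "__main__":
    address = "0.0.0.b.fe.80.0.0.0.0.0.1.0.2.55.0.10.24.d9.1"
    guid = guid_from_aix_link_address(address)
    print(guid)
    print(to_sm_digits(guid))
```

The resulting digit string is what the grep queries against the links report match on, after the 0x prefix used in the example results.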

Related concepts

“IBM GX or GX+ host channel adapters” on page 9
The IBM GX or GX+ host channel adapter (HCA) provides server connectivity to InfiniBand fabrics.

IBM GX HCA physical port mapping based on device number

Use the following table to find the IBM GX HCA physical port based on the iba device and logical switch number.

Table 79. IBM GX HCA physical port mapping from iba device and logical switch

Device (iba)   Logical switch   9125-F2A   8203-EA4   8204-EA8
iba0/ehca0     1                C65-T1     Cx-T1      Cx-T1
iba0/ehca0     2                C65-T2     Cx-T2      Cx-T2
iba1/ehca1     1                C65-T3
iba1/ehca1     2                C65-T4
iba2/ehca2     1                C66-T1
iba2/ehca2     2                C66-T2
iba3/ehca3     1                C66-T3
iba3/ehca3     2                C66-T4
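For scripted lookups, Table 79 can be expressed as a small dictionary. The following Python sketch is only an illustration: the names PORT_MAP and physical_port are hypothetical, and the sketch assumes that the 8203-EA4 and 8204-EA8 columns have entries only for iba0/ehca0, as the table suggests.

```python
import re
from typing import Optional

# Table 79 keyed by (device number, logical switch number).
_9125_F2A = {
    (0, 1): "C65-T1", (0, 2): "C65-T2",
    (1, 1): "C65-T3", (1, 2): "C65-T4",
    (2, 1): "C66-T1", (2, 2): "C66-T2",
    (3, 1): "C66-T3", (3, 2): "C66-T4",
}
_EA = {(0, 1): "Cx-T1", (0, 2): "Cx-T2"}  # assumption: only iba0/ehca0 is listed

PORT_MAP = {"9125-F2A": _9125_F2A, "8203-EA4": _EA, "8204-EA8": _EA}


def physical_port(model: str, device: str, logical_switch: int) -> Optional[str]:
    """Return the physical port location, or None if the table has no entry."""
    match = re.fullmatch(r"(?:iba|ehca)(\d)", device)
    if not match:
        raise ValueError("device must look like iba0 or ehca0: %r" % device)
    return PORT_MAP.get(model, {}).get((int(match.group(1)), logical_switch))


if __name__ == "__main__":
    print(physical_port("9125-F2A", "iba0", 1))
```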

Related concepts

“IBM GX or GX+ host channel adapters” on page 9
The IBM GX or GX+ host channel adapter (HCA) provides server connectivity to InfiniBand fabrics.

Interpreting switch log formats from vendors

Several methods are available to help you interpret switch logs from vendor companies.

Log severities:

Use this information to find the severity levels of logs used by QLogic switches and subnet managers.

The severities in the following table are standard syslog priority levels. Priority is the term that is used to refer to severity in a syslog entry.


Table 80. QLogic log severities

Severity: Error
Significance:
• Need immediate action, for example:
  – Have severity level above Information, Notice, and Warning
  – Logged to CSM event management
Examples:
• The voltage level is outside the acceptable operating range.
• The temperature rose above the critical threshold.

Severity: Warning
Significance:
• Action can be deferred, for example:
  – Have severity level above Information and Notice, and below Error
  – Logged to CSM event management
Examples:
• The field replaceable unit (FRU) state changed from online to offline.
• Power supply n+1 redundancy is not available.

Severity: Notice
Significance:
• Actionable events that could be a result of a user action or failure:
  – Have severity level above Information and below Warning and Error
  – Logged to CSM event management
Examples:
• The switch chassis management software rebooted.
• The FRU state changed from not present to present.

Severity: Information
Significance:
• Events that do not require any action, including:
  – Have severity level below Notice, Warning, and Error
  – Provide advanced level of engineering debugging information useful for postmortem analysis
  – Optionally logged in /var/log/csm/syslog.fabric.info on the Cluster Systems Management/Management Server (CSM/MS)
Examples:
• The I2C system passes POST.
• A connection was requested by the <ip_address> using telnetd.
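The ordering in Table 80 can be captured in a few lines when filtering log entries. The following Python sketch is a minimal illustration; the names SEVERITY_RANK and is_at_least are hypothetical, and the numeric ranks are arbitrary values chosen only to preserve the order Information, Notice, Warning, Error.

```python
# Ranks preserve the ordering described in Table 80 (lowest to highest).
SEVERITY_RANK = {"information": 0, "notice": 1, "warning": 2, "error": 3}


def is_at_least(msg_type: str, threshold: str = "Notice") -> bool:
    """Return True if msg_type is at or above the threshold severity."""
    return SEVERITY_RANK[msg_type.strip().lower()] >= SEVERITY_RANK[threshold.strip().lower()]


if __name__ == "__main__":
    for severity in ("Information", "NOTICE", "Warning", "Error"):
        print(severity, is_at_least(severity))
```

The comparison is case-insensitive because the example log entries in this document show the message type in uppercase.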

Switch chassis management log format:

The switch chassis management code logs problems with the switch chassis for things such as power andcooling and logic issues, or other hardware failures not covered by the subnet manager.

The source for switch chassis management logs is on the switch. When remote logging and CSM event management is set up as in “Setting up remote logging” on page 95, these logs are also available on the CSM/MS.

The log format for switch chassis management logs is as follows. The key to recognizing a switch chassis log is that it contains the string |CHASSIS: after the MSG: <msgType> string.

Note: This format is for entries with a severity of Notice or higher. Information messages are not bound by this format, and are for engineering use.

To obtain the switch chassis management logs, enter the following command:

    <prefix>;MSG: <msgType>| CHASSIS: <location>|COND: <condition>|FRU: <fru>|PN: <part number>|DETAIL: <details>

<prefix> = timestamp and card slot number and IP address of the unit reporting the error.

<msgType> is one of the following values: Error, Warning, Notice, Information

<location> is the value from the user settable field called InfiniBand Node Description on the System tab of the GUI or by the CLI command “setIBNodeDesc”. Up to 64 characters. The default location is to the GUID.

<condition> is one of the conditions from the CHASSIS Reporting Table. Text includes a unique ID number.

<fru> associated with the condition.

<part number> is an ASCII text field that identifies the QLogic part number for the associated FRU.

<details> is optional information that is relevant to the particular event.

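The chassis log format above lends itself to simple field extraction. The following Python sketch splits an entry into its labeled fields; the function name parse_chassis_entry is hypothetical, and the sketch assumes that the | character does not occur inside the DETAIL text. The sample entry is the one given in this document.

```python
from typing import Dict


def parse_chassis_entry(entry: str) -> Dict[str, str]:
    """Split '<prefix>;MSG:...|CHASSIS:...|COND:...|FRU:...|PN:...|DETAIL:...'."""
    prefix, separator, body = entry.partition(";")
    if not separator:
        raise ValueError("no ';' between prefix and message body")
    fields = {"prefix": prefix.strip()}
    for part in body.split("|"):
        key, _, value = part.strip().partition(":")  # split on the first colon only
        fields[key.strip()] = value.strip()
    if "CHASSIS" not in fields:
        raise ValueError("not a switch chassis management entry")
    return fields


if __name__ == "__main__":
    sample = ("Oct 9 18:54:37 slot101:172.21.1.29;MSG:NOTICE|"
              "CHASSIS:SilverStorm 9024 GUID=0x00066a00d8000161|"
              "COND:#9999 This is a notice event test|FRU:Power Supply 1|"
              "PN:200667-000|DETAIL:This is an additional information about the event")
    for key, value in parse_chassis_entry(sample).items():
        print(key, "=", value)
```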
Example of a switch chassis management log entry:

Related concepts

“Vendor log flow to CSM event management” on page 26The integration of vendor and IBM log flows is a critical factor in event management.

Subnet manager log format:

The subnet manager logs information about the fabric. This information includes events such as linkproblems, devices appearing and disappearing from the fabric, and, information regarding when it issweeping the network.

The subnet manager log can be either on a switch, in the same log as the switch chassis management log (for embedded subnet managers), or in the syslog (/var/log/messages) of the fabric management server (for host-based subnet managers).

When remote logging and CSM event management are set up as in “Setting up remote logging” on page 95, the subnet manager logs are also available on the Cluster Systems Management/Management Server (CSM/MS).

The key to recognizing a subnet manager log entry is the string |SM: following the string MSG:<msgType>. A switch chassis management log entry carries |CHASSIS: in that position instead, as in the following entry:

Oct 9 18:54:37 slot101:172.21.1.29;MSG:NOTICE|CHASSIS:SilverStorm 9024 GUID=0x00066a00d8000161|COND:#9999 This is a notice event test|FRU:Power Supply 1|PN:200667-000|DETAIL:This is an additional information about the event

The format of the subnet manager log is as follows.

Note: This format is for entries with a severity of Notice or higher. Information messages are not bound by this format, and are for engineering use.

<prefix>;MSG:<msgType>| SM:<sm_node_desc>:port <sm_port_number>|COND:<condition>|NODE:<node_desc>:port <port_number>:<node_guid>|LINKEDTO:<linked_desc>:port <linked_port>:<linked_guid>|DETAIL:<details>

<prefix> is the timestamp and card slot number, or the host name and IP address, of the unit reporting the message.

<msgType> is one of the following values: Error, Warning, Notice, Information.

<sm_node_desc> and <sm_port_number> indicate the node name and port number of the SM that is reporting the message. For ESM, port number=0.

<condition> is one of the conditions from the event SM Reporting Table text that includes a unique ID number.

<node_desc>, <port_number>, and <node_guid> are the InfiniBand Node Description, Port Number, and Node GUID of the port and node that are primarily responsible for the event.

<linked_desc>:<linked_port>:<linked_guid> are optional fields describing the other end of the link described by the <node_desc>, <port_number>, and <node_guid> fields. These fields and the LINKEDTO keyword appear only in applicable messages.

<details> is an optional free-form field with additional information for diagnosing the cause.
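The field layout described above can be sketched as a small parser. This is only an illustration, not a vendor tool: the function name parse_sm_entry, the tolerance for spaces around the | separators, and the rule that an unknown segment is folded into the previous field are all assumptions made for the example.

```python
import re

# Keywords taken from the subnet manager log format described in the text.
KEYWORDS = ("MSG", "SM", "COND", "NODE", "LINKEDTO", "DETAIL")

# "<desc>:port <n>" with an optional trailing ":<guid>" (SM entries carry no GUID).
PORT_RE = re.compile(
    r"^(?P<desc>.*):port\s+(?P<port>\d+)(?::(?P<guid>0x[0-9a-fA-F]+))?$"
)


def parse_sm_entry(line):
    """Return a dict of fields, or None if the line is not a subnet manager entry."""
    prefix, sep, body = line.partition(";")
    if not sep:
        return None
    fields = {"prefix": prefix.strip()}
    last = None
    for segment in body.split("|"):
        key, colon, value = segment.strip().partition(":")
        if colon and key in KEYWORDS:
            fields[key] = value.strip()
            last = key
        elif last is not None:
            # Assumption: an unrecognized segment continues the previous value.
            fields[last] += "|" + segment
    # Simplified recognition rule: both MSG: and SM: must be present.
    if "MSG" not in fields or "SM" not in fields:
        return None
    for key in ("SM", "NODE", "LINKEDTO"):
        match = PORT_RE.match(fields.get(key, ""))
        if match:
            fields[key] = {
                "desc": match["desc"],
                "port": int(match["port"]),
                "guid": match["guid"],
            }
    return fields


SAMPLE = (
    "Oct 10 13:14:37 slot 101:172.21.1.9; MSG:ERROR| "
    "SM:SilverStorm 9040 GUID=0x00066a00db000007 Spine 101, Chip A:port 0| "
    "COND:#99999 Link Integrity Error| "
    "NODE:SilverStorm 9040 GUID=0x00066a00db000007 Spine 101, Chip A:port 10:0x00066a00db000007 | "
    "LINKEDTO:9024 DDR GUID=0x00066a00d90001db:port 15:0x00066a00d90001db|"
    "DETAIL:Excessive Buffer Overrun threshold trap received."
)

entry = parse_sm_entry(SAMPLE)
print(entry["COND"], entry["NODE"]["port"], entry["LINKEDTO"]["guid"])
```

A chassis management entry (CHASSIS: instead of SM:) returns None from this sketch, which matches the recognition rule given in the text.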

Example of a subnet manager log entry:

Oct 10 13:14:37 slot 101:172.21.1.9; MSG:ERROR| SM:SilverStorm 9040 GUID=0x00066a00db000007 Spine 101, Chip A:port 0| COND:#99999 Link Integrity Error| NODE:SilverStorm 9040 GUID=0x00066a00db000007 Spine 101, Chip A:port 10:0x00066a00db000007 | LINKEDTO:9024 DDR GUID=0x00066a00d90001db:port 15:0x00066a00d90001db|DETAIL:Excessive Buffer Overrun threshold trap received.

Related concepts

“Vendor log flow to CSM event management” on page 26
The integration of vendor and IBM log flows is a critical factor in event management.

Diagnosing problems with a cluster
There are many aspects of diagnosing problems with a cluster.

Diagnosing link errors
Use this procedure to isolate link errors to a field replaceable unit (FRU).

Symptoms that lead to this procedure include:

Symptom: Link down message; HCA resource (logical switch, logical HCA, end node) disappearance reported
Reporting mechanism: QLogic log, or Cluster Systems Management/Management Server (CSM/MS) log containing QLogic logs: /var/log/csm/errorlog/[CSM/MS hostname]

Symptom: HCA resource (logical switch, logical HCA, node) disappearance reported
Reporting mechanism: FastFabric health checking with .diff file

Symptom: LED on switch or HCA showing link down
Reporting mechanism: LEDs; Chassis Viewer; Fabric Viewer

Use the following procedure to isolate a link error to a FRU. Be sure to record which steps you have taken in case you have to contact your next level of support, or in case QLogic must be contacted.

The basic flow of the procedure is:

1. Determine if the link errors might be symptoms caused by a user action (such as a restart) or another component failing (such as a switch or a server).
2. Determine the physical location of both ends of the cable.
3. Isolate to the FRU.
4. Repair the FRU.
5. Verify that the link is fixed.
6. Verify that the configuration was not inadvertently changed.
7. If a switch component or HCA was replaced, perform a new health check baseline.
8. Exit the procedure.

Notes:

1. During this procedure, you might need to swap ports to which a cable end is connected. Be sure that you do not swap ports with a link connected to a fabric management server. This action can jeopardize fabric performance and also the capability to do some verification procedures.

2. Once you have fixed the problem, or cannot find a problem after doing anything to disturb the cable, HCA, or switch components associated with the link, it is important to perform the Fast Fabric health check described in step 16 on page 178 to ensure that you have returned the cluster fabric to the intended configuration. The only changes in configuration should be VPD information from replaced parts.

3. If you replace the managed spine for the switch chassis, you need to redo the switch chassis setup for the switch as described in “Installing and configuring vendor InfiniBand switches” on page 116.

1. If this is a switch-to-switch link, use the troubleshooting guide from QLogic. Contact QLogic service and exit this procedure.
2. If this is an IBM HCA-to-switch link, continue to the next step.
3. Map the IBM HCA GUID and port information to a physical location, and determine the switch physical location by using the procedure in “Mapping fabric devices” on page 162.
4. Before proceeding, check for other link problems in the CSM Event Management Log.
5. If there is an appearance notification after a disappearance notification for the link, it is possible that the HCA link bounced, or the node has rebooted.
6. If every link attached to a server is reported as down, or all of them have been reported disappearing and then appearing, perform the following steps:
   a. Check to see if the server is powered off or had been restarted. If the server has been powered off or restarted, the link error is not a serviceable event; therefore, you can end this procedure.
   b. If the server is not powered off and had not been restarted, the problem is with the HCA. Replace the HCA by using the Serviceability task on the Hardware Management Console (HMC) that manages the server in which the HCA is populated, and exit this procedure.
7. If every link attached to the switch chassis has gone down, or all of them have been reported disappearing and then appearing, perform the following steps:
   a. Check to see if the switch chassis is powered off or was powered off at the time of the error. If this is true, the link error is not a serviceable event; therefore, you can end this procedure.
   b. If the switch chassis is not powered off and was not powered off at the time of the error, the problem is in the switch chassis. Contact QLogic service and exit this procedure.
8. If more than two links attached to a switch chassis have gone down, but not all the links with cables have gone down or been reported disappearing and then appearing, the problem is in the switch chassis. Contact QLogic service and exit this procedure.
9. Check the HMC for serviceable events against the HCA. If the HCA was reported as part of a FRU list in a serviceable event, this link error is not a serviceable event; therefore, no repair is required in this procedure. If you replace the HCA or a switch component based on the serviceable event, go to step 16 on page 178 in this procedure. Otherwise, you can exit this procedure.

10. Check the LEDs of the HCA and switch port comprising the link. Use the IBM system manual to determine if the HCA LED is in a valid state, and use the QLogic switch Users Guide to determine if the switch port is in a valid state. In each case, the LED should be lit if the link is up and unlit if the link is down.
11. Check the seating of the cable on the HCA and the switch port. If it appears unseated, reseat the cable and do the following steps. Otherwise, go to the next step.
   a. Check the LEDs.
   b. If the LEDs light, the problem is resolved. Go to step 16 on page 178.
   c. If the LEDs do not light, go to the next step.
12. Check the cable for damage. If the cable is damaged, perform the following procedure. Otherwise, proceed to the next step.
   a. Replace the cable. Before replacing the cable, check the manufacturer and part number to ensure that it is an approved cable. Approved cables are listed on the IBM Clusters with the InfiniBand Switch Web site.
   b. Perform the procedure in “Verifying link FRU replacements” on page 207.
   c. If the problem is fixed, go to step 16 on page 178. If the problem is not fixed, go to the next step.
13. If there are open ports on the switch, do the following steps. Otherwise, go to step 14.
   a. Move the cable connector from the failing switch port to the open switch port.
   b. To see if the problem has been resolved, or if it has moved to the new switch port, use the procedure in “Verifying link FRU replacements” on page 207.
   c. If the problem was “fixed”, then the failing FRU is on the switch. Engage QLogic for repair. Once the repair has been made, go to step 16 on page 178. If the problem was not fixed by swapping ports, proceed to the next step.
   d. If the problem was not “fixed” by swapping ports, then the failing FRU is either the cable or the HCA. Return the switch port end of the cable to the original switch port.
   e. If there is a known good HCA port available for use, move the cable end from the failing HCA port to the known good HCA port. Then, do the following steps. Otherwise, proceed to the next step.
      1) Use the procedure in “Verifying link FRU replacements” on page 207.
      2) If the problem was “fixed”, replace the HCA by using the Repair and Verify procedures for the server and HCA. Once the HCA is replaced, go to step 16 on page 178.
      3) If the problem was not “fixed”, the problem is the cable. Engage QLogic for repair. Once the repair has been made, go to step 16 on page 178.
   f. If there is not a known good HCA port available for use, and the problem has been determined to be the HCA or the cable, replace the FRUs in the following order:
      1) Engage QLogic to replace the cable, and verify the fix by using the procedure in “Verifying link FRU replacements” on page 207. If the problem is fixed, go to step 16 on page 178.
         Note: Before replacing the cable, check the manufacturer and part number to ensure that it is an approved cable. Approved cables are listed on the IBM Clusters with the InfiniBand Switch Web site referenced in Table 2: General Cluster Information Resources, on page 16.
      2) If the cable does not fix the problem, replace the HCA, and verify the fix by using the procedure in “Verifying link FRU replacements” on page 207. If the problem is fixed, go to step 16 on page 178.
      3) If the problem is still not fixed, contact your next level of support. If any repairs are made under direction from support, go to step 16 on page 178 once they have been made.

14. If there are open ports or known good ports on the HCA, perform the following steps. Otherwise, go to the next step.
   a. Move the cable connector from the failing HCA port to the open or known good HCA port.
   b. To see if the problem has been resolved, or if it has moved to the new HCA port, use the procedure in “Verifying link FRU replacements” on page 207. If the problem is fixed, go to step 16.
   c. If the problem was “fixed”, then the failing FRU is the HCA. Replace the HCA by using the Repair and Verify procedures for the server and HCA. After the HCA has been replaced, go to step 16.
   d. If the problem was not “fixed”, then the failing FRU is the cable or the switch. Engage QLogic for repair. Once the problem is fixed, go to step 16.
15. There are no open or available ports in the fabric, or the problem has not been isolated yet. Do the following:
   a. Engage QLogic to replace the cable, and verify the fix by using the procedure in “Verifying link FRU replacements” on page 207. If the problem is fixed, go to step 16.
   b. If the cable does not fix the problem, replace the HCA, and verify the fix by using the procedure in “Verifying link FRU replacements” on page 207. If the problem is fixed, go to step 16.
   c. If the HCA does not fix the problem, engage QLogic to work on the switch. Once the problem is fixed, go to step 16.
16. If the problem has been fixed, run the Fast Fabric health check and check for .diff files. Be aware of any inadvertent swapping of cables. For instructions on interpreting health check results, see “Health checking” on page 135.
   a. If the only difference between the latest cluster configuration and the baseline configuration is new part numbers or serial numbers related to the repair action, run a new health check baseline to account for the changes.
   b. If there are other differences between the latest cluster configuration and the baseline configuration, perform the procedure in “Reestablishing a health check baseline” on page 207. This procedure creates a new baseline so that future health checks do not show configuration changes.
   c. If there were link errors reported in the health check, you need to go back to step 1 on page 176 of this procedure and isolate the problem.
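The port-swap reasoning in steps 13 through 15 can be summarized as a small decision function. This is a minimal sketch of the logic only, not a service tool: the function name suspect_frus and its two arguments are hypothetical, and None is used here to mean that a swap could not be tried because no open or known good port was available.

```python
def suspect_frus(switch_swap_fixed=None, hca_swap_fixed=None):
    """Return the suspect FRUs, in replacement order, after the port-swap tests.

    Each argument is True (the link worked on the other port), False (the
    problem followed the cable), or None (the swap could not be tried).
    """
    if switch_swap_fixed is True:
        return ["switch"]                 # step 13c
    if switch_swap_fixed is False:
        if hca_swap_fixed is True:
            return ["HCA"]                # step 13e, substep 2
        if hca_swap_fixed is False:
            return ["cable"]              # step 13e, substep 3
        return ["cable", "HCA"]           # step 13f: cable first, then HCA
    # No open switch port was available (steps 14 and 15).
    if hca_swap_fixed is True:
        return ["HCA"]                    # step 14c
    if hca_swap_fixed is False:
        return ["cable", "switch"]        # step 14d
    return ["cable", "HCA", "switch"]     # step 15


print(suspect_frus(switch_swap_fixed=False, hca_swap_fixed=None))
```

Whatever the outcome, the procedure still ends at step 16, because any swap or replacement can change the configuration that the health check baseline records.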

Related concepts

“Managing serviceable events on the HMC” on page 23

Diagnosing and repairing switch component problems
Use this procedure if you need to diagnose and repair switch component problems.

Switch internal problems can surface in the Cluster Systems Management/Management Server (CSM/MS) /var/log/csm/errorlog/[CSM/MS hostname] file or in Fast Fabric tools reports or health checks.

If a switch component problem is being reported, do the following:
1. Contact QLogic with the log or report information, or use the repair and troubleshooting procedures in the Switch Users Guide or the QLogic Troubleshooting Guide.
2. If any repair is made, or if anything is done to change the hardware or software configuration for the fabric, use “Reestablishing a health check baseline” on page 207.

Diagnosing and repairing IBM system problems
System problems are most often reported on the Hardware Management Console (HMC) through serviceable events. If an IBM system problem is reported, the repair action might affect the fabric.

Use the procedure found in “Restarting or powering off an IBM system” on page 209.

Diagnosing configuration changes
Use the Fast Fabric health check to determine configuration changes in the fabric.

Configuration changes in the fabric can be determined by using the Fast Fabric health check. For details, see “Health checking” on page 135.
• If you were directed here because you noted that HCA ports might have been swapped, see “Diagnosing swapped HCA ports” on page 186.
• If you were directed here because you noted that switch ports might have been swapped, see “Diagnosing swapped switch ports” on page 187.

Checking for hardware problems affecting the fabric
Use this information to find out how to check for hardware problems that might affect the fabric.

To check for hardware problems that might affect the fabric, perform the following steps:
1. Resolve open serviceable events on the HMC. If you have redundant HMCs configured, you only need to resolve events on one HMC in each set of redundant HMCs. For details, see “Managing serviceable events on the HMC” on page 23.
2. Check for switch or subnet manager errors on the Cluster Systems Management/Management Server (CSM/MS) in /var/log/csm/errorlog/[CSM/MS hostname] for any serviceable events that might not have been addressed yet. Use the procedures in “Symptoms of problems” on page 154 to diagnose problems reported in this log. Look especially at Table 71 on page 154.
   Note: If CSM Event Management is not set up, you can still use the previously mentioned table of symptoms. However, you need to go directly to the switch and subnet manager logs as they are documented in the vendor's Switch Users Guide and Fabric Manager Users Guide.
3. Inspect the LEDs for the devices on the network and perform prescribed service procedures; see Table 72 on page 155.
4. Look for driver errors that do not correspond to any hardware errors reported in SFP or the switch and subnet management logs. Perform appropriate service actions for the discovered error codes, or contact your next level of support.
   For the AIX operating system, use errpt -a on the logical partitions that are exhibiting a performance problem.
   For the Linux operating system, look at /var/log/messages on the logical partitions that are exhibiting a performance problem.

Checking for fabric configuration and functional problems
Use this information to learn how to check for fabric configuration and functional problems.

To check for fabric configuration and functional problems, perform the following procedure.

On the fabric management server, run the all_analysis Fast Fabric health check command. For details, see “Health checking” on page 135. To diagnose symptoms reported by the health check, see Table 73 on page 156.

Note: The health check is most effective for checking for configuration problems if a baseline health check has been performed and is stored in the /var/opt/iba/analysis/baseline directory on the fabric management server. Otherwise, changes in configuration cannot be detected.

If there is no baseline health check for comparison, you need to perform the same type of configuration checks that were done during installation. For details, see “Installing and configuring the InfiniBand switch” on page 116. For the host-based subnet managers, also use the “Installing the fabric management server” on page 91 topic. You need to check that the following configuration parameters match the installation plan. A reference or setting for IBM System p and IBM Power Systems HPC Clusters is provided for each parameter that you check.

Table 81. Health check parameters

Parameter: GID prefix
Reference or setting: The GID prefix must be different for each subnet. For details, see “Planning for global identifier prefixes” on page 35.

Parameter: LMC
Reference or setting: Must be 2 for IBM HPC Clusters.

Parameter: Maximum transfer unit (MTU)
Reference or setting: For details, see “Planning for maximum transfer units (MTUs)” on page 34. This parameter is the fabric MTU and not the MTU in the stack, which can be a much greater number.

Parameter: Cabling plan
Reference or setting: See the vendor Switch Users Guide and Planning and Installation Guide.

Parameter: Balanced topology
Reference or setting: It is typically best to ensure that you have distributed the HCA ports from the servers in a consistent manner across subnets. For example, all corresponding ports on HCAs within servers must connect to the same subnet; that is, all port 1 on HCA 1 must connect to subnet 1, and all port 2 on HCA 1 must connect to subnet 2.

Parameter: Full bandwidth topology?
Reference or setting: Did you choose to implement a full-bandwidth topology by using the vendor recommendations found in the vendor Switch Users Guide and Planning and Installation Guide?

Checking InfiniBand configuration in AIX
Use the AIX operating system to verify that the host channel adapters (HCAs) are available and configured correctly.

Verifying that HCAs are visible to the logical partitions:

To verify that host channel adapters (HCAs) are visible to the logical partitions, perform the following steps:
1. From the Cluster Systems Management/Management Server (CSM/MS), run the following command:
   dsh -av "lsdev -Cc adapter | grep iba" | wc -l
2. Select from the following options:
   • If the number returned by the system matches the number of ibas in the cluster, continue with “Verifying that all HCAs are available to the logical partitions” on page 181.
   • If the number returned by the system does not match the number of HCAs, continue with step 3.
3. Run the following command:
   dsh -av "lsdev -Cc adapter | grep iba" > iba_list
4. Open the generated file, iba_list, and find the number of HCAs that are visible to the system. The HCAs that are visible to the system are listed as Defined or Available.
5. For each logical partition that has HCAs that are not visible, check to see if the HCA was assigned to that logical partition by performing the following steps:
   a. From the HMC that manages the server, verify that the HCA has been assigned to the logical partition.
   b. If the HCA has not been assigned to the logical partition, see “Installing or replacing an InfiniBand GX host channel adapter” on page 124.
   c. After you assign the HCA to the correct logical partition, run the following command:
      dsh -av "lsdev -Cc adapter | grep iba"
   d. Select from the following options:
      • If the HCA is still not visible to the system, continue with step 6.
      • If the HCA is visible to the system, continue with “Verifying that all HCAs are available to the logical partitions” on page 181.
6. If you have an HCA that was assigned to a logical partition but the HCA is not visible to the system:
   a. Open the Manage Serviceable Events task on the HMC that manages each server and review the error logs.
   b. Fix any events that are reported against each server or HCAs in that server.
   c. Perform one of the following recovery procedures:
      • If all the interfaces in a logical partition are not configured, use the procedure in “Recovering all the ibx interfaces in a logical partition in the AIX operating system” on page 199.
      • If only a single interface in a logical partition is not configured, use the procedure in “Recovering a single ibx interface in AIX” on page 199.
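Reading iba_list by eye, as step 4 of the procedure above asks, gets tedious on a large cluster. The following sketch groups the adapters by state instead. It rests on two assumptions that are not stated in this document: that dsh prefixes each output line with "hostname:", and that lsdev prints the device name in the first column and its state in the second. The function name and the sample host names are hypothetical.

```python
from collections import defaultdict


def hcas_by_state(dsh_lines):
    """Group (partition, adapter) pairs by lsdev state, e.g. Available or Defined."""
    states = defaultdict(list)
    for line in dsh_lines:
        host, sep, rest = line.partition(":")
        cols = rest.split()
        # Skip anything that is not "host: ibaN <state> ...".
        if not sep or len(cols) < 2 or not cols[0].startswith("iba"):
            continue
        states[cols[1]].append((host.strip(), cols[0]))
    return dict(states)


# Made-up sample in the assumed "host: name state description" layout.
sample = [
    "c1n1: iba0 Available  InfiniBand host channel adapter",
    "c1n1: iba1 Available  InfiniBand host channel adapter",
    "c1n2: iba0 Defined  InfiniBand host channel adapter",
]
for state, adapters in hcas_by_state(sample).items():
    print(state, len(adapters), adapters)
```

Adapters that land in the Defined group are visible but not yet available, which is the condition the next procedure works through.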

Verifying that all HCAs are available to the logical partitions:

To verify that all HCAs are available to the logical partitions, perform the following steps:
1. Run the following command:
   dsh -av "lsdev -Cc adapter | grep ib | grep Available" | wc -l
2. Select from the following options:
   • If the number returned by the system matches the number of HCAs in the cluster, continue with this procedure to verify that all HCAs are available to the logical partitions.
   • If the number returned by the system does not match the number of HCAs, continue with this procedure.
3. Verify that all servers are powered on.
4. Run the command:
   dsh -av "lsdev -Cc adapter | grep iba | grep -v Available"
   This command returns a list of HCAs that are visible to the system but not available.
5. Restart logical partitions linked to an HCA that is listed as not available.
6. Check the Manage Serviceable Events task on the HMC and the HPSNM for errors related to the links associated with any HCA listed as not available.
7. When all HCAs are listed as available to the operating system, continue with the procedure to verify HCA numbering and the netid for the logical partition.
8. Check HCA allocation across logical partitions. For HPC Clusters, there must be only one active logical partition, and the HCA must be Dedicated to it.
9. Ensure that the fabric is balanced across the subnets. The following command gathers the GID prefixes for the ib interfaces. These should be consistent across all logical partitions.
   dsh -av "netstat -i | grep 'ib.*link' | awk '{split(\$4,a,\".\"); for(i=5;i<=12;i++){printf a[i]}; printf \"\\n\"}'"
10. Verify that the tcp_sendspace and tcp_recvspace attributes are set correctly:
   dsh -av "ibstat -v | grep 'tcp_send.*tcp_recv'"
   Because superpackets should be on, the expected attribute value results are tcp_sendspace=524288 and tcp_recvspace=524288.
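The awk fragment in step 9 splits the dotted link address on "." and concatenates fields 5 through 12, which the procedure treats as the GID prefix. The sketch below mirrors that extraction and then checks consistency per interface. The function names and the sample addresses are made up for illustration; only the field positions come from the command in the text.

```python
from collections import defaultdict


def gid_prefix(link_address):
    """Join dotted fields 5..12 (1-based), as the awk split/printf loop does."""
    return "".join(link_address.split(".")[4:12])


def prefixes_by_interface(rows):
    """rows: (partition, interface, link_address) tuples -> {interface: set of prefixes}."""
    seen = defaultdict(set)
    for _partition, interface, address in rows:
        seen[interface].add(gid_prefix(address))
    return dict(seen)


# Hypothetical addresses: ib0 agrees on both partitions, ib1 does not.
rows = [
    ("c1n1", "ib0", "0.0.0.48.fe.80.0.0.0.0.0.1.0.2.55.0.10.24.d9.1"),
    ("c1n2", "ib0", "0.0.0.49.fe.80.0.0.0.0.0.1.0.2.55.0.10.24.e0.1"),
    ("c1n1", "ib1", "0.0.0.4a.fe.80.0.0.0.0.0.2.0.2.55.0.10.24.d9.2"),
    ("c1n2", "ib1", "0.0.0.4b.fe.80.0.0.0.0.0.3.0.2.55.0.10.24.e0.2"),
]
for interface, prefixes in sorted(prefixes_by_interface(rows).items()):
    status = "consistent" if len(prefixes) == 1 else "MISMATCH"
    print(interface, status, sorted(prefixes))
```

An interface that maps to more than one prefix across partitions suggests that corresponding HCA ports are not all cabled to the same subnet.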

Verifying that the IP maximum transfer unit (MTU) is configured correctly:

To verify that the Internet Protocol (IP) MTU is configured correctly, run the following command:
dsh -av "netstat -i | grep 'ib.*link'" | awk '{print $1" "$2}' | grep -v "65532"

All ibx interfaces should be defined with superpacket=on, which results in an IP MTU of 65532. The IP MTU is different from the InfiniBand fabric MTU.

Verifying that the network interfaces are recognized as up and available:

To verify that the network interfaces are recognized as up and available, run the following command:
dsh -av '/usr/bin/lsrsrc IBM.NetworkInterface Name OpState | grep -p"resource" -v "OpState = 1" | grep ib'

The previous command string should return no interfaces. If an interface is marked down, the command returns the logical partition and ibx interface.

Checking system configuration in the AIX operating system
You can use the AIX operating system to check system configuration.

Related concepts

“Managing serviceable events on the HMC” on page 23

Verifying the availability of processor resources:

1. Run the dsh -av "lsdev -C | grep proc | grep Available" | wc -l command. The total number of processors available in the cluster is displayed.
2. If the total number of processors in the cluster does not appear, complete the following steps:
   a. Verify that all servers are powered on.
   b. Check to see if the dsh command is locating all logical partitions, and fix problems if necessary.
   c. Determine which processors are having problems by running the dsh -av "lsdev -C | grep proc | grep -v Available" command.
   d. After you have identified the problem processor, check the Hardware Management Console (HMC) that controls the server and complete the required service actions. If no serviceable events are found, try any isolation procedures for unconfigured processors that are found in the system service information.
   e. When all processors are available, continue with the procedure in “Verifying the availability of memory resources.”
3. If processor unconfiguration persists, contact your next level of hardware support.
4. Verify that processors are running at expected frequencies by using the dsh -av "/usr/pmapi/tools/pmcycles -M" command.

Verifying the availability of memory resources:

To verify the availability of memory resources, perform the following steps:
1. Run the following command:
   dsh -av "lsattr -E -l mem0 | awk '{ if (\$1 ~/goodsize/ ) { g=\$2} else { p=\$2 }} END {d=p-g; print d}'" | grep -v ": 0"
   Note: The result of the awk parameter is the difference between physical memory and available memory. Unless there is unconfigured memory, if you remove the grep -v ": 0" portion of the command, every logical partition returns 0.
2. If the operating system has access to all memory resources, a command prompt is shown without data. You can exit the diagnostics.
3. If memory requires configuration, check the HMC for serviceable events referring to controlling the logical partitions on the server and repair it as instructed.
   Note: Before you perform a memory service action, ensure that the memory was not unconfigured for a specific reason. If the network still has performance problems, contact your next level of support.
4. If no problems are found in the HMC, perform any service procedures for diagnosing unconfigured memory.
5. If the memory unconfiguration persists, contact your next level of support.
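The awk fragment in step 1 subtracts the goodsize value from the physical memory value reported by lsattr. The same arithmetic is sketched below for one partition's output. The attribute name goodsize comes from the command in the text; the attribute name size, the column layout, and the sample values are assumptions made for this illustration.

```python
def unconfigured_memory_mb(lsattr_lines):
    """Return physical memory minus usable memory from lsattr -E -l mem0 output."""
    values = {}
    for line in lsattr_lines:
        cols = line.split()
        # Keep only "attribute <number> ..." rows; other rows carry no value.
        if len(cols) >= 2 and cols[1].isdigit():
            values[cols[0]] = int(cols[1])
    # Assumption: "size" holds total physical memory, "goodsize" the usable part.
    return values["size"] - values["goodsize"]


# Made-up sample output in the assumed layout.
sample = [
    "goodsize 30720 Amount of usable physical memory in Mbytes False",
    "size     32768 Total amount of physical memory in Mbytes  False",
]
print(unconfigured_memory_mb(sample))
```

A result of 0 corresponds to the case that the grep -v ": 0" filter hides; any other value is the amount of memory that is not configured on that partition.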

Checking InfiniBand configuration in Linux
Use the Linux operating system to verify that the host channel adapters (HCAs) are available and configured correctly.

Verifying that HCAs are visible to the logical partitions:

To verify that host channel adapters (HCAs) are visible to the logical partitions, perform the following steps:
1. From the Cluster Systems Management/Management Server (CSM/MS), run the following command:
   dsh -av "ibv_devices | grep ehca" | wc -l
2. Select from the following options:
   • If the number returned by the system matches the number of HCAs in the cluster, continue with “Verifying that all HCAs are available to the logical partitions.”
   • If the number returned by the system does not match the number of HCAs, continue with step 3.
3. Run the following command:
   dsh -av "ibv_devices | grep ehca" > hca_list
   A list of the HCAs visible to the logical partitions, with their respective GUIDs, is generated.
4. Open the generated file, hca_list, and compare it with the list of all expected HCAs by their GUID.
5. For each logical partition having HCAs that are not visible, check to see if the HCA was assigned to that logical partition by performing the following steps:
   a. From the HMC that manages the server, verify that the HCA has been assigned to the logical partition. If the HCA has not been assigned to the logical partition, see “Installing or replacing an InfiniBand GX host channel adapter” on page 124. If the device was not assigned to the logical partition, see “Installing the operating system and configuring the cluster servers” on page 111. After you assign the HCA to the logical partition, return to this step.
   b. After you assign the HCA to the correct logical partition, run the following command:
      dsh -av "find /sys/bus/ibmebus/devices -name 'lhca*' | wc -l"
   c. Select from the following options:
      • If the HCA is still not visible to the system, continue with step 6.
      • If the HCA is visible to the system, continue with “Verifying that all HCAs are available to the logical partitions.”
6. If you have an HCA that is assigned to a logical partition, but the HCA is not visible to the system, perform the following steps:
   a. Open the Manage Serviceable Events task on the HMC that manages each server and review the error logs.
   b. Fix any events that are reported against each server or HCAs in that server.
   c. Perform one of the following recovery procedures:
      • If all the interfaces in a logical partition are not configured, use the procedure in “Recovering all of the ibx interfaces in a logical partition in the Linux operating system” on page 201.
      • If only a single interface in a logical partition is not configured, use the procedure in “Recovering a single ibx interface in Linux” on page 201.

Verifying that all HCAs are available to the logical partitions:

To verify that all HCAs are available to the logical partitions, perform the following steps:1. Run the following command:

dsh -av "ibv_devinfo | grep PORT_ACTIVE" | wc -l

The total number of ports that are active are shown. Note that HCA has two ports.2. Select from the following options:v If the number returned by the system, divided by two, matches the number of HCAs in the

cluster, continue with the procedure to verify that all HCAs are available to the logical partitions.v If the number returned by the system, divided by two, does not match the number of HCAs,

determine the inactive ports and check their cabling state by following step 3 on page 184.

Clustering with high-performance computing by using InfiniBand hardware 183

Page 194: Power Systems: Clustering with high-performance computing by … · Power Systems: Clustering with high-performance computing by using InfiniBand hardware

v If the number returned by the system, divided by two, does not match the number of HCAs andthe ports are correctly connected, continue with step 5.

3. Verify that all ports are active by running the command:dsh -av "ibv_devinfo | egrep 'hca_id|node_guid|port:|PORT_DOWN’"

4. For each port listed by the system, ensure that the respective cable is connected firmly to the adapteras well as with the switch.v You might want to consider enabling the auto-port-detection feature of eHCA, especially if there

are ports unused by purpose. In order to enable that feature add the following line to the file/etc/modprobe.conf.local:options ib_ehca nr_ports=-1

v In order get a full list of supported options run the command: modinfo ib_ehca

5. Verify that all servers are powered on6. Run the command:

dsh -av "lsdev -Cc adapter | grep sn | grep -v Available"

A list of HCAs that are visible to the system but not available is shown.7. Reboot any logical partition linked to an HCA that is listed as not available.8. Check SFP and HPSNM for errors related to the links associated with any HCA listed as not

available.9. When all HCAs are listed as available to the operating system, continue with the procedure to verify

HCA numbering and the netid for logical partition.10. Check HCA allocation across logical partitions. For HPC Cluster, there should only be one active

logical partition and the HCA should be Dedicated to it.11. Assure that the fabric is balanced across the subnets. The following command string gathers the

GID-prefixes for the ib interfaces. The GID-prefixes should be consistent across all logical partitions.dsh –av 'netstat -i | grep ’ib.*link’ | awk \’{split($4,a,"."); for(i=5;i<=12;i++){printf a[i]}; printf "\n"}\’’

12. Verify that the tcp_sendspace and tcp_recvspace attributes are set correctly:

dsh -av "ibstat -v | grep 'tcp_send.*tcp_recv'"

Because superpackets should be on, the expected attribute value results are tcp_sendspace=524288 and tcp_recvspace=524288.

Note: These attributes correspond to the send_queue_size and recv_queue_size parameters from IPoIB.

Verifying that the IP maximum transfer unit (MTU) is configured correctly:

To verify that the Internet Protocol (IP) MTU is configured correctly, perform the following steps:
1. Run the following command:

dsh -av "find /sys/class/net -name 'ib*' | xargs -I dn cat dn/mtu"

2. Select from the following options:
• If the MTU returned matches the expected MTU value, continue with “Verifying that the network interfaces are recognized as up and available.”
• If the MTU returned does not match the expected MTU value, continue with step 3.
3. For each HCA ibX having the wrong MTU, run the following command on the respective logical partition:

echo <right value> > /sys/class/net/ibX/mtu
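The find command in step 1 prints MTU values without the interface names, so a mismatch can be hard to attribute. The following is a minimal sketch, not part of the documented procedure, that lists each ibX interface on one logical partition whose MTU differs from the expected value. The function name wrong_mtu_interfaces and its optional directory argument are illustrative assumptions; the directory argument exists only so that the function can be tried against a test tree.

```shell
#!/bin/sh
# Sketch: list ibX interfaces whose MTU differs from the expected value.
# $1 = expected MTU
# $2 = sysfs-style base directory (hypothetical parameter, defaults to /sys/class/net)
wrong_mtu_interfaces() {
    expected="$1"
    base="${2:-/sys/class/net}"
    for f in "$base"/ib*/mtu; do
        # Skip the unexpanded glob when no ib interface exists.
        [ -r "$f" ] || continue
        actual=$(cat "$f")
        if [ "$actual" != "$expected" ]; then
            iface=$(basename "$(dirname "$f")")
            printf '%s %s\n' "$iface" "$actual"
        fi
    done
}
```

Running wrong_mtu_interfaces <right value> on a logical partition prints one "interface MTU" pair per mismatch and prints nothing when every ib interface already has the expected MTU.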

Verifying that the network interfaces are recognized as up and available:

To verify that the network interfaces are recognized as up and available, run the following command:

dsh -av '/usr/bin/lsrsrc IBM.NetworkInterface Name OpState | grep -p"resource" -v "OpState = 1" | grep ib'

The preceding command string should return no interfaces. If an interface is marked down, the command returns the logical partition and ibX interface.

Checking system configuration with Linux

You can check your system configuration with the Linux operating system.

Verifying the availability of processor resources:

To verify the availability of processor resources, perform the following steps:
1. Run the following command:

dsh -av "grep processor /proc/cpuinfo" | wc -l

The total number of processors available in the cluster is shown.
2. If the total number of processors available in the cluster is not shown, perform the following steps:
a. Verify that all servers are powered on.
b. Fix any problems with the dsh command not being able to reach all logical partitions.
c. Determine which processors are having problems by running the following command:

dsh -av "lsdev -C | grep proc | grep -v AVAILABLE"

d. After you have identified the processors that are having problems, check the HMC controlling the server and complete any required service actions. If no serviceable events are found, try any isolation procedures for unconfigured processors that are found in the Start of service call.
e. When all processors are available, continue with the procedure to verify memory.
3. If processor de-configuration persists, contact your next level of hardware support.
4. Verify that processors are running at expected frequencies by using the following command:

dsh -av "egrep 'processor|clock' /proc/cpuinfo"
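The output of the frequency check in step 4 can be long on partitions with many processors. The following sketch, which is an illustration and not part of the documented procedure, condenses /proc/cpuinfo-style text into one line per distinct clock value. It assumes that each processor has a line of the form "clock : <value>MHz"; the function name clock_summary is hypothetical.

```shell
#!/bin/sh
# Sketch: count how many processors report each distinct clock value.
# Reads /proc/cpuinfo-style text on standard input.
# Assumes lines of the form "clock : 4704.000000MHz" (layout is an assumption).
clock_summary() {
    awk -F: '
        /^clock/ {
            value = $2
            gsub(/[ \t]/, "", value)   # strip padding around the value
            count[value]++
        }
        END {
            for (v in count) print count[v], v
        }' | sort -k2
}
```

On a logical partition, clock_summary < /proc/cpuinfo prints a count followed by the clock value. More than one output line suggests that some processors are not running at the same frequency as the others.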

Verifying the availability of memory resources:

To verify the availability of memory resources, perform the following steps:
1. Run the following command:

dsh -av "grep MemTotal /proc/meminfo"

2. Select from the following options:
• If the operating system has access to all memory resources, a command prompt is shown without any data. You can now exit this procedure.
• If memory requires configuration, check the HMC controlling the server logical partition and perform any service actions as instructed.

Note: Before you perform a memory service action, make certain that the memory was not unconfigured for a specific reason.

3. If no problems are found in the HMC, refer to the Start of service call and follow the instructions for diagnosing unconfigured memory.
4. If the memory de-configuration persists, contact your next level of support.
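When the MemTotal output covers many logical partitions, a filter can help pick out the partitions that report less memory than expected. The following is a sketch under stated assumptions: it expects dsh-style lines of the form "<host>: MemTotal: <n> kB", and the function name low_memory_hosts and the threshold argument are illustrative, not part of any IBM tool.

```shell
#!/bin/sh
# Sketch: print hosts whose MemTotal is below an expected minimum.
# $1 = minimum expected MemTotal in kB
# Input (assumed layout): "<host>: MemTotal:   <n> kB", one line per host.
low_memory_hosts() {
    awk -v min_kb="$1" '
        /MemTotal:/ {
            host = $1
            sub(/:$/, "", host)            # drop the colon that dsh appends
            if ($(NF - 1) + 0 < min_kb + 0)
                print host, $(NF - 1)
        }'
}
```

For example, piping the dsh output into low_memory_hosts 33554432 lists only the partitions that report less than 32 GB.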

Checking multicast groups

Use this procedure to check multicast groups for proper membership.

To check multicast groups for proper membership, perform the following procedure:
1. If you are running a host-based subnet manager, to check multicast group creation, run a query for the specific subnet on the fabric management server. Remember that you must provide the HCA and port through which the subnet manager connects to the subnet.

/sbin/saquery -o mcmember -h [HCA] -p [HCA port]

Each interface produces an entry like the following example. Note the 4 KB maximum transmission unit (MTU) and 20g rate.

GID: 0xff12601bffff0000:0x0000000000000016
PortGid: 0xfe80000000000000:0002550070010f00
MLID: 0xc004 PKey: 0xffff Mtu: 4096 Rate: 20g PktLifeTime: 2147 ms
QKey: 0x00000000 SL: 0 FlowLabel: 0x00000 HopLimit: 0xff TClass: 0x00

Note: You can check for misconfigured interfaces by using something like the following example, which looks for any MTU that is not 4096 or any rate that is 10g:

/sbin/saquery -o mcmember -h [HCA] -p [port] | egrep -B 3 -A 1 'Mtu: [0-3]|Rate: 10g'

2. If you are running an embedded subnet manager, to check multicast group creation, run the following command on each switch with a master subnet manager. If you have set it up, you might use the dsh command from the Cluster Systems Management/Management Server (CSM/MS) to the switches. For details, see “Setting up remote command processing” on page 103. Remember to use --devicetype IBSwitch::Qlogic when pointing to the switches.

smShowGroups

There should be just one group with all the HCA devices on the subnet being part of the group.

Note: MTU=5 indicates 4 KB. MTU=4 indicates 2 KB.

The following example shows 4 KB MTU.

0xff12401bffff0000:00000000ffffffff (c000)
qKey = 0x00000000 pKey = 0xFFFF mtu = 5 rate = 3 life = 19 sl = 0
0x00025500101a3300 F 0x00025500101a3100 F 0x00025500101a8300 F
0x00025500101a8100 F 0x00025500101a6300 F 0x00025500101a6100 F
0x0002550010194000 F 0x0002550010193e00 F 0x00066a00facade01 F
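The two checks in this procedure can be sketched as small shell helpers. The first translates the MTU code reported by smShowGroups into bytes. Codes 4 and 5 come from the note above; codes 1 through 3 follow the usual InfiniBand MTU encoding and are an assumption as far as this procedure is concerned. The second reads saquery-style entries in the format of the step 1 example and prints the PortGid of every entry whose MTU or rate differs from the expected values. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: convert an smShowGroups MTU code into bytes.
# 4 -> 2048 and 5 -> 4096 are stated in the procedure; 1-3 are assumed
# from the standard InfiniBand encoding (256, 512, 1024).
mtu_code_to_bytes() {
    case "$1" in
        1|2|3|4|5) echo $((128 << $1)) ;;
        *) return 1 ;;
    esac
}

# Sketch: flag multicast members whose MTU or rate is not the expected one.
# $1 = expected MTU (default 4096), $2 = expected rate (default 20g)
# Input: saquery-style entries containing "PortGid:", "Mtu:" and "Rate:" fields.
flag_mcmembers() {
    awk -v want_mtu="${1:-4096}" -v want_rate="${2:-20g}" '
        /PortGid:/ {
            for (i = 1; i < NF; i++)
                if ($i == "PortGid:") gid = $(i + 1)
        }
        /Mtu:/ {
            mtu = ""; rate = ""
            for (i = 1; i < NF; i++) {
                if ($i == "Mtu:")  mtu  = $(i + 1)
                if ($i == "Rate:") rate = $(i + 1)
            }
            if (mtu != want_mtu || rate != want_rate)
                print gid, "mtu=" mtu, "rate=" rate
        }'
}
```

Unlike the egrep example, flag_mcmembers reports one line per suspect interface, which can be easier to scan on a large subnet.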

Diagnosing swapped HCA ports

Use this procedure to diagnose swapped host channel adapter (HCA) ports.

If you swap ports, it might be inconsequential or it might cause performance problems, depending on which ports were swapped. An in-depth analysis of whether a swap can cause performance problems is outside of the scope of this document. However, a rule of thumb applied here is that swapping ports between subnets is not desirable.

If HCA ports have been swapped, the Fast Fabric Health Check reveals the swap when it compares the latest configuration with the baseline configuration. You need to interpret the diff output between the latest and baseline configuration to see if a port swap has occurred.

In general, when HCA ports are swapped, they are swapped on the same HCA, or perhaps on HCAs within the same IBM server. Any more sophisticated swapping would likely be up for debate with respect to whether it is a switch port swap, an HCA port swap, or just a complete reconfiguration.

You might need to reference the Fast Fabric Toolset Users Guide for details on health checking.

Note: This assumes that a baseline health check has been taken previously; see “Health checking” on page 135.
1. Run the all_analysis command.
2. Go to the /var/opt/iba/analysis/latest directory (default output directory structure).
3. Look for the fabric.X:Y.links.diff file, where X is the HCA and Y is the HCA port on the fabric management server that is attached to the subnet. This helps you map directly to the subnet with the potential issue. Presumably, this is not the same HCA that you are trying to diagnose.
4. If there is no fabric.X:Y.links.diff file, there is no port swap. Exit this procedure.
5. If there is a fabric.X:Y.links.diff file, there might be a port swap. Continue to the next step.


6. Use the procedure in “Interpreting .diff files” on page 141 and the procedures in the Fast Fabric Toolset Users Guide to interpret the .diff file.
7. If you intended to swap ports, do the following steps. Otherwise, go to the next step.
a. You will need to take another baseline so that future health checking will not fail. Use the procedure in “Reestablishing a health check baseline” on page 207.
b. Inspect the cable labels. If necessary, change them to reflect the latest configuration.
8. If you did not intend to swap ports, swap them back and go back to the beginning of this procedure to verify that you have been successful in swapping the ports back to their original configuration.
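Steps 2 through 5 amount to checking whether any fabric.X:Y.links.diff file exists in the analysis output directory. The following sketch shows one way to script that check after all_analysis has run. The function name list_link_diffs is hypothetical, and the directory argument defaults to the default output directory named in step 2.

```shell
#!/bin/sh
# Sketch: list fabric.X:Y.links.diff files left by the latest health check.
# $1 = analysis directory (defaults to /var/opt/iba/analysis/latest)
# Returns 0 if at least one diff file exists, 1 otherwise.
list_link_diffs() {
    dir="${1:-/var/opt/iba/analysis/latest}"
    found=1
    for f in "$dir"/fabric.*.links.diff; do
        # Skip the unexpanded glob when no file matches.
        [ -e "$f" ] || continue
        printf '%s\n' "$f"
        found=0
    done
    return "$found"
}
```

A return status of 1 corresponds to step 4 (no port swap). Any file that is listed still needs to be interpreted as described in step 6, because a links.diff file only indicates that there might be a port swap.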

Diagnosing swapped switch ports

Swapping of ports might be inconsequential or it might cause performance problems; it depends on which ports get swapped.

An in-depth analysis of whether a swap can cause performance problems is outside of the scope of this document. However, a rule of thumb applied here is that swapping ports between subnets is not desirable.

If switch ports have been swapped, this will be uncovered by the Fast Fabric Health Check when it compares the latest configuration with the baseline configuration. You need to interpret the diff output between the latest and baseline configuration to see if a port swap has occurred.

In general, when switch ports are swapped, they are swapped between ports on the same switch chassis. Switch ports that appear swapped between switch chassis could be caused by swapping HCA ports on an HCA or between ports in the same IBM server. Any more sophisticated swapping would likely be up for debate with respect to whether it is a switch port swap, an HCA port swap, or just a complete reconfiguration.

You might need to reference the Fast Fabric Toolset Users Guide for details on health checking.
1. Run the all_analysis command.
2. Go to the /var/opt/iba/analysis/latest directory (default output directory structure).
3. Look for the fabric.X:Y.links.diff file, where X is the HCA and Y is the HCA port on the fabric management server that is attached to the subnet. This helps you map directly to the subnet with the potential issue.
4. If there is no fabric.X:Y.links.diff file, there is no port swap. Exit this procedure.
5. If there is a fabric.X:Y.links.diff file, there might be a port swap. Continue to the next step.
6. Use the procedure in “Health checking” on page 135 and the procedures in the Fast Fabric Toolset Users Guide to interpret the .diff file.
7. If you intended to swap ports, perform the following steps. Otherwise, go to the next step.
a. You will need to take another baseline so that future health checks will not fail. Use the procedure in “Reestablishing a health check baseline” on page 207.
b. Inspect the cable labels. If necessary, change them to reflect the latest configuration.
8. If you did not intend to swap ports, swap them back, and go back to the beginning of this procedure to verify that you have been successful in swapping the ports back to their original configuration.

Diagnosing performance problems

Use this procedure to isolate performance problems.

Performance degradation can result from several different problems, including the following items:
• A hardware failure
• Installation problems
• Configuration issues


Before contacting your next level of service, do the following to isolate a performance problem.
1. Check for hardware problems. For details, see “Checking for hardware problems affecting the fabric” on page 179.
2. Check for fabric configuration problems. For details, see “Checking for fabric configuration and functional problems” on page 179.
3. Check for configuration problems in the IBM systems, by performing the following steps:

Note: During these steps, you check for HCA availability, CPU availability, and memory availability.
a. For AIX logical partitions, see:
1) “Checking InfiniBand configuration in AIX” on page 180
2) “Checking system configuration in the AIX operating system” on page 182
b. For Linux logical partitions, see:
1) “Checking InfiniBand configuration in Linux” on page 182
2) “Checking system configuration with Linux” on page 185
4. If performance problems persist, contact your next level of support.

Diagnosing and recovering ping problems

If you have problems when you ping between Internet Protocol (IP) Network Interfaces (ibX), you will need to check the fabric configuration parameters and the host channel adapter (HCA) configuration to ensure that the problem is not caused by a faulty configuration.

Check the IBM clusters with the InfiniBand switch Web site for any known issues or problems that would affect the IP Network Interfaces.

To recover from the problem, perform the following steps:
1. Ensure that the device drivers for the HCAs are at the latest level. This is especially important for any fixes that would affect the IP. Check the IBM Clusters with the InfiniBand Switch Web site referenced in “General cluster information resources” on page 3.
2. Check the IBM clusters with the InfiniBand switch Web site for any known issues or problems that would affect the IP Network Interfaces. Make any required changes.
3. Look for hardware problems by using the “Checking for hardware problems affecting the fabric” on page 179 topic.
4. Check the HCA configuration for the interfaces that cannot ping:
• For the AIX operating system, see “Checking InfiniBand configuration in AIX” on page 180.
• For the Linux operating system, see “Checking InfiniBand configuration in Linux” on page 182.
5. Check for fabric configuration and functional problems by using the “Checking for fabric configuration and functional problems” on page 179 topic.
6. Check multicast group membership at the subnet managers by using the procedure in “Checking multicast groups” on page 185. If there is a problem, recreate the problem interface or interfaces as described in one of the following procedures:
• For the AIX operating system and ibx interfaces, see “Recovering ibx interfaces” on page 199.
• For the Linux operating system and ehcax interfaces, see “Recovering ehcax interfaces” on page 201.
7. Reboot the logical partitions. If rebooting resolves the problem, contact your next level of support.
8. Recycle the subnet managers. If recycling resolves the problem, contact your next level of support.
a. Bring down the Fabric Managers on all fabric management servers: /etc/init.d/iview_fm stop

Verify that the subnet manager is stopped by running: ps -ef | grep iview

b. Restart the Fabric Managers on all fabric management servers: /etc/init.d/iview_fm start

Diagnosing application crashes

Use this procedure to diagnose application crashes.


Diagnosing application crashes with respect to the cluster fabric is like diagnosing performance problems as in “Diagnosing performance problems” on page 187. However, if you know the endpoints involved in the application crash, you can check the state of the routes between the two points to see if there might be an issue. You do this check with the Fast Fabric command:

iba_report -o route -D <destination> -S <source>

There are many ways to format the destination and source of the route query. Only a few examples are shown here. The Fast Fabric Users Guide has more details.

For a particular HCA port to HCA port route query, it is suggested that you use the NodeGUIDs:

iba_report -o route -D nodeguid:<destination NodeGUID> -S nodeguid:<source NodeGUID>

You can find the node GUIDs by using the “Mapping of IBM HCA GUIDs to physical HCAs” on page 162 topic. Instead of grepping for only the first 7 bytes of a node GUID as instructed there, consider recording all 8 bytes. You can use ibstat -n for HCAs in AIX logical partitions and ibv_devinfo -v for HCAs in Linux logical partitions.

If you have a particular logical partition for which you want to determine routes, you can use a portGUID instead:

iba_report -o route -D portguid:<destination portGUID> -S portguid:<source portGUID>

You can find the portGUIDs by using the procedure in “Mapping of IBM HCA GUIDs to physical HCAs” on page 162. You can use ibstat -p for HCAs in AIX logical partitions and ibv_devinfo -v for HCAs in Linux logical partitions.

If after completing this procedure you do not have a solution, see “Diagnosing performance problems” on page 187.
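Because the route query depends on complete 8-byte GUIDs, it can help to validate the recorded values before building the command. The following sketch checks that each GUID is written as 0x followed by 16 hexadecimal digits and then prints the NodeGUID form of the query without running it. The function names and the assumption that GUIDs are recorded with a 0x prefix are illustrative; check the Fast Fabric Users Guide for the formats that iba_report accepts.

```shell
#!/bin/sh
# Sketch: succeed only if the argument is a full 8-byte GUID written as
# "0x" plus 16 hexadecimal digits (the 0x prefix is an assumption).
is_full_guid() {
    printf '%s\n' "$1" | grep -Eq '^0x[0-9a-fA-F]{16}$'
}

# Sketch: print (do not run) the NodeGUID route query for two endpoints.
# $1 = destination NodeGUID, $2 = source NodeGUID
route_query() {
    is_full_guid "$1" && is_full_guid "$2" || return 1
    printf 'iba_report -o route -D nodeguid:%s -S nodeguid:%s\n' "$1" "$2"
}
```

Printing the command rather than running it lets you review the endpoints before you submit the query on the fabric management server.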

Diagnosing management subsystem problems

Use these procedures to debug management subsystem problems. The procedures concentrate on IBM-vendor management subsystem integration issues. Individual units and applications are not covered here.

Determining problems with event management or remote syslogging:

Use this procedure to help you determine where to look when expected events are not appearing in logs.

For details on the flow of logs, see “Vendor log flow to CSM event management” on page 26.

Notes:

1. The term source is used in this section to generically refer to where the log entry should have been originally logged. This will typically either be a fabric management server (for host-based subnet manager logs) or a switch (for switch chassis logs, or embedded subnet manager logs).
2. Before proceeding, verify that you can ping between the Cluster Systems Management/Management Server (CSM/MS) and the source of the log entry. If you cannot ping, then there might be a problem with the Ethernet network between the CSM/MS and the source.
3. If you are using CSM on SLES10 (SP1 or higher), you must ensure that syslog-ng is given read-write permission through AppArmor to the named-pipe /var/log/csm/syslog.fabric.notices.

If you have a problem with event management or remote syslogging detecting subnet manager events or switch events, use this procedure. Start with the table of symptoms below.


Symptom | Procedure
Event is not in the /var/log/csm/errorlog/[CSM/MS hostname] file on the CSM/MS | “Event not in the CSM/MS /var/log/csm/errorlog file”
Event is not in /var/log/csm/syslog.fabric.notices on the CSM/MS | “Event not in the CSM/MS /var/log/csm/syslog.fabric.notices file” on page 191
Event is not in /var/log/csm/syslog.fabric.info on the CSM/MS | “Event not in the CSM/MS /var/log/csm/syslog.fabric.info file” on page 194
Event is not in the log on the fabric management server | “Event not in log on fabric management server” on page 195
Event is not in the log on the switch | “Event not in switch log” on page 196

Event not in the CSM/MS /var/log/csm/errorlog file:

Use this procedure if an expected event is not in the Cluster Systems Management/Management Server (CSM/MS) /var/log/csm/errorlog log file.

If an expected event is not in the /var/log/csm/errorlog/[CSM/MS hostname] file, perform the following steps:
1. Log on to the CSM/MS.
2. Look at the log for the device that is logging the problem.
• For the fabric management server, look at the /var/log/messages file.
• For switches, log on to the switch and look at the log. If necessary, use the switch command-line help or the switch Users Guide for instructions.
3. Verify that you can ping the source, which can be either the fabric management service VLAN IP address for the server or the switch.

Note: If you cannot ping the source device, then you must use standard network debug techniques to isolate the problem on the service VLAN. Consider the CSM/MS connection, the fabric management server connection, the switch connection, and any Ethernet devices on the network. Also, ensure that the addressing has been set up correctly.

4. Select from the following options:
• If you are using CSM on AIX, open the file that Event Management is monitoring on the CSM/MS and look for the log entry. This file is /var/log/csm/syslog.fabric.notices. If it is not in there, go to “Event not in the CSM/MS /var/log/csm/syslog.fabric.notices file” on page 191.
• If you are using CSM on Linux, use the tail command to check the file that Event Management is monitoring on the CSM/MS and look for the log entry. This file is /var/log/csm/syslog.fabric.notices. If it is not in there, go to “Event not in the CSM/MS /var/log/csm/syslog.fabric.notices file” on page 191.

Note: The tail command only shows results if the /var/log/csm/errorlog/[CSM/MS hostname] file is empty, and the syslog daemon tried to write to the following file: /var/log/csm/syslog.fabric.notices.

5. Check the event management sensor-condition-response setup. For details, see the CSM Commands Reference Guide and the man pages. You might also need to reference the CSM Administration Guide for general information about Event Management.
The following table shows which sensors, conditions, and responses apply to various CSM configurations:


Table 82. Sensors, conditions, and responses for CSM configurations

CSM configuration | Sensor | Condition | Response
CSM on AIX and CSM/MS is not a managed node | AIXSyslogSensor | LocalAIXNodeSyslog | LogNodeErrorLogEntry, BroadcastEventsAnyTime (optional)
CSM on AIX and CSM/MS is a managed node | AIXSyslogSensor | AIXNodeSyslog | LogNodeErrorLogEntry, BroadcastEventsAnyTime (optional)
CSM on Linux and CSM/MS is not a managed node | ErrorLogSensor | LocalNodeAnyLoggedError | LogNodeErrorLogEntry, BroadcastEventsAnyTime (optional)
CSM on Linux and CSM/MS is a managed node | ErrorLogSensor | AnyNodeAnyLoggedError | LogNodeErrorLogEntry, BroadcastEventsAnyTime (optional)

a. Ensure that the sensor is set up by using the /usr/bin/lssensor command.

Notes:
• Use the /usr/bin/lssensor command without a parameter to see which sensors are set up.
• Use the /usr/bin/lssensor command with the wanted sensor name to see where the sensor is being run.
• Unless you set it up differently, ensure that the sensor is sensing the /var/log/csm/syslog.fabric.notices log.
• If there is a problem with the setup of the sensor, recover it by using the procedure in “Reconfiguring the CSM event management” on page 196.
b. Ensure that the condition is set up with the /usr/bin/lscondition command.
• Use the /usr/bin/lscondition command without a parameter to check the state of the various conditions (monitored or not monitored).
• Use the /usr/bin/lscondition command with the specific condition as a parameter. The SelectionString parameter shows which sensor the condition is monitoring.
• Ensure that the condition is associated with the sensor.
c. Ensure that the response is linked to the condition with the /usr/bin/lscondresp command.
• Use the /usr/bin/lscondresp command without a parameter to see the complete list of condition-response combinations.
• Use the /usr/bin/lscondresp command with a specific condition as a parameter to get a list of responses associated with that condition.
• Ensure that the response and condition are linked.
6. You might have to restart the Reliable Scalable Cluster Technology (RSCT) subsystem according to the RSCT Users Guide.
7. If the problem has not been fixed, contact your next level of support.

Event not in the CSM/MS /var/log/csm/syslog.fabric.notices file:

Use this procedure if an expected event is not in the Cluster Systems Management/Management Server (CSM/MS) /var/log/csm/syslog.fabric.notices log file.

If an expected event is not in the remote syslog file for notices on the Cluster Systems Management/Management Server (CSM/MS) /var/log/csm/syslog.fabric.notices, perform the following steps.


Note: This procedure is documented for using the syslogd command for syslogging. If you are using another syslog application, such as the syslog-ng command, then you might have to alter this procedure. However, the underlying technique for debugging is the same.
1. Log on to the CSM/MS.
2. Verify that you can ping the source, which can be either the fabric management service VLAN IP address for the server or the switch.

Note: If you cannot ping the source device, then you must use standard network debug techniques to isolate the problem on the service VLAN. Consider the CSM/MS connection, the fabric management server connection, the switch connection, and any Ethernet devices on the network. Also, ensure that the addressing has been set up correctly.

3. Select from the following options:
• If you are using CSM on Linux, continue with step 4.
• If you are using CSM on AIX, continue with step 5.
4. If you are using CSM on Linux, check the AppArmor configuration for syslog-ng to ensure that the line /var/log/csm/syslog.fabric.notices wr, is in the /etc/apparmor.d/sbin.syslog-ng file.
• If the line /var/log/csm/syslog.fabric.notices wr, is in the file, continue with step 5.
• If the line /var/log/csm/syslog.fabric.notices wr, is not in the file, perform the following steps:
a. Add the line /var/log/csm/syslog.fabric.notices wr, to the /etc/apparmor.d/sbin.syslog-ng file before the closing brace (}). You must put the comma at the end of the line.
b. Restart AppArmor by using the /etc/init.d/boot.apparmor restart command.
c. Restart syslog-ng by using the /etc/init.d/syslog restart command.
d. If this action resolves the problem, this ends the procedure. Otherwise, continue with step 5.

5. Check the syslog configuration file and verify that the following entry is in there.
a. If the CSM/MS is running AIX, it is using syslog (not syslog-ng) and the following line should be in the /etc/syslog.conf file. If /etc/syslog.conf does not exist, go to step 5b. Otherwise, after finishing this step, go to step 6 on page 193.

# all local6 notice and above priorities go to the following file
local6.notice /var/log/csm/syslog.fabric.notices

b. If the CSM/MS is running SLES 10 SP1 or higher, it is using syslog-ng and the following lines will be in the /etc/syslog-ng/syslog-ng.conf file:

Note: The actual destination and log entries might vary slightly if they were set up by using the monerrorlog command. Because it is a generic CSM command being used for InfiniBand, the monerrorlog command uses a different name from fabnotices_fifo in the destination and log entries. It is a pseudo random name that is like fifonfJGQsBw.

filter f_fabnotices { facility(local6) and level(notice, alert, warn, err, crit) and not filter(f_iptables); };
destination fabnotices_fifo { pipe("/var/log/csm/syslog.fabric.notices" group(root) perm(0644)); };
log { source(src); filter(f_fabnotices); destination(fabnotices_fifo); };

The following udp and tcp definitions are in the src stanza.

udp(ip("0.0.0.0") port(514));
tcp(ip("0.0.0.0") port(514));

Note: If the fabric management server is only using User Datagram Protocol (UDP) as the transfer protocol for log entries, then the Transmission Control Protocol (TCP) line is not needed. Step 8 on page 193 indicates how to check this. In either case, make note of the protocols and ports and IP addresses in these lines. Using the 0.0.0.0 address accepts logs from any address. If you want more security, you might have a line for each switch and fabric management server from which you want to receive logs. If you have a specific address named, ensure that the source of the log has an entry with its address. Switches use UDP; fabric management servers are configurable for TCP or UDP. In any case, ensure that the UDP line is always used.

6. If the entries are not there, perform the procedure in “Reconfiguring the CSM event management” on page 196. If the procedure fixes the problem, this ends the procedure. If the entries are there, go to the next step.
7. Look at the log on the device that is logging the problem and make sure that the entry is there.
a. For the fabric management server, look at the /var/log/messages file.
b. For switches, log on to the switch and look at the log. If necessary, use the switch command-line help or the switch Users Guide for instructions.
8. If the setup on the CSM/MS is valid and the log entry is in the log for the source, check to see that the source is set up for remote logging:
a. For a fabric management server running the syslog command (not the syslog-ng command), check the /etc/syslog.conf file for the following line:

local6.* @[CSM/MS IP-address]

Note: If the /etc/syslog.conf file does not exist, go to step 8c.

Note: If you make a change, restart the syslogd command by using the /etc/init.d/syslog restart command.

b. Continue with step 9.
c. For a fabric management server running the syslog-ng command, check the /etc/syslog-ng/syslog-ng.conf file for the following lines. Ensure that the destination definition uses the same protocol and port as is expected on the CSM/MS; the definition shown here is UDP on port 514. The CSM/MS information should have been noted in step 5 on page 192. The standard syslogd uses UDP.

filter f_fabinfo { facility(local6) and level(info, notice, alert, warn, err, crit) and not filter(f_iptables); };
destination fabinfo_csm { udp("[CSM/MS IP-address]" port(514)); };
log { source(src); filter(f_fabinfo); destination(fabinfo_csm); };

Note: If you make a change, you will have to restart the syslog daemon by using the /etc/init.d/syslog restart command.

d. For a switch, check that it is configured to log to the CSM/MS by using the logSyslogConfig command on the switch command line. Check that the following information is correct. If it is not, update it by using:

logSyslogConfig -h [host] -p 514 -f 22 -m 1

• The CSM/MS is the host IP address.
• The port is 514 (or another one you have chosen to use).
• The facility is local6.
9. If the problem persists, then try restarting the syslogd command on the CSM/MS and also resetting the source's logging:
a. Log on to the CSM/MS.
b. For AIX CSM, run the refresh -s syslogd command.
c. For Linux CSM, run the /etc/init.d/syslog restart command.
d. If the source is Subnet Manager running on a fabric management server, log on to the fabric management server and run the /etc/init.d/syslog restart command.
e. If the source is a switch, reboot the switch by using the instructions in the Switch Users Guide (by using the reboot command on the switch CLI), or the Fast Fabric Users Guide (by using the ibtest command on the fabric management server).
10. If the problem has not been fixed, contact your next level of support.
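The file checks in steps 4 and 5a can be scripted so that they are easy to repeat after a configuration change. The following is a minimal sketch; the function names are hypothetical, and each function takes the file to inspect as its argument so that it can be tried against a copy of the configuration before it is used on the CSM/MS.

```shell
#!/bin/sh
# Sketch: succeed if the AppArmor profile contains the rule that gives
# syslog-ng access to the fabric notices pipe (step 4).
# $1 = profile file, for example /etc/apparmor.d/sbin.syslog-ng
has_apparmor_rule() {
    grep -Eq '^[[:space:]]*/var/log/csm/syslog\.fabric\.notices[[:space:]]+wr,[[:space:]]*$' "$1"
}

# Sketch: succeed if syslog.conf routes local6.notice to the notices file (step 5a).
# $1 = syslog configuration file, for example /etc/syslog.conf
has_syslog_rule() {
    grep -Eq '^[[:space:]]*local6\.notice[[:space:]]+/var/log/csm/syslog\.fabric\.notices[[:space:]]*$' "$1"
}
```

For example, has_apparmor_rule /etc/apparmor.d/sbin.syslog-ng || echo "rule missing" reports when the line from step 4a still needs to be added. The patterns require the trailing comma in the AppArmor rule, matching the instruction in step 4a.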


Event not in the CSM/MS /var/log/csm/syslog.fabric.info file:

Use this procedure if an expected event is not in the Cluster Systems Management/Management Server (CSM/MS) error log.

If an expected event is not in the remote syslog file (/var/log/csm/syslog.fabric.info), perform the following steps:

Note: This procedure is documented for the syslogd command for syslogging. If you are using another syslog application, such as syslog-ng, then you might have to alter this procedure. However, the underlying technique for debugging remains the same.
1. Log on to the CSM/MS.
2. Verify that you can ping the source, which can be either the cluster VLAN IP address for the fabric management server or the switch.

Note: If you cannot ping the source device, then you can use standard network debugging techniques to isolate the problem on the service VLAN. Consider the CSM/MS connection, the fabric management server connection, the switch connection, and any Ethernet devices on the network. Also, ensure that the addressing has been set up correctly.

3. Check the syslog configuration file and verify that the following entry is in there.
a. If the CSM/MS is using syslog (not syslog-ng), the following line must be in /etc/syslog.conf. If /etc/syslog.conf does not exist, go to step 3b.
# all local6 info and above priorities go to the following file
local6.info /var/log/csm/syslog.fabric.info

b. If the CSM/MS is using syslog-ng, the following lines must be in /etc/syslog-ng/syslog-ng.conf:
filter f_fabinfo { facility(local6) and level(notice, alert, warn, err, crit) and not filter(f_iptables); };
destination fabnotices_fifo { pipe("/var/log/csm/syslog.fabric.notices" group(root) perm(0644)); };
log { source(src); filter(f_fabnotices); destination(fabnotices_fifo); };

udp(ip("0.0.0.0") port(514));
tcp(ip("0.0.0.0") port(514));

Note: If the fabric management server is only using UDP as the transfer protocol for log entries, then the TCP line is not needed. Step 6 indicates how to check this whether you use UDP or TCP. In either case, note the protocols and ports and IP addresses in these lines. Using 0.0.0.0 accepts logs from any address. If you want more security, you might have a line for each switch and fabric management server from which you want to receive logs. If you have a specific address named, ensure that the source of the log has an entry with its address. Switches use UDP. Fabric management servers are configurable for TCP or UDP.

4. If the entries are not there, perform the following steps:
a. Edit the /etc/syslog.conf (or syslog-ng.conf) file and add the entries to the end of the file.
b. Restart the syslogd command. For AIX hosts, run the refresh -s syslogd command. For Linux hosts, run the /etc/init.d/syslog restart command.
5. Ensure there is a log on the device that is reporting a problem.
a. For the fabric management server, look at the /var/log/messages file.
b. For switches, log on to the switch and look at the log. If necessary, use the switch command-line help or the Switch Users Guide for instructions.
6. If the setup on the CSM/MS has proven to be good and the log entry is in the log for the source, check to see that the source is set up for remote logging. To check for remote logging, log on to the source and check for the following information:


a. For a fabric management server running syslog (not syslog-ng), check the /etc/syslog.conf file for the following line. If /etc/syslog.conf does not exist, continue with step 6b.
local6.* @[put CSM/MS IP-address]

Note: If you make a change, you need to restart the syslogd command.
b. For a fabric management server running syslog-ng, check the /etc/syslog-ng/syslog-ng.conf file for the following lines. Ensure that the destination definition uses the same protocol and port as is expected on the CSM/MS. The definition shown here is UDP on port 514. The CSM/MS information should have been noted in step 3 on page 194. The standard syslogd command uses UDP. Other commands, such as the syslog-ng command, might use either TCP or UDP.
filter f_fabinfo { facility(local6) and level(info, notice, alert, warn, err, crit) and not filter(f_iptables); };
destination fabinfo_csm { udp("[CSM/MS IP-address]" port(514)); };
log { source(src); filter(f_fabinfo); destination(fabinfo_csm); };

Note: If you make a change, you need to restart the syslogd command.
c. For a switch, check that it is configured to log to the CSM/MS by using logSyslogConfig on the switch command line. Check that the following information is correct:
• The CSM/MS is the host IP address.
• The port is 514 (or another that you have chosen to use).
• The facility is 22.
• The mode is 1.

7. If the problem persists, try restarting the syslogd command on the CSM/MS and resetting the logging on the source:
a. Log on to the CSM/MS.
b. For AIX hosts, run the refresh -s syslogd command.
c. For Linux hosts, run the /etc/init.d/syslog restart command.
d. If the source is the fabric management server, use the /etc/init.d/syslog restart command.
e. If the source is a switch, reboot the switch by using the instructions in the Switch Users Guide.

8. If the problem has not been fixed, contact your next level of support.
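
The syslog.conf checks in this procedure (the local6 rule on the CSM/MS and the forwarding rule on the source) can be scripted. The following is a minimal sketch, assuming the classic syslog.conf layout of a selector followed by whitespace and an action. The function name has_local6_rule is hypothetical and is not part of CSM.

```sh
# has_local6_rule CONF ACTION
# Succeeds when CONF holds an uncommented rule whose selector starts with
# "local6." and whose action field equals ACTION.
# (Hypothetical helper; assumes "selector <whitespace> action" lines.)
has_local6_rule() {
    awk -v action="$2" '
        $1 ~ /^local6\./ && $2 == action { found = 1 }
        END { exit !found }
    ' "$1"
}
```

On the CSM/MS, this might be called as has_local6_rule /etc/syslog.conf /var/log/csm/syslog.fabric.info. On a fabric management server, the action would be the forwarding target, that is, @ followed by the CSM/MS IP address.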

Event not in log on fabric management server:

Use this procedure if an expected log entry is not in the log on the fabric management server.

If the expected log entry is not in the log (/var/log/messages) for the fabric management server, perform the following steps:

Note: The syslogd command for syslogging is documented in this procedure. If you are using another syslog application, such as the syslog-ng application, then you might have to alter this procedure. However, the underlying technique for debugging remains the same.

1. Log on to the fabric management server.
2. Open the /var/log/messages file and look for the expected log entry.
3. Select from the following options:
• If the log entry is in the /var/log/messages file, the problem is not with the log on the fabric management server. This ends the procedure.
• If the log entry is not in the syslog file for the source, then the problem is with the logging subsystem. Continue with step 4.
4. If this log was a test log entry using the logger command, or a similar command, check the syntax and try the command again if it was incorrect.


5. If the source is the fabric management server, check to make sure that the syslogd command is running by using the ps command. If the syslogd command is not running, start it by using /etc/init.d/syslog start.

6. If you are missing subnet manager logs, verify that the Fabric Manager is running, and start it if it is not. For details, see the Fabric Manager Users Guide from the vendor.

7. If the syslogd command is running, the subnet manager is running, and you did not have a problem with syntax for the logger command, then try restarting the syslogd command by using the /etc/init.d/syslog restart command.

8. Verify that there is an entry in the syslog.conf file or the syslog-ng.conf file that directs logs to the /var/log/messages file.

9. If the fabric management server is still not logging correctly, try troubleshooting techniques documented for the syslogd command in the operating system documentation. If the problem persists, contact your next level of support.
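
A test log entry of the kind mentioned in step 4 can be generated with the logger command and then searched for in the log file. The following is a sketch under stated assumptions: the function names and the fabtest tag are hypothetical, and the local6.notice priority is chosen only because this document uses the local6 facility for fabric logs.

```sh
# new_marker: print a marker string that is unlikely to already be in a log.
new_marker() {
    echo "fabtest-$$-$(date +%Y%m%d%H%M%S)"
}

# send_test_entry MARKER: log MARKER at facility local6, priority notice.
# The tag "fabtest" is an arbitrary, hypothetical choice.
send_test_entry() {
    logger -p local6.notice -t fabtest "$1"
}

# entry_logged FILE MARKER: succeed if MARKER appears verbatim in FILE.
entry_logged() {
    grep -F -- "$2" "$1" > /dev/null
}
```

A typical sequence would be m=$(new_marker), then send_test_entry "$m", then after a short wait entry_logged /var/log/messages "$m". Whether the entry reaches /var/log/messages depends on the local syslog configuration.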

Event not in switch log:

Use this procedure if an expected event is not in the switch log.

If the expected log entry is not in the switch log, do the following:
1. Log on to the switch and look at the log by using the command in the vendor's Switch Users Guide or found in the command-line help.
2. If you are expecting subnet manager log entries in the log, and they are not there, then start the subnet manager by using the instructions found in the vendor's Switch Users Guide or found in the command-line help.

3. If there is still a problem with logging on a switch, contact your next level of support.

Reconfiguring the CSM event management:

This procedure is used to reconfigure a Cluster Systems Management (CSM) event management environment that has lost its original configuration.

When a CSM event management environment loses its configuration, it might be necessary to unconfigure it and reconfigure it. The procedure to use depends on whether the CSM is running on the AIX operating system or the Linux operating system.

Reconfiguring CSM on the AIX operating system:

To reconfigure CSM event management on the AIX operating system, complete the following steps.
1. Log on to the Cluster Systems Management/Management Server (CSM/MS).
2. Run the lscondresp command to determine which condition and responses you are using. The typical condition name is either LocalAIXNodeSyslogError for a CSM/MS that is not a managed node, or AIXNodeSyslogError for a CSM/MS that is a managed node. The typical response name is LogNodeErrorLogEntry. The BroadcastEventsAnyTime condition can also be configured. Finally, the system administrator might have defined another response to be used specifically at this site.

3. Stop the condition response by using the following command:
stopcondresp <condition name> <response_name>

4. Delete all the CSM-related entries from the /etc/syslog.conf file. These entries are defined in “Setting up remote logging” on page 95. The commented entry might not exist.
# all local6 notice and above priorities go to the following file
local6.notice /var/log/csm/syslog.fabric.notices

5. Restart the syslogd command by using the refresh -s syslogd command.
6. Set up the AIXSyslogSensor again by completing the following steps.


a. Copy the old sensor into a new definition file by using the lsrsrc -i -s "Name='AIXSyslogSensor'" IBM.Sensor > /tmp/AIXSyslogSensorDef command.
b. Edit the /tmp/AIXSyslogSensorDef file.
c. Change the command to "/opt/csm/csmbin/monaixsyslog -p "local6.notice" -f /var/log/csm/syslog.fabric.notices".
d. After creating and editing the /tmp/AIXSyslogSensorDef file, remove the sensor by using the following command.
rmsensor AIXSyslogSensor

Note: If the sensor did not exist, you can still continue to the next step.
e. Create the new sensor and keep the management scope set to local by using the following command.
CT_MANAGEMENT_SCOPE=0 mkrsrc -f /tmp/AIXSyslogSensorDef IBM.Sensor

Note: Local management scope is required; otherwise, an error is shown indicating that the node (CSM/MS) is not in the NodeNameList file.

7. Delete everything in the error monitoring directory, /var/opt/csm__aix_syslog.

8. Restart the condition response association by using the startcondresp <condition name> <response name> command.

9. A short time later, the file monaixsyslog_run-local6.notice--var-log-csm-syslog.fabric.notices appears in the /var/opt/csm_err_mon directory.

10. Check the /etc/syslog.conf configuration file to ensure that the appropriate entries were added by the monaixsyslog command. Ensure that there is only one such entry in the configuration file.
local6.notice /var/log/csm/syslog.fabric.notices
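
The "only one such entry" check in the last step can be done by counting matching lines. This is a sketch only; count_rules is a hypothetical helper name, and it assumes one rule per line with the selector and action as the first two whitespace-separated fields.

```sh
# count_rules CONF SELECTOR ACTION
# Prints how many uncommented lines in CONF have exactly this selector
# and action. Lines that start with "#" never match, because their first
# field is not the bare selector.
count_rules() {
    awk -v sel="$2" -v act="$3" '
        $1 == sel && $2 == act { n++ }
        END { print n + 0 }
    ' "$1"
}
```

For example, count_rules /etc/syslog.conf local6.notice /var/log/csm/syslog.fabric.notices would be expected to print 1 after a clean reconfiguration; a larger number suggests a duplicate entry to remove.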

Reconfiguring CSM on the Linux operating system:

To reconfigure CSM event management on the Linux operating system, complete the following steps.
1. Log on to the CSM/MS.
2. Run the lscondresp command to determine which condition and responses you are using. The typical condition name is either LocalNodeAnyLoggedError for a CSM/MS that is not a managed node, or AnyNodeAnyLoggedError for a CSM/MS that is a managed node. The typical response name is LogNodeErrorLogEntry. The condition BroadcastEventsAnyTime can also be configured. Finally, the system administrator might have defined another response to be used specifically at this site.

3. Stop the condition response by using the following command.
stopcondresp <condition name> <response_name>

4. Delete all the CSM-related entries from the /etc/syslog.conf file or the /etc/syslog-ng/syslog-ng.conf file. These entries are defined in “Setting up remote logging” on page 95. Typically, the entries look like the following example. However, the monerrorlog command uses a different name from fabnotices_fifo in the destination and log entries. It uses a pseudo-random name that looks like fifonfJGQsBw.
destination fabnotices_fifo { pipe("/var/log/csm/syslog.fabric.notices" group(root) perm(0644)); };
log { source(src); filter(f_fabnotices); destination(fabnotices_fifo); };

5. Ensure that the following f_fabnotices filter remains in the /etc/syslog-ng/syslog-ng.conf file.
filter f_fabnotices { facility(local6) and level(notice, alert, warn, err, crit) and not filter(f_iptables); };

6. Restart the syslogd command by using the /etc/init.d/syslog restart command.
7. Set up the ErrorLogSensor again by using the following steps:


a. Copy the old sensor into a new definition file by using the lsrsrc -i -s "Name='ErrorLogSensor'" IBM.Sensor > /tmp/ErrorLogSensorDef command.
b. Edit the /tmp/ErrorLogSensorDef file.
c. Change the command to "/opt/csm/csmbin/monerrorlog -p f_fabnotices -f /var/log/csm/syslog.fabric.notices".
d. After creating and editing the /tmp/ErrorLogSensorDef file, you can remove the sensor by using the following command.
rmsensor ErrorLogSensor

Note: If the sensor did not exist, you can still continue to the next step.
e. Create the ErrorLogSensor and keep the management scope local by using the following command.
CT_MANAGEMENT_SCOPE=0 mkrsrc -f /tmp/ErrorLogSensorDef IBM.Sensor

Note: Local management scope is required; otherwise, an error is shown indicating that the node (CSM/MS) is not in the NodeNameList file.

f. Run the following command.
/opt/csm/csmbin/monerrorlog -f "/var/log/csm/syslog.fabric.notices" -p "f_fabnotices"

Note: The -p parameter points to the f_fabnotices entry that was defined in /etc/syslog-ng/syslog-ng.conf.

g. If you receive an error from the monerrorlog command indicating a problem with syslog, there is probably a typographical error in the /etc/syslog-ng/syslog-ng.conf file. The message can be like the following example. The key is that “syslog” is in the error message. The * is a wildcard.
monerrorlog: * syslog *

1) Look for the mistake in the /etc/syslog-ng/syslog-ng.conf file by reviewing the previous steps that you have taken to edit the syslog-ng.conf file.
2) Remove the destination and log lines from the end of the syslog-ng.conf file.
3) Rerun the /opt/csm/csmbin/monerrorlog -f "/var/log/csm/syslog.fabric.notices" -p "f_fabnotices" command.
4) If you get another error, examine the file again and repeat the recovery procedures.

8. Delete everything in the error monitoring directory /var/opt/csm_err_mon.
9. Edit the AppArmor setup file for syslog-ng, /etc/apparmor.d/sbin.syslog-ng.
10. Ensure that "/var/log/csm/syslog.fabric.notices wr," is in the file before the "}". You must include the comma at the end of the line.
11. If you changed the sbin.syslog-ng file, restart the AppArmor application by using the /etc/init.d/boot.apparmor restart command.
12. Restart the condition response association by using the startcondresp <condition name> <response name> command.
13. In a few minutes, the following file appears in the /var/opt/csm_err_mon directory.
.monerrorlog_run-f_fabnotices--var-log-csm-syslog.fabric.notices

14. Check the /etc/syslog-ng/syslog-ng.conf configuration file to ensure that the appropriate entries were added by the monerrorlog command. Typically, the entries look like the following example. However, the monerrorlog command uses a different name from fabnotices_fifo in the destination and log entries. It uses a pseudo-random name that looks like fifonfJGQsBw.
destination fabnotices_fifo { pipe("/var/log/csm/syslog.fabric.notices" group(root) perm(0644)); };
log { source(src); filter(f_fabnotices); destination(fabnotices_fifo); };
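
The AppArmor rule required in step 10 is easy to mistype, particularly the trailing comma. The following sketch checks a profile file for the rule. The helper name profile_allows_write is hypothetical, and the check assumes that the path and the "wr," permission are the first two whitespace-separated fields on the line, as shown in step 10.

```sh
# profile_allows_write PROFILE PATH
# Succeeds when PROFILE contains a rule line of the form "PATH wr,"
# (the trailing comma is part of the rule and is required).
profile_allows_write() {
    awk -v p="$2" '
        $1 == p && $2 == "wr," { ok = 1 }
        END { exit !ok }
    ' "$1"
}
```

For example: profile_allows_write /etc/apparmor.d/sbin.syslog-ng /var/log/csm/syslog.fabric.notices || echo "rule missing or malformed".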


Recovering from problems with clusters
There are many different aspects to recovering from a problem with clusters.

Recovering ibx interfaces
There are several levels at which you can recover ibx interfaces, which are the interfaces to the host channel adapter (HCA). To recover the ibx interfaces, you use the AIX operating system.

Recovering a single ibx interface in AIX:

This procedure is used to recover a single ibx interface when using the AIX operating system.

Before performing this procedure, ensure you have GPFS.

To recover a single ibx interface, run the ifconfig [ib interface] up command.

If the ifconfig [ib interface] up command does not recover the ibx interface, you might need to completely remove and rebuild the interface by using the following commands.
rmdev -l [ibX]
chdev -l [ibX] -a superpacket=on -a state=up -a tcp_sendspace=524288 -a tcp_recvspace=524288 -a srq_size=16000
mkdev -l [ibX]

Recovering all the ibx interfaces in a logical partition in the AIX operating system:

If you must recover all the ibx interfaces in a server, it is probable that you need to remove the interfaces and rebuild them.

Before using this procedure, try to recover the ibx interface by using the procedure in “Recovering a single ibx interface in AIX.”

Before performing this procedure, ensure that you have GPFS.

The following commands can be run individually, but the following example uses loops on the command line. The procedure must be modified based on the number of ibx interfaces in the server. The following procedure is an example for a server with eight ib interfaces.
# get original set of ibX interfaces
a=`lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`

# remove the ibX interfaces
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
# rmdev is only used for recovery purposes, and not during installation
rmdev -l $i -d
done

# remove the iba(s)
for i in `lsdev | egrep "iba[0-9]" | awk '{print $1}'`
do
rmdev -l $i -d
done

# remove the icm
rmdev -l icm -d

# map the ib interfaces to iba(s) and addresses
ib0=iba0; ib1=iba0
ib2=iba1; ib3=iba1
ib4=iba2; ib5=iba2
ib6=iba3; ib7=iba3
# addresses are just examples
ib0addr=192.168.1.1; ib1addr=192.168.2.1
ib2addr=192.168.3.1; ib3addr=192.168.4.1
ib4addr=192.168.5.1; ib5addr=192.168.6.1
ib6addr=192.168.7.1; ib7addr=192.168.8.1

cfgmgr

# re-create the icm
mkdev -c management -s infiniband -t icm

# re-make the iba(s) - this loop really just indicates to step through
# all of the iba(s) and indicate the appropriate ibXs for each
# There should be two ibX interfaces for each iba.
for i in $a
do
eval "iba=\$${i}"
eval "ib_addr=\$${i}addr"
# you must provide the ibX interface number (ib0-7) and address
# for each ibX interface separately.
mkiba -A $iba -i $i -a $ib_addr -p 1 -P 1 -S up -m 255.255.255.0
done

# Re-create the ibX interfaces correctly
# This assumes that the default p_key (0xffff) is being used for
# the subnet
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
chdev -l $i -a superpacket=on -a tcp_recvspace=524288 -a tcp_sendspace=524288 -a srq_size=16000 -a state=up
done
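
The mkiba loop in this procedure depends on eval to read the variable that is named after each interface (for example, ib0 and ib0addr). The following sketch isolates that lookup so it can be tried without touching any devices. The function name lookup_ib is hypothetical, and the two mappings are example values in the same form that the procedure uses.

```sh
# Example mapping, in the same form the recovery procedure uses.
ib0=iba0; ib0addr=192.168.1.1
ib1=iba0; ib1addr=192.168.2.1

# lookup_ib IF
# Prints "IF ADAPTER ADDRESS" for interface name IF by reading the shell
# variables named IF and IFaddr through eval (indirect expansion).
lookup_ib() {
    eval "adapter=\$$1"
    eval "addr=\$${1}addr"
    echo "$1 $adapter $addr"
}
```

Running lookup_ib for each interface before the real loop is one way to confirm that every ibX name has both an adapter and an address assigned; an empty field in the output points to a missing variable.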

Recovering an ibx interface tcp_sendspace and tcp_recvspace:

Perform the following to recover the tcp_sendspace and tcp_recvspace attributes for an ibx interface.

Setting the ibx interface to superpacket=on accomplishes this too. However, setting the interface to superpacket=on does not work if the interface had previously been set to superpacket=on and the tcp_sendspace or tcp_recvspace attribute values have been changed. Use the following command to set the tcp_sendspace and tcp_recvspace attribute values.
# ibX = ib0, ib1, ib2, ib3, ib4, ib5, ib6 or ib7
chdev -l ibX -a tcp_sendspace=524288 -a tcp_recvspace=524288

Recovering ml0 in AIX:

This procedure provides the commands required to recover ml0 when using the AIX operating system.

To recover the ml0 interface in AIX, you can remove it and rebuild it by using the following commands.
rmdev -l ml0 -d
cfgmgr

# $ml0ip = the ip address of ml0 in this logical partition
chdev -l ml0 -a netaddr=$ml0ip -a netmask=255.255.255.0 -a state=up

Recovering InfiniBand Connection Manager (ICM) in AIX:

This information provides direction to recover ICM when using the AIX operating system.


Recovering the ICM in the AIX operating system involves removing all InfiniBand interfaces and then rebuilding them along with the ICM. This procedure is shown in “Recovering all the ibx interfaces in a logical partition in the AIX operating system” on page 199.

Recovering ehcax interfaces
There are several levels at which you can recover ehcax interfaces, which are the interfaces to the host channel adapter (HCA) in the Linux operating system.

Recovering a single ibx interface in Linux:

This procedure is used to recover a single ibx interface when using the Linux operating system.

To recover a single ibx interface in the Linux operating system, perform the following procedure:
1. To recover a single ibx interface, first try to take down the interface and then bring it back up by using the following commands.
a. ifconfig ibX down
b. ifconfig ibX up

2. If these commands do not recover the ibx interface, check for any error messages in the dmesg output or in the /var/log/messages file, and perform the appropriate service associated with the error messages.

3. If the problem persists, contact your next level of support.

Recovering all of the ibx interfaces in a logical partition in the Linux operating system:

Use this procedure to recover all of the ibx interfaces in a logical partition in the Linux operating system.

To recover all of the ibx interfaces in a Linux partition, complete the following steps:
1. Run the /etc/init.d/openibd restart command.

Note: This command stops all devices, removes all OpenFabrics Enterprise Distribution (OFED) software modules, and reloads them.

2. Verify that the interfaces are up and running by using the ifconfig | grep ib command.
3. If the interfaces are not started, run the /etc/init.d/network restart command to bring up the network of ibx interfaces.

Recovering to 4 KB maximum transfer units in the AIX operating system
Use this procedure if your cluster should be running with 4 KB maximum transfer units (MTUs), but it has already been installed and is not currently running at 4 KB MTU. This procedure is only valid for clusters using the AIX operating system.

To complete the recovery to 4 KB MTU, the following overall tasks must be completed.
1. Configuring the subnet manager to 4 KB MTU
2. Setting the host channel adapters (HCAs) to 4 KB MTU
3. Verifying that the subnet is set up correctly

To recover to 4 KB MTU, perform the following procedure:

Note: These instructions are written for recovering a single fabric management server subnet at a time.

Configuring the subnet manager for 4 KB MTU:

To configure the subnet manager for 4 KB MTU, perform the following steps:
1. Select from the following options:
• If you are using a host-based subnet manager, continue with step 2.
• If you are using an embedded subnet manager, continue with step 9.

2. If you are using a host-based subnet manager, log on to the fabric management server.
3. Stop the subnet manager by using the /etc/init.d/iview_fm stop command.
4. Verify that the subnet manager is stopped by running the ps -ef | grep iview command.
5. Edit the Fabric Manager configuration file (/etc/sysconfig/iview_fm.config) and, as needed, update the lines defining SM_X_def_mc_mtu to 0x5, where SM_X is the subnet manager number on this fabric management server. Update all subnet manager instances that are to be configured for 4 KB MTU. The following example has four subnet managers in the configuration file. These lines would not be contiguous in an actual configuration file.
SM_0_def_mc_mtu=0x5
SM_1_def_mc_mtu=0x5
SM_2_def_mc_mtu=0x5
SM_3_def_mc_mtu=0x5

6. Ensure that the rate matches what was planned in “Planning for maximum transfer units (MTUs)” on page 34, where 0x3 = SDR and 0x6 = DDR. The following example shows this configuration.
SM_0_def_mc_rate=0x3 or 0x6
SM_1_def_mc_rate=0x3 or 0x6
SM_2_def_mc_rate=0x3 or 0x6
SM_3_def_mc_rate=0x3 or 0x6

7. Start the subnet manager by using the /etc/init.d/iview_fm start command.
8. Continue with “Setting the host channel adapters (HCAs) to 4 KB MTU.”
9. If you are using an embedded subnet manager, log on to the switch command-line interface (CLI), or issue these commands from the fabric management server by using cmdall, or from the Cluster Systems Management/Management Server (CSM/MS) by using the dsh command. If you use the dsh command, remember the parameter, --devicetype IBSwitch::Qlogic, as outlined in “Remotely accessing QLogic switches from CSM/MS” on page 144.

10. Stop the subnet manager by using the smControl stop command.
11. Set up the broadcast or multicast group MTU by using the smDefBcGroup 0xffff 5 command.
12. Enable the broadcast or multicast group by using the smDefBcGroup enable command.
13. Start the subnet manager by using the smControl start command.
14. Continue with “Setting the host channel adapters (HCAs) to 4 KB MTU.”
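
For a host-based subnet manager, the edit in step 5 can be double-checked before the subnet manager is restarted. The following is a minimal sketch, assuming the configuration file uses one KEY=value setting per line as in the example above. The function name non_4k_sm is hypothetical.

```sh
# non_4k_sm CONFIG
# Prints every SM_<n>_def_mc_mtu line in CONFIG whose value is not 0x5.
# No output means that every listed subnet manager instance is set for
# 4 KB MTU.
non_4k_sm() {
    grep -E '^SM_[0-9]+_def_mc_mtu[[:space:]]*=' "$1" |
        grep -v -E '=[[:space:]]*0x5[[:space:]]*$'
}
```

For example, non_4k_sm /etc/sysconfig/iview_fm.config lists the instances that still need to be updated. Instances that are intentionally left at another MTU also appear in the output, so the result needs to be compared with the plan for the cluster.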

Setting the host channel adapters (HCAs) to 4 KB MTU:

If your server is running the AIX operating system, you must perform this procedure to correctly set up for 4 KB MTU. To determine if you should be using 4 KB MTU, see “Planning for maximum transfer units (MTUs)” on page 34 and the “QLogic switch planning work sheets” on page 66.

To set up the 4 KB MTU, complete the following steps:
1. Before you run the mkiba command, you must have correctly set up your subnet managers for 4 KB MTU. For host-based subnet managers, see “Installing the fabric management server” on page 91. For embedded subnet managers, see “Installing and configuring vendor InfiniBand switches” on page 116.

2. If you had previously defined the HCA devices, remove them by using the following command.
for i in `lsdev | grep Infiniband | awk '{print $1}'`
do
rmdev -l $i -d
done

Note: This command removes all the HCA devices. To remove a specific device (such as ib0), use the rmdev -l ibX -d command, where X = the HCA device number.

3. Run the cfgmgr command.


4. Run the mkdev command for the icm.
5. Run the mkiba command for the devices.
6. After the HCA device driver is installed and the mkiba command is done, run the following commands to set the device MTU to 4 KB and enable superpackets.
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
chdev -l $i -a superpacket=on -a tcp_recvspace=524288 -a tcp_sendspace=524288 -a srq_size=16000 -a state=up
done

Note: This command modifies all the HCA devices. To modify a specific device (such as ib0), use a command like the following example.
chdev -l ib0 -a superpacket=on -a tcp_recvspace=524288 -a tcp_sendspace=524288 -a srq_size=16000 -a state=up

7. Continue with “Verifying the 4 KB MTU configuration.”

Verifying the 4 KB MTU configuration:

Verify the configuration by performing the following steps:
1. Verify that the device is set so that superpackets are on by running the following command:
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
echo $i
lsattr -El $i | egrep " super"
done

Note: To verify a single device (such as ib0), use the lsattr -El ib0 | egrep "mtu|super" command. The MTU returned should be 65532.

2. Check the interfaces for the HCA devices (ibx) and ml0 by running the following command:
netstat -in | grep -v link | awk '{print $1,$2}'

The results look like the following example, where the MTU value is in the second column.
Name Mtu
en2 1500
ib0 65532
ib1 65532
ib2 65532
ib3 65532
ib4* 65532
ib5 65532
ib6 65532
ib7 65532
ml0 65532
lo0 16896
lo0 16896

3. Select from the following options:
• If you are running on a host-based subnet manager, continue with step 4.
• If you are running on an embedded subnet manager, continue with step 5 on page 204.

4. If you are running a host-based subnet manager, to check multicast group creation, run the following command on the fabric management server.
/sbin/saquery -o mcmember -h [HCA] -p [HCA port]

Each interface produces an entry like the following example. Note the 4 KB MTU and 20g rate.
GID: 0xff12601bffff0000:0x0000000000000016
PortGid: 0xfe80000000000000:0002550070010f00
MLID: 0xc004 PKey: 0xffff Mtu: 4096 Rate: 20g PktLifeTime: 2147 ms
QKey: 0x00000000 SL: 0 FlowLabel: 0x00000 HopLimit: 0xff TClass: 0x00


Note: You can check for misconfigured interfaces by using a command like the following example, which looks for any MTU that is not 4096 or any rate that is 10g:
/sbin/saquery -o mcmember -h [HCA] -p [port] | egrep -B 3 -A 1 'Mtu: [0-3]|Rate: 10g'

5. If you are running an embedded subnet manager, to check multicast group creation, run the smShowGroups command on each switch with a master subnet manager. If you have set it up, you might use the dsh command from the CSM/MS to the switches. For details, see “Setting up remote command processing” on page 103. If you use the dsh command, do not forget the parameter, --devicetype IBSwitch::Qlogic, as outlined in “Remotely accessing QLogic switches from CSM/MS” on page 144.
There should be just one group with all the HCA devices on the subnet being part of the group. Note that MTU = 5 indicates 4 KB and MTU = 4 indicates 2 KB. The following shows an example of the command output.
0xff12401bffff0000:00000000ffffffff (c000)
qKey = 0x00000000 pKey = 0xFFFF mtu = 5 rate = 3 life = 19 sl = 0
0x00025500101a3300 F 0x00025500101a3100 F 0x00025500101a8300 F
0x00025500101a8100 F 0x00025500101a6300 F 0x00025500101a6100 F
0x0002550010194000 F 0x0002550010193e00 F 0x00066a00facade01 F
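
Two of the checks in this verification procedure lend themselves to small helpers. The following sketch is illustrative only, and both function names are hypothetical. The first helper reads the "Name Mtu" pairs produced by the netstat command in step 2 and reports ib and ml interfaces that are not at 65532. The second translates the MTU code shown by smShowGroups into bytes. This document states only that 5 means 4 KB and 4 means 2 KB; codes 1 through 3 are taken from the standard InfiniBand MTU encoding and are an addition here.

```sh
# bad_ib_mtu
# Reads "Name Mtu" pairs on standard input and prints each ib or ml
# interface whose MTU is not 65532. No output means all are correct.
bad_ib_mtu() {
    awk '$1 ~ /^(ib|ml)[0-9]/ && $2 != 65532 { print $1, $2 }'
}

# mtu_code_to_bytes CODE
# Translates an InfiniBand MTU code (as shown in the "mtu =" field of
# smShowGroups output) to a size in bytes.
mtu_code_to_bytes() {
    case "$1" in
        1) echo 256 ;;
        2) echo 512 ;;
        3) echo 1024 ;;
        4) echo 2048 ;;
        5) echo 4096 ;;
        *) echo "unknown MTU code: $1" >&2; return 1 ;;
    esac
}
```

For example, the netstat pipeline from step 2 can be piped into bad_ib_mtu, and mtu_code_to_bytes 5 prints 4096.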

Recovering to 4 KB MTUs in the Linux operating system
Use this procedure if your cluster should be running with 4 KB maximum transfer units (MTUs), but it has already been installed and is not currently running at 4 KB MTU. This is only valid for clusters using the Linux operating system.

To complete the recovery to 4 KB MTU, the following overall tasks must be completed.
1. Configure the subnet manager to 4 KB MTU
2. Set the host channel adapters (HCAs) to 4 KB MTU
3. Verify that the subnet is set up correctly

To recover to 4 KB MTU, perform the following procedure:

Note: These instructions are written for recovering a single fabric management server subnet at a time.

Configuring the subnet manager for 4 KB MTU:

To configure the subnet manager for 4 KB MTU, perform the following steps:
1. Select from the following options:
   • If you are using a host-based subnet manager, continue with step 2.
   • If you are using an embedded subnet manager, continue with step 9 on page 205.

2. If you are using a host-based subnet manager, log on to the fabric management server.
3. Stop the subnet manager by using the /etc/init.d/iview_fm stop command.
4. Verify that the subnet manager is stopped by running the ps -ef | grep iview command.
5. Edit the Fabric Manager configuration file (/etc/sysconfig/iview_fm.config) and, as needed, update the lines defining SM_X_def_mc_mtu to 0x5, where X is the subnet manager number on this fabric management server. Update all subnet manager instances that are to be configured for 4 KB MTU. The following example has four subnet managers in the configuration file. These lines would not be contiguous in an actual configuration file.

    SM_0_def_mc_mtu=0x5
    SM_1_def_mc_mtu=0x5
    SM_2_def_mc_mtu=0x5
    SM_3_def_mc_mtu=0x5

6. Ensure that the rate matches what was planned in “Planning for maximum transfer units (MTUs)” on page 34, where 0x3 = SDR and 0x6 = DDR. The following example shows this configuration.

    SM_0_def_mc_rate=0x3 or 0x6
    SM_1_def_mc_rate=0x3 or 0x6
    SM_2_def_mc_rate=0x3 or 0x6
    SM_3_def_mc_rate=0x3 or 0x6

7. Start the subnet manager by using the /etc/init.d/iview_fm start command.
8. Continue with “Setting up the host channel adapters (HCAs) to 4 KB MTU.”
9. If you are using an embedded subnet manager, log on to the switch command-line interface (CLI), or issue these commands from the fabric management server by using cmdall, or from the Cluster Systems Management/Management Server (CSM/MS) by using the dsh command. If you use the dsh command, remember the parameter, --devicetype IBSwitch::Qlogic, as outlined in “Remotely accessing QLogic switches from CSM/MS” on page 144.
10. Stop the subnet manager by using the smControl stop command.
11. Set up the broadcast or multicast group MTU by using the smDefBcGroup 0xffff 5 command.
12. Enable the broadcast or multicast group by using the smDefBcGroup enable command.
13. Start the subnet manager by using the smControl start command.
14. Continue with “Setting up the host channel adapters (HCAs) to 4 KB MTU.”
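
For a host-based subnet manager, the edit in step 5 can be checked before the subnet manager is restarted. The following is a minimal sketch, not part of the Fabric Manager tooling: the function name check_mc_mtu is hypothetical, and it assumes that the configuration file uses plain NAME=value lines as in the example in step 5.

```shell
# check_mc_mtu FILE
# Print every SM_X_def_mc_mtu line in FILE whose value is not 0x5 (4 KB MTU).
# Returns 0 when all instances found are set to 0x5, and 1 otherwise.
# Assumption: the file holds plain NAME=value lines, as in the manual's example.
check_mc_mtu() {
    awk -F= '
        /^[ \t]*SM_[0-9]+_def_mc_mtu[ \t]*=/ {
            name = $1
            value = $2
            gsub(/[ \t\r]/, "", name)
            gsub(/[ \t\r]/, "", value)
            if (tolower(value) != "0x5") {
                print name "=" value
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Possible use on the fabric management server:
#   check_mc_mtu /etc/sysconfig/iview_fm.config || echo "fix the lines listed above"
```

The function only reads the file. It does not check the SM_X_def_mc_rate lines, which depend on the rate planned for your cluster.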

Setting up the host channel adapters (HCAs) to 4 KB MTU:

If your server is running the Linux operating system, you must perform this procedure to correctly set up for 4 KB MTU. To determine if you should be using 4 KB MTU, see “Planning for maximum transfer units (MTUs)” on page 34 and the “QLogic switch planning work sheets” on page 66.

To set up the 4 KB MTU, complete the following steps:
1. Before you run the mkiba command, you must have correctly set up your subnet managers for 4 KB MTU. For host-based subnet managers, see “Installing the fabric management server” on page 91. For embedded subnet managers, see “Installing and configuring vendor InfiniBand switches” on page 116.

2. Set up the /etc/sysconfig/network/ifcfg-ibX configuration files for each ib interface so that MTU='4096'. A server with two ib interfaces (ib0 and ib1) could have files like the following example.

    [root on c697f1sq01][/etc/sysconfig/network] => cat ifcfg-ib0
    BOOTPROTO='static'
    BROADCAST='10.0.1.255'
    IPADDR='10.0.1.1'
    MTU='4096'
    NETMASK='255.255.255.0'
    NETWORK='10.0.1.0'
    REMOTE_IPADDR=''
    STARTMODE='onboot'

    [root on c697f1sq01][/etc/sysconfig/network] => cat ifcfg-ib1
    BOOTPROTO='static'
    BROADCAST='10.0.2.255'
    IPADDR='10.0.2.1'
    MTU='4096'
    NETMASK='255.255.255.0'
    NETWORK='10.0.2.0'
    REMOTE_IPADDR=''
    STARTMODE='onboot'

3. Restart the server.
4. Continue with “Verifying the 4 KB MTU configuration.”
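
Before restarting the server, you might confirm that every ifcfg-ibX file carries the intended MTU. The following is a minimal sketch under stated assumptions: the function name check_ifcfg_mtu is hypothetical, and the files are assumed to use the MTU='4096' form shown in step 2.

```shell
# check_ifcfg_mtu DIR
# Report each ifcfg-ib* file in DIR whose MTU setting is not 4096.
# Returns 0 when every file found is set to 4096, and 1 otherwise.
check_ifcfg_mtu() {
    dir=$1
    status=0
    for f in "$dir"/ifcfg-ib*; do
        [ -f "$f" ] || continue
        # Use the last MTU= line in the file and keep only its digits.
        mtu=$(awk -F= '/^MTU=/ { v = $2 } END { gsub(/[^0-9]/, "", v); print v }' "$f")
        if [ "$mtu" != "4096" ]; then
            echo "$(basename "$f"): MTU=${mtu:-unset}"
            status=1
        fi
    done
    return $status
}

# Possible use:
#   check_ifcfg_mtu /etc/sysconfig/network
```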

Verifying the 4 KB MTU configuration:

Verify the configuration by performing the following steps:
1. Verify that the device is set so that superpackets are on by running the following command:

    for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
    do
    echo $i
    lsattr -El $i | egrep " super"
    done

   Note: To verify a single device (such as ib0), use the lsattr -El ib0 | egrep "mtu|super" command. The MTU returned should be 65532.

2. Check the interfaces for the HCA devices (ibX) and ml0 by running the following command:

    netstat -in | grep -v link | awk '{print $1,$2}'

   The results look like the following example, where the MTU value is in the second column.

    Name Mtu
    en2 1500
    ib0 65532
    ib1 65532
    ib2 65532
    ib3 65532
    ib4* 65532
    ib5 65532
    ib6 65532
    ib7 65532
    ml0 65532
    lo0 16896
    lo0 16896

3. Select from the following options:
   • If you are running on a host-based subnet manager, continue with step 4.
   • If you are running on an embedded subnet manager, continue with step 5.

4. If you are running a host-based subnet manager, to check multicast group creation, run the following command on the fabric management server.

    /sbin/saquery -o mcmember -h [HCA] -p [HCA port]

   Each interface produces an entry like the following example. Note the 4 KB MTU and 20g rate.

    GID: 0xff12601bffff0000:0x0000000000000016
    PortGid: 0xfe80000000000000:0002550070010f00
    MLID: 0xc004 PKey: 0xffff Mtu: 4096 Rate: 20g PktLifeTime: 2147 ms
    QKey: 0x00000000 SL: 0 FlowLabel: 0x00000 HopLimit: 0xff TClass: 0x00

   Note: You can check for misconfigured interfaces by using a command like the following example, which looks for any MTU that is not 4096 or any rate that is 10g:

    /sbin/saquery -o mcmember -h [HCA] -p [port] | egrep -B 3 -A 1 'Mtu: [0-3]|Rate: 10g'

5. If you are running an embedded subnet manager, to check multicast group creation, run the smShowGroups command on each switch with a master subnet manager. If you have set it up, you might use the dsh command from the CSM/MS to the switches. For details, see “Setting up remote command processing” on page 103. If you use the dsh command, do not forget the parameter, --devicetype IBSwitch::Qlogic, as outlined in “Remotely accessing QLogic switches from CSM/MS” on page 144.

   There should be just one group, with all the HCA devices on the subnet being part of the group. Note that MTU = 5 indicates 4 KB and MTU = 4 indicates 2 KB. The following shows an example of the command output.

    0xff12401bffff0000:00000000ffffffff (c000)
    qKey = 0x00000000 pKey = 0xFFFF mtu = 5 rate = 3 life = 19 sl = 0
    0x00025500101a3300 F 0x00025500101a3100 F 0x00025500101a8300 F
    0x00025500101a8100 F 0x00025500101a6300 F 0x00025500101a6100 F
    0x0002550010194000 F 0x0002550010193e00 F 0x00066a00facade01 F
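
The mtu and rate values in the smShowGroups output are encoded. As a reading aid, the following sketch decodes them from saved output. It assumes the line layout of the example above. The function name decode_mc_group is hypothetical. The MTU decoding uses the InfiniBand encoding in which code N stands for 128 * 2^N bytes, which agrees with 5 = 4 KB and 4 = 2 KB. The rate labels follow the 0x3 = SDR and 0x6 = DDR convention used for the subnet manager configuration.

```shell
# decode_mc_group
# Read smShowGroups-style text on stdin. For each line that contains
# "mtu = N rate = M", print the MTU in bytes and a label for the rate code.
decode_mc_group() {
    awk '
        /mtu = [0-9]+/ {
            mtu = 0
            rate = 0
            for (i = 1; i <= NF - 2; i++) {
                if ($i == "mtu")  mtu  = $(i + 2) + 0
                if ($i == "rate") rate = $(i + 2) + 0
            }
            bytes = 128
            for (n = 0; n < mtu; n++) bytes = bytes * 2
            if (rate == 3)      label = "SDR"
            else if (rate == 6) label = "DDR"
            else                label = "code " rate
            print "mtu=" bytes " rate=" label
        }
    '
}

# Possible use with output saved from a switch (the file name is only an example):
#   decode_mc_group < smShowGroups.out
```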


Reestablishing a health check baseline

After changing the fabric configuration, use this procedure to reestablish a health check baseline.

The following activities are examples of ways in which the fabric configuration might be changed.
• Repairing a faulty leaf card, which leads to a new serial number for that component.
• Updating switch firmware or the subnet manager.
• Changing time zones in a switch.
• Adding a device or link to a fabric, or deleting one from it.
• Having a link fail so that its devices are removed from the subnet manager database.

To reestablish the health check baseline, complete the following steps.
1. Ensure that you have fixed all problems with the fabric, including inadvertent configuration changes, before proceeding.
2. Save the original baseline. This baseline might be required for future debugging. The original baseline is a group of files in the /var/opt/iba/analysis/baseline directory.
3. Run the all_analysis -b command.
4. Check the new output files in the /var/opt/iba/analysis/baseline directory to verify that the configuration is as you expect it. See the Fast Fabric Toolset Users Guide for more details.
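
Step 2 does not prescribe how to save the original baseline. One possible approach, shown here only as a sketch, is to copy the baseline files to a time-stamped sibling directory before you run all_analysis -b. The function name save_baseline and the .saved.<timestamp> suffix are assumptions, not part of the Fast Fabric Toolset.

```shell
# save_baseline DIR
# Copy DIR to DIR.saved.<timestamp>, preserving file attributes, and print
# the path of the copy.
save_baseline() {
    src=${1%/}
    dest="$src.saved.$(date +%Y%m%d%H%M%S)"
    cp -Rp "$src" "$dest" && echo "$dest"
}

# Possible use on the fabric management server:
#   save_baseline /var/opt/iba/analysis/baseline && all_analysis -b
```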

Verifying link FRU replacements

Use this procedure to verify link field replaceable unit (FRU) replacements.

Before you perform this procedure, ensure you have recorded the light emitting diode (LED) states.

Note: Proceed only if you have replaced a link FRU.
1. Check the LEDs at each end of the cable.
2. If the LEDs are not lit, the problem is not fixed. Return to the fault isolation procedure that sent you here. Otherwise, proceed to the next step.
3. If the LEDs were not lit before you replaced the cable and they are now lit, the problem is fixed. Return to the fault isolation procedure that sent you here. Otherwise, proceed to the next step.
4. Log on to the fabric management server, or have the customer log on and perform the remaining steps.
5. Run the /sbin/iba_report -o errors -C command to check and clear the error counters. Wait several minutes to allow new errors to accumulate.
6. Run the /sbin/iba_report -o errors command again.
7. If the link reports errors, the problem is not fixed. Otherwise, the problem is fixed.
8. Return to the fault isolation procedure that sent you here.

Verifying repairs and configuration changes

Use this procedure to verify repairs and configuration changes that have taken place with your cluster.

After a repair or configuration change has been made, it is good practice to verify that the repair has fixed the problem and that no other problems have been introduced as a result of the repair, and that a configuration change has not resulted in any problems or inadvertent configuration changes.

It is important to understand that configuration changes include any change that results in different connectivity, fabric management code levels, part numbers, serial numbers, and any other such changes. See the Fast Fabric Toolset Users Guide for the types of configuration information that are checked during health checking.

To verify repairs and configuration changes, complete the following procedure.


Note: As devices come online, Notices from the subnet managers are shown. You can count the number of devices and ensure that the count corresponds to the appropriate number of devices that you have in the IBM systems that have been restarted. For more information, see “Counting devices” on page 210. Keep in mind that the devices might come up over several scans of the fabric by the subnet manager or managers, so you might have to add up the appearance counts over several log entries. However, the health checking procedure you perform in step 3 checks for any missing devices.
1. Check the light emitting diodes (LEDs) on the device ports and any port connected to the device. See the System Users Guide and the Switch Users Guide for information about LED states. If a problem is found, see “Symptoms of problems” on page 154.
2. Run the /sbin/iba_report -C -o none command to clear error counters on the fabric ports before doing a health check on the current state. Otherwise, you pick up errors caused by the restart.

3. If possible, wait approximately 10 minutes before you run a health check to look for errors and compare against the baseline configuration. The wait period is to allow for error accumulation. Otherwise, run the health check now to check for configuration changes, which include any nodes that have fallen off the switch.
   a. Run the all_analysis command. For more information, see “Health checking” on page 135 and the Fast Fabric Toolset Users Guide.
   b. Look for configuration changes and fix any that you find. For more information, see “Finding and interpreting configuration changes” on page 147. You might see new part numbers, serial numbers, and GUIDs for repaired devices. Fan trays do not have electronic VPD, and thus do not indicate these types of changes in configuration.
   c. Look for errors and fix any that you find. For more information, see “Symptoms of problems” on page 154.

4. If you did not wait 10 minutes before running the health check, you must rerun it after about 10 minutes to check for errors.
   a. Run the all_analysis command, or the all_analysis -e command. For more information, see “Health checking” on page 135 and the Fast Fabric Toolset Users Guide.
   b. Look for errors and fix any that you find. For more information, see “Symptoms of problems” on page 154.
   c. If you did not use the -e parameter, look for configuration changes and fix any unexpected ones that you find. For more information, see “Finding and interpreting configuration changes” on page 147. Expected configuration changes are those changes that relate to repaired devices or intended configuration changes.
5. If any problems are found, fix them and perform this procedure again. Continue to perform this procedure until it completes without problems: either the repair is confirmed, or the configuration change is shown to have caused no unexpected configuration changes.
6. If there were expected configuration changes, perform the procedure in “Reestablishing a health check baseline” on page 207.
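
The sequence in steps 2 and 3 (clear the error counters, wait about 10 minutes, run the health check) can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name post_change_check and the three variables are hypothetical, all_analysis is assumed to be on the PATH of the fabric management server, and the commands and parameters are the ones named in the steps above. Reviewing the health check results remains a manual task.

```shell
# Commands and wait time can be overridden from the environment.
IBA_REPORT=${IBA_REPORT:-/sbin/iba_report}
ALL_ANALYSIS=${ALL_ANALYSIS:-all_analysis}
SETTLE_SECONDS=${SETTLE_SECONDS:-600}   # about 10 minutes

# post_change_check
# Clear the error counters, wait for new errors to accumulate, and then run
# the health check. Returns the exit status of the health check command.
post_change_check() {
    "$IBA_REPORT" -C -o none || return 1
    sleep "$SETTLE_SECONDS"
    "$ALL_ANALYSIS"
}

# Possible use:
#   post_change_check
```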

Restarting the cluster

Use this procedure if you have performed maintenance that requires a restart of the entire cluster.

If you are performing maintenance that requires you to restart an entire cluster, the following items must be considered.
1. Ensure that you have a baseline health check that can be used to check against when the cluster is operational again.
2. Consider disabling the subnet managers before proceeding with the restarts. This action prevents new log entries caused by the restart process. While it also suppresses real problems, those problems are uncovered in the subsequent health check in step 7 on page 209.
3. Restart the cluster, but make sure the logical partitions stop at logical partition standby mode.
4. When the IBM systems are at logical partition standby mode, restart the subnet managers.


   a. As devices come back online, Notices from the subnet managers are shown. You can count the number of devices and make sure that the count corresponds to the appropriate number of devices that you have in the IBM systems that have been restarted. For more information, see “Counting devices” on page 210. Keep in mind that the devices might come up over several scans of the fabric by the subnet managers, so you might have to add up the appearance counts over several log entries. However, the health check that you perform in step 7 checks for any missing devices.

5. Run the /sbin/iba_report -C -o none command to clear error counters on the fabric ports before doing a health check on the current state. Otherwise, you pick up errors caused by the restart.

6. Continue to restart the IBM systems through the operating system load.
7. If possible, wait approximately 10 minutes before you run a health check to look for errors and compare against the baseline configuration. The wait period allows for error accumulation. Otherwise, run the health check now to check for configuration changes, which include any nodes that have fallen off the switch.
   a. Run the all_analysis command. For more information, see “Health checking” on page 135 and the Fast Fabric Toolset Users Guide.
   b. Look for configuration changes and fix any that you find. For more information, see “Finding and interpreting configuration changes” on page 147.
   c. Look for errors and fix any that you find. For more information, see “Symptoms of problems” on page 154.
8. If you did not wait 10 minutes before running the health check, you must rerun it after about 10 minutes to check for errors.
   a. Run the all_analysis command, or the all_analysis -e command. For more information, see “Health checking” on page 135 and the Fast Fabric Toolset Users Guide.
   b. Look for errors and fix any that you find. For more information, see “Symptoms of problems” on page 154.
   c. If you did not use the -e parameter, look for configuration changes and fix any that you find. For more information, see “Finding and interpreting configuration changes” on page 147.

Restarting or powering off an IBM system

If you are restarting or powering off an IBM system for maintenance or repair, use this procedure to minimize impacts on the fabric, and to verify that the host channel adapters (HCAs) on the system have rejoined the fabric.

To restart or power off an IBM system for maintenance or repair, complete the following procedure.
1. Restart the IBM system. Errors are logged for the HCA links going down and for the logical switches and logical HCAs disappearing.
   a. You can ensure that the number of devices disappearing corresponds to the number of HCAs that you have in the IBM systems that have been restarted. For more information, see “Counting devices” on page 210. In this way, you ensure that nothing disappeared from the fabric that was not in the restarted IBM system or connected to the IBM system. If you do not check this at this time, the health check completed later in this procedure checks for any missing devices, but detection of the problem is delayed until after the IBM system has restarted.
   b. If devices disappear that are not in the IBM systems or are not connected to the IBM systems, see “Symptoms of problems” on page 154.

2. Wait for the IBM system to restart through the operating system load.
   a. As devices come back online, Notices from the subnet managers are shown. You can count the number of devices and make sure that the count corresponds to the appropriate number of devices that you have in the IBM systems that have been restarted. For more information, see “Counting devices” on page 210. Keep in mind that the devices might come up over several scans of the fabric by the subnet managers, so you might have to add up the appearance counts over several log entries. However, the health check that you perform in step 4 checks for any missing devices.

3. Run the /sbin/iba_report -C -o none command to clear error counters on the fabric ports before doing a health check on the current state. Otherwise, you pick up errors caused by the restart.

4. If possible, wait about 10 minutes before you run a health check to look for errors and compare against the baseline configuration. The wait period is to allow for error accumulation. Otherwise, run the health check now to check for configuration changes, which include any nodes that have fallen off the switch.
   a. Run the all_analysis command. For more information, see “Health checking” on page 135 and the Fast Fabric Toolset Users Guide.
   b. Look for configuration changes and fix any that you find. For more information, see “Finding and interpreting configuration changes” on page 147.
   c. Look for errors and fix any that you find. For more information, see “Symptoms of problems” on page 154.
5. If you did not wait 10 minutes before running the health check, you can rerun it after approximately 10 minutes to check for errors.
   a. Run the all_analysis command, or the all_analysis -e command. For more information, see “Health checking” on page 135 and the Fast Fabric Toolset Users Guide.
   b. Look for errors and fix any that you find. For more information, see “Symptoms of problems” on page 154.
   c. If you did not use the -e parameter, look for configuration changes and fix any that you find. For more information, see “Finding and interpreting configuration changes” on page 147.
6. If you repaired an HCA, the latest health check identifies that you have a new GUID in the fabric. Perform the procedure in “Reestablishing a health check baseline” on page 207, but only after you have run a health check against the old baseline to ensure that the repair action resulted in no inadvertent configuration changes, such as a swapping of cables.

Counting devices

When faults or user actions cause devices to appear and disappear from the fabric, you can use this information to count the devices that you expect to be part of your fabric.

Subnet managers in the industry tend to report resources at a low level.

The virtualization capabilities of the IBM GX host channel adapters (HCAs) complicate the counting of devices because of how logical devices are interpreted by the subnet manager.

The following resources are reported by the subnet manager when they appear or disappear. Even if the exact resource is not always given, there is a count given.
• Switches
• Host channel adapters (HCAs)
• End ports
• Ports
• Subnet managers

Note: The count of the number of resources is given by an individual subnet manager. If there are multiple subnets, you must add up the results from each master subnet manager on each subnet.

Counting switches:

Use this procedure to count the number of switches on your fabric.


Physical switches come in two varieties.
• 24-port base switch
• Director level switches with spines and leaf cards. These switches are composed of 48 ports or more.

With the IBM GX host channel adapter (HCA), you get one logical switch for each physical port. This switch is what connects the logical HCAs to the physical ports, which yields the capability to virtualize the HCA.

A physical switch is constructed using one or more switch chips. A switch chip has 24 ports used to construct the fabric. A base 24-port switch needs only one switch chip to yield 24 ports.

The director level switches (such as the 9120) use cascading switch chips that are interconnected to yield a larger number of ports supported by a given chassis. This design introduces leaf cards, which hold the cable connectors, and spines, which interconnect the leaf cards, so that data can flow in any cable port of the switch and out of any other cable port. The key is to remember that there are 24 ports on a switch chip.

Each spine has two switch chips. To maintain cross-sectional bandwidth performance, you want a spine port for each cable port. So, a single spine can support up to 48 ports. The standard sizes are 48, 96, 144, and 288 port switches. These sizes require 1, 2, 3, and 6 spines, respectively.

A leaf card has a single switch chip. A standard leaf card has (12) 4X cable connectors. The number of required leaf cards is calculated by dividing the number of cables by 12. After using 12 switch chip ports for cable connections, there are 12 left over for connecting to spine chips.

With one spine, there are two switch chips, yielding 48 ports on the spine. With 12 ports per leaf card, that means a spine can support four leaf cards. You can see that this requires half a spine switch chip per leaf card.

Table 83. Counting switch chips in a fabric

Number of ports   Number of leaf cards   Number of spines   Switch chips
48                4                      1                  4*1 + 2*1 = 6
96                8                      2                  8*1 + 2*2 = 12
144               12                     3                  12*1 + 2*3 = 18
288               24                     6                  24*1 + 2*6 = 36
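
The arithmetic behind Table 83 can be sketched as a small shell function. The function name switch_chips is hypothetical. The divisors come from the text above: 12 cable connectors and one switch chip per leaf card, and 48 cable ports and two switch chips per spine. The sketch assumes a port count that is a multiple of 48, as in the standard sizes.

```shell
# switch_chips PORTS
# Print the number of switch chips in a director level switch that has
# PORTS cable ports.
switch_chips() {
    ports=$1
    leaf_cards=$((ports / 12))   # 12 cable connectors per leaf card, 1 chip each
    spines=$((ports / 48))       # each spine serves 48 cable ports, 2 chips each
    echo $((leaf_cards * 1 + spines * 2))
}

switch_chips 144   # prints 18 (12 leaf chips + 6 spine chips)
```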

Counting logical switches:

Use this information to count the number of logical switches on your fabric.

The number of logical switches is equal to the number of IBM GX or GX+ host channel adapter (HCA) ports. The logical switch is the virtualization device on the GX or GX+ HCA. For more information, see “IBM GX or GX+ host channel adapters” on page 9.

Counting host channel adapters:

Use this information to count the number of host channel adapters (HCAs) on the fabric. The number of HCAs depends on the type of HCAs used.

There is one HCA per physical PCI HCA card. Do not forget the HCAs used in the InfiniBand Management nodes.

The number of HCAs per IBM GX or GX+ HCA depends on the number of logical partitions defined. There is one logical HCA for each logical partition defined to use the HCA. For more information, see “IBM GX or GX+ host channel adapters” on page 9.

Counting end ports:

Use this information to count the number of end ports on the fabric. The end ports depend on the type of host channel adapters (HCAs) used and the number of cables that are connected.

The number of end ports for PCI HCAs is equal to the number of connected cable connectors on the PCI HCAs.

The IBM GX or GX+ HCA has two ports connected to logical HCAs.

Counting ports:

Use this information to count the number of ports on the fabric.

The total number of ports is composed of all the ports from all the devices in the fabric. In addition, each switch chip and HCA device has a port that is used for management of the device. This port is not to be confused with a management port on a switch that connects to a cluster virtual local area network (VLAN).

Table 84. Counting Fabric Ports

Device                  Number of ports
Spine switch chip       25 = 24 for fabric + 1 for management
Leaf card switch chip   13 + (number of connected cables) = 12 connected to spines + 1 for management + (number of connected cables)
24-port switch chip     1 + (number of connected cables) = 1 for management + (number of connected cables)
PCI HCAs                Number of connected cables
Logical switch          1 + 1 + (number of logical partitions) = 1 physical port + 1 for management + 1 for each logical partition that uses this HCA
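
The per-device formulas in Table 84 can be written as small helpers. This is a sketch only, and the function names are hypothetical. Each function takes the variable part of the formula (connected cables or logical partitions) as its argument.

```shell
# Ports reported for one device of each type, following Table 84.
spine_chip_ports()     { echo 25; }               # 24 for fabric + 1 for management
leaf_chip_ports()      { echo $((13 + $1)); }     # $1 = connected cables on this leaf card
base_switch_ports()    { echo $((1 + $1)); }      # $1 = connected cables on a 24-port switch
pci_hca_ports()        { echo "$1"; }             # $1 = connected cables on this PCI HCA
logical_switch_ports() { echo $((1 + 1 + $1)); }  # $1 = logical partitions that use this HCA

# A 24-port switch with 9 connected cables:
base_switch_ports 9      # prints 10
# A logical switch on an HCA that is used by one logical partition:
logical_switch_ports 1   # prints 3
```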

Counting subnet managers:

Use this information to count the number of subnet managers on the fabric.

The number of subnet managers is equal to one master plus the number of standbys on the subnet.

Example: Counting devices:

This example shows how the number of devices on a fabric is calculated.

For this example, the configuration for the subnet is shown in the following table.

Table 85. Example configuration

Quantity   Devices                               Connectivity
1          9024 switch                           5 HCA connections + 4 connections to the 9120
1          9120 switch                           5 HCA connections + 4 connections to the 9024
3          9125-F2A                              (1) IBM GX HCAs per node
3          IBM GX host channel adapters (HCAs)   1 connection to 9024; 1 connection to 9120
2          InfiniBand Management Hosts           (1) 2-port PCI HCA per host
2          PCI HCAs                              1 connection to 9024; 1 connection to 9120

The resulting report from the master subnet manager is shown in the following table.

    DETAIL:25 SWs, 5 HCAs, 10 end ports, 353 total ports, 4 SM(s)

Table 86. Report from the master subnet manager

Resource          Count   Calculation
Switches          25      (1) per 9024 + 12 leaf chips per 9120 + 2 chips * 3 spines per 9120 + 2 logical switches per HCA * 3 GX HCAs = 1 + 12 + 6 + 6 = 25
HCAs              5       (2) PCI HCAs + (3) IBM GX HCAs
End ports         10      5 HCAs * 2
Ports             353     See the example ports calculation that follows
Subnet managers   4       (1) Master + (3) Standbys

The following table illustrates how the number of ports was calculated.

Table 87. Number of ports calculation

Device             Ports   Calculation
9024               10      (3) connections to GX HCAs + (2) connections to PCI HCAs + (4) switch to switch connections + (1) management port
9120 spines        150     25 ports * 3 spines * 2 switch chips per spine
9120 leaf chips    165     (13 ports * 12 leaf chips) + (3) connections to GX HCAs + (2) connections to PCI HCAs + (4) switch to switch connections
Logical switches   18      3 ports * 6 logical switches
Logical HCAs       6       2 ports * 3 logical HCAs
PCI HCAs           4       2 ports * 2 HCAs

Total Port Count = 10 + 150 + 165 + 18 + 6 + 4 = 353
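
The example totals can be reproduced with shell arithmetic. This is a sketch of the calculation only. The variable names are hypothetical, and the value lpars_per_gx=1 is an assumption inferred from the example (three ports per logical switch and one logical HCA per GX HCA), not a value stated in the configuration table.

```shell
# Values from the example configuration.
leaf_chips=12     # the 9120 has 12 leaf cards with 1 switch chip each
spines=3          # the 9120 has 3 spines with 2 switch chips each
gx_hcas=3         # IBM GX HCAs, each with 2 ports (2 logical switches)
pci_hcas=2        # 2-port PCI HCAs
lpars_per_gx=1    # assumed: one logical partition (logical HCA) per GX HCA

# Switches: the 9024 + leaf chips + spine chips + logical switches.
switches=$((1 + leaf_chips + 2 * spines + 2 * gx_hcas))

ports_9024=$((3 + 2 + 4 + 1))                  # GX + PCI + switch links + management
ports_spines=$((25 * spines * 2))              # 25 ports per spine switch chip
ports_leaves=$((13 * leaf_chips + 3 + 2 + 4))  # 13 per leaf chip + connected cables
ports_lsw=$(( (1 + 1 + lpars_per_gx) * 2 * gx_hcas ))
ports_lhca=$((2 * gx_hcas * lpars_per_gx))
ports_pci=$((2 * pci_hcas))

total_ports=$((ports_9024 + ports_spines + ports_leaves + ports_lsw + ports_lhca + ports_pci))

echo "$switches SWs, $((pci_hcas + gx_hcas * lpars_per_gx)) HCAs, $total_ports total ports"
# prints: 25 SWs, 5 HCAs, 353 total ports
```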

Handling emergency power off situations

This information provides guidelines for setting up a procedure to handle power off situations.

Emergency Power-off (EPO) situations are typically rare events. However, some sites do experience more power issues than others for various reasons, including power grid considerations. It is good practice for each site to develop an EPO procedure.


This procedure is a sample procedure that can be used with QLogic switches, assuming that you have an issue with the external 480 V ac power to the servers. Details on how to complete each step have been omitted. To finalize the procedure, you need the vendor switch User Manual, the Fabric Management Users Guide, and the server service information. Example commands are shown, but they should be verified with the latest User Manuals and service information.

If there is a compelling reason for completing a certain step, or for doing a step at a certain time in theprocedure, a note follows with the reason why.1. To reduce the number of events in the logs for the resulting link downs, shut down the subnet

managers.

Note: Excessive log entries can mask real problems later and also cause problems with extensivedebugging by upper levels of support.

2. Perform an EPO of the IBM systems running on external 480 V ac power. Depending on the nature ofthe EPO, you can leave the switches up (if adequate cooling and power can be supplied to them).a. If you cannot leave the switches running, and you have stopped the embedded subnet managers,

you can shut down the switches at any time. You can either power off at a circuit-breaker orremove all the switch power cables, because they have no physical or virtual power switches.

b. If you have to power off the fabric management servers and you can do it before the IBM systemsand the vendor switches, that would eliminate the need to shut down subnet managers.

Note: Consider the implications of excessive logging if you leave subnet managers running whileshutting down devices on the fabric.

3. Press the 480 V ac external power wall EPO switch.4. Once the situation is resolved, restore wall power and repower the servers.5. After the servers are operational, check the LEDs for indications of problems on the servers and

switches and switch ports.6. Start the subnet managers. If you had powered off the fabric management server running subnet

managers, and the subnet managers were configured to auto-start, all you need to do is start thefabric management server after you start the other servers. If the switches have embedded subnetmanagers configured for auto-start, then the subnet managers restart when the switches come backonline.

7. Run a health check against the baseline to see whether anything is missing or otherwise changed.

8. Reset the link error counters. EPO activity can cause link error counters to advance because an EPO can occur at any time, even while applications are passing data on the fabric.

a. On the fabric management server running the Fast Fabric Toolset, run the iba_report -C -o none command. If you have more subnets than can be managed from one fabric management server, you need to run this command from all the master fabric management servers.
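When several master fabric management servers are involved, the counter reset in the last step has to be repeated on each one. The following is a minimal sketch of one way to prepare those invocations, assuming passwordless SSH access to each server. The host names, the `root` user, and the helper name `build_reset_commands` are hypothetical; only the `iba_report` command and its options come from the step above. The sketch prints the commands rather than running them.

```python
from typing import List

# Command from the procedure step, written with plain ASCII hyphens.
RESET_COMMAND = ["iba_report", "-C", "-o", "none"]


def build_reset_commands(fm_servers: List[str], user: str = "root") -> List[List[str]]:
    """Return one ssh argument list per master fabric management server."""
    return [["ssh", f"{user}@{host}", *RESET_COMMAND] for host in fm_servers]


if __name__ == "__main__":
    # Hypothetical host names; substitute your own master fabric management servers.
    for argv in build_reset_commands(["fm01", "fm02"]):
        print(" ".join(argv))
```

Each returned list can be passed to `subprocess.run` once you have confirmed that the host list is correct for your site.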

Monitoring and checking for fabric problems

Fabric problems can surface in several different ways. While the subnet manager and switch logging are the main reporting mechanisms, there are other methods for checking for problems.

To monitor and check for fabric problems, complete the following steps.

1. Inspect the Cluster Systems Management/Management Server (CSM/MS) /var/log/csm/errors/[CSM/MS hostname] log for subnet manager and switch log entries.

2. Run the Fast Fabric Health Check tool.
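The log inspection in step 1 can be partly automated. The following is a minimal sketch that filters a log for lines mentioning the subnet manager or a switch. The keyword list and the function names are assumptions made for illustration; this document does not specify how such entries are tagged in the log, so adjust the keywords to match what your log actually contains.

```python
from typing import Iterable, List, Sequence

# Assumed search terms; the real entry format is not specified here.
DEFAULT_KEYWORDS = ("subnet manager", "switch")


def find_fabric_entries(lines: Iterable[str],
                        keywords: Sequence[str] = DEFAULT_KEYWORDS) -> List[str]:
    """Return the lines that mention any keyword, ignoring case."""
    wanted = [k.lower() for k in keywords]
    return [line.rstrip("\n") for line in lines
            if any(k in line.lower() for k in wanted)]


def scan_log(path: str, keywords: Sequence[str] = DEFAULT_KEYWORDS) -> List[str]:
    """Read the log file at the given path and return the matching entries."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return find_fabric_entries(log, keywords)
```

To use it, call `scan_log` with the path of the log under /var/log/csm/errors/ on the CSM/MS and review the returned lines.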


Appendix. Notices

This information was developed for products and services offered in the U.S.A.

The manufacturer may not offer the products, services, or features discussed in this document in other countries. Consult the manufacturer's representative for information on the products and services currently available in your area. Any reference to the manufacturer's product, program, or service is not intended to state or imply that only that product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any intellectual property right of the manufacturer may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any product, program, or service.

The manufacturer may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to the manufacturer.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: THIS INFORMATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. The manufacturer may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to Web sites not owned by the manufacturer are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this product and use of those Web sites is at your own risk.

The manufacturer may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning products not produced by this manufacturer was obtained from the suppliers of those products, their published announcements or other publicly available sources. This manufacturer has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to products not produced by this manufacturer. Questions on the capabilities of products not produced by this manufacturer should be addressed to the suppliers of those products.

All statements regarding the manufacturer's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.


The manufacturer's prices shown are the manufacturer's suggested retail prices, are current and are subject to change without notice. Dealer prices may vary.

This information is for planning purposes only. The information herein is subject to change before the products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

If you are viewing this information in softcopy, the photographs and color illustrations may not appear.

The drawings and specifications contained herein shall not be reproduced in whole or in part without the written permission of the manufacturer.

The manufacturer has prepared this information for use with the specific machines indicated. The manufacturer makes no representations that it is suitable for any other purpose.

The manufacturer's computer systems contain mechanisms designed to reduce the possibility of undetected data corruption or loss. This risk, however, cannot be eliminated. Users who experience unplanned outages, system failures, power fluctuations or outages, or component failures must verify the accuracy of operations performed and data saved or transmitted by the system at or near the time of the outage or failure. In addition, users must establish procedures to ensure that there is independent data verification before relying on such data in sensitive or critical operations. Users should periodically check the manufacturer's support websites for updated information and fixes applicable to the system and related software.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at Copyright and trademark information at www.ibm.com/legal/copytrade.shtml.

INFINIBAND, InfiniBand Trade Association, and the INFINIBAND design marks are trademarks and/or service marks of the INFINIBAND Trade Association.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Electronic emission notices

Class A Notices

The following Class A statements apply to the IBM servers that contain the POWER6 processor.

Federal Communications Commission (FCC) statement

Note: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference to radio communications. Operation of this equipment in a residential area is likely to cause harmful interference, in which case the user will be required to correct the interference at his own expense.

Properly shielded and grounded cables and connectors must be used in order to meet FCC emission limits. IBM is not responsible for any radio or television interference caused by using other than recommended cables and connectors or by unauthorized changes or modifications to this equipment. Unauthorized changes or modifications could void the user's authority to operate the equipment.

This device complies with Part 15 of the FCC rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation.

Industry Canada Compliance Statement

This Class A digital apparatus complies with Canadian ICES-003.

Avis de conformité à la réglementation d'Industrie Canada

Cet appareil numérique de la classe A est conforme à la norme NMB-003 du Canada.

European Community Compliance Statement

This product is in conformity with the protection requirements of EU Council Directive 2004/108/EC on the approximation of the laws of the Member States relating to electromagnetic compatibility. IBM cannot accept responsibility for any failure to satisfy the protection requirements resulting from a non-recommended modification of the product, including the fitting of non-IBM option cards.

This product has been tested and found to comply with the limits for Class A Information Technology Equipment according to European Standard EN 55022. The limits for Class A equipment were derived for commercial and industrial environments to provide reasonable protection against interference with licensed communication equipment.

European Community contact:
IBM Technical Regulations
Pascalstr. 100, Stuttgart, Germany 70569
Tele: 0049 (0)711 785 1176
Fax: 0049 (0)711 785 1283
E-mail: [email protected]

Warning: This is a Class A product. In a domestic environment, this product may cause radio interference, in which case the user may be required to take adequate measures.

VCCI Statement - Japan

The following is a summary of the VCCI Japanese statement in the box above:


This is a Class A product based on the standard of the VCCI Council. If this equipment is used in a domestic environment, radio interference may occur, in which case, the user may be required to take corrective actions.

Japanese Electronics and Information Technology Industries Association (JEITA) Confirmed Harmonics Guideline (products less than or equal to 20 A per phase)

Japanese Electronics and Information Technology Industries Association (JEITA) Confirmed Harmonics Guideline with Modifications (products greater than 20 A per phase)

Electromagnetic Interference (EMI) Statement - People's Republic of China

Declaration: This is a Class A product. In a domestic environment this product may cause radio interference in which case the user may need to perform practical action.

Electromagnetic Interference (EMI) Statement - Taiwan

The following is a summary of the EMI Taiwan statement above.

Warning: This is a Class A product. In a domestic environment this product may cause radio interference in which case the user will be required to take adequate measures.

IBM Taiwan Contact Information:


Electromagnetic Interference (EMI) Statement - Korea

Please note that this equipment has obtained EMC registration for commercial use. In the event that it has been mistakenly sold or purchased, please exchange it for equipment certified for home use.

Germany Compliance Statement

Deutschsprachiger EU Hinweis: Hinweis für Geräte der Klasse A EU-Richtlinie zur Elektromagnetischen Verträglichkeit

Dieses Produkt entspricht den Schutzanforderungen der EU-Richtlinie 2004/108/EG zur Angleichung der Rechtsvorschriften über die elektromagnetische Verträglichkeit in den EU-Mitgliedsstaaten und hält die Grenzwerte der EN 55022 Klasse A ein.

Um dieses sicherzustellen, sind die Geräte wie in den Handbüchern beschrieben zu installieren und zu betreiben. Des Weiteren dürfen auch nur von der IBM empfohlene Kabel angeschlossen werden. IBM übernimmt keine Verantwortung für die Einhaltung der Schutzanforderungen, wenn das Produkt ohne Zustimmung der IBM verändert bzw. wenn Erweiterungskomponenten von Fremdherstellern ohne Empfehlung der IBM gesteckt/eingebaut werden.

EN 55022 Klasse A Geräte müssen mit folgendem Warnhinweis versehen werden:
"Warnung: Dieses ist eine Einrichtung der Klasse A. Diese Einrichtung kann im Wohnbereich Funk-Störungen verursachen; in diesem Fall kann vom Betreiber verlangt werden, angemessene Maßnahmen zu ergreifen und dafür aufzukommen."

Deutschland: Einhaltung des Gesetzes über die elektromagnetische Verträglichkeit von Geräten

Dieses Produkt entspricht dem “Gesetz über die elektromagnetische Verträglichkeit von Geräten (EMVG)“. Dies ist die Umsetzung der EU-Richtlinie 2004/108/EG in der Bundesrepublik Deutschland.

Zulassungsbescheinigung laut dem Deutschen Gesetz über die elektromagnetische Verträglichkeit von Geräten (EMVG) (bzw. der EMC EG Richtlinie 2004/108/EG) für Geräte der Klasse A.

Dieses Gerät ist berechtigt, in Übereinstimmung mit dem Deutschen EMVG das EG-Konformitätszeichen - CE - zu führen.

Verantwortlich für die Konformitätserklärung nach des EMVG ist die IBM Deutschland GmbH, 70548 Stuttgart.


Generelle Informationen:

Das Gerät erfüllt die Schutzanforderungen nach EN 55024 und EN 55022 Klasse A.

Electromagnetic Interference (EMI) Statement - Russia

Terms and conditions

Permissions for the use of these publications are granted subject to the following terms and conditions.

Personal Use: You may reproduce these publications for your personal, noncommercial use provided that all proprietary notices are preserved. You may not distribute, display or make derivative works of these publications, or any portion thereof, without the express consent of the manufacturer.

Commercial Use: You may reproduce, distribute and display these publications solely within your enterprise provided that all proprietary notices are preserved. You may not make derivative works of these publications, or reproduce, distribute or display these publications or any portion thereof outside your enterprise, without the express consent of the manufacturer.

Except as expressly granted in this permission, no other permissions, licenses or rights are granted, either express or implied, to the publications or any data, software or other intellectual property contained therein.

The manufacturer reserves the right to withdraw the permissions granted herein whenever, in its discretion, the use of the publications is detrimental to its interest or, as determined by the manufacturer, the above instructions are not being properly followed.

You may not download, export or re-export this information except in full compliance with all applicable laws and regulations, including all United States export laws and regulations.

THE MANUFACTURER MAKES NO GUARANTEE ABOUT THE CONTENT OF THESE PUBLICATIONS. THESE PUBLICATIONS ARE PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.


Printed in USA