
Dolphin Express for MySQL

Installation and Reference Guide

Dolphin Interconnect Solutions ASA


Dolphin Express for MySQL: Installation and Reference Guide

by Dolphin Interconnect Solutions ASA

This document describes the Dolphin Express software stack version 3.3.0.

Published November 13th, 2007
Copyright © 2007 Dolphin Interconnect Solutions ASA

Published under the GNU General Public License v2


Table of Contents

Abstract
1. Introduction & Overview
    1. Who needs Dolphin Express and SuperSockets?
    2. How do Dolphin Express and SuperSockets work?
    3. Terminology
    4. Contact & Feedback: Dolphin Support
2. Requirements & Planning
    1. Supported Platforms
        1.1. Hardware
            1.1.1. Supported Platforms
            1.1.2. Recommended Node Hardware
            1.1.3. Recommended Frontend Hardware
        1.2. Software
            1.2.1. Linux
            1.2.2. Windows
            1.2.3. Solaris
            1.2.4. Others
    2. Interconnect Planning
        2.1. Nodes to Equip with Dolphin Express Interconnect
            2.1.1. MySQL Cluster
        2.2. Interconnect Topology
        2.3. Physical Node Placement
3. Initial Installation
    1. Overview
    2. Installation Requirements
        2.1. Live Installation
        2.2. Non-GUI Installation
            2.2.1. No X / GUI on Frontend
            2.2.2. No X / GUI Anywhere
    3. Adapter Card Installation
    4. Software and Cable Installation
        4.1. Overview
        4.2. Starting the Software Installation
        4.3. Working with the dishostseditor
            4.3.1. Cluster Edit
            4.3.2. Node Arrangement
            4.3.3. Cabling Instructions
        4.4. Cluster Cabling
            4.4.1. Connecting the cables
            4.4.2. Verifying the Cabling
        4.5. Finalising the Software Installation
            4.5.1. Static SCI Connectivity Test
            4.5.2. SuperSockets Configuration Test
            4.5.3. SuperSockets Performance Test
        4.6. Handling Installation Problems
        4.7. Interconnect Validation with sciadmin
            4.7.1. Installing sciadmin
            4.7.2. Starting sciadmin
            4.7.3. Cluster Overview
            4.7.4. Cabling Correctness Test
            4.7.5. Fabric Quality Test
        4.8. Making Cluster Application use Dolphin Express
            4.8.1. Generic Socket Applications
            4.8.2. Native SCI Applications
            4.8.3. Kernel Socket Services
4. Update Installation
    1. Complete Update
    2. Rolling Update
5. Manual Installation
    1. Installation under Load
    2. Installation of a Heterogeneous Cluster
    3. Manual RPM Installation
        3.1. RPM Package Structure
        3.2. RPM Build and Installation
    4. Unpackaged Installation
6. Interconnect and Software Maintenance
    1. Verifying Functionality and Performance
        1.1. Low-level Functionality and Performance
            1.1.1. Availability of Drivers and Services
            1.1.2. Cable Connection Test
            1.1.3. Static Interconnect Test
            1.1.4. Interconnect Load Test
            1.1.5. Interconnect Performance Test
        1.2. SuperSockets Functionality and Performance
            1.2.1. SuperSockets Status
            1.2.2. SuperSockets Functionality
        1.3. SuperSockets Utilization
    2. Replacing SCI Cables
    3. Replacing a PCI-SCI Adapter
    4. Physically Moving Nodes
    5. Replacing a Node
    6. Adding Nodes
    7. Removing Nodes
7. MySQL Operation
    1. MySQL Cluster
        1.1. SuperSockets Poll Optimization
        1.2. NDBD Deadlock Timeout
        1.3. SCI Transporter
    2. MySQL Replication
8. Advanced Topics
    1. Notification on Interconnect Status Changes
        1.1. Interconnect Status
        1.2. Notification Interface
        1.3. Setting Up and Controlling Notification
            1.3.1. Configure Notification via the dishostseditor
            1.3.2. Configure Notification Manually
            1.3.3. Verifying Notification
            1.3.4. Disabling and Enabling Notification Temporarily
    2. Managing IRM Resources
        2.1. Updates with Modified IRM Configuration
9. FAQ
    1. Hardware
    2. Software
A. Self-Installing Archive (SIA) Reference
    1. SIA Operating Modes
        1.1. Full Cluster Installation
        1.2. Node Installation
        1.3. Frontend Installation
        1.4. Installation of Configuration File Editor
        1.5. Building RPM Packages Only
        1.6. Extraction of Source Archive
    2. SIA Options
        2.1. Node Specification
        2.2. Installation Path Specification
        2.3. Installing from Binary RPMs
        2.4. Preallocation of SCI Memory
        2.5. Enforce Installation
        2.6. Configuration File Specification
        2.7. Batch Mode
        2.8. Non-GUI Build Mode
        2.9. Software Removal
B. sciadmin Reference
    1. Startup
    2. Interconnect Status View
        2.1. Icons
        2.2. Operation
            2.2.1. Cluster Status
            2.2.2. Node Status
    3. Node and Interconnect Control
        3.1. Admin Menu
        3.2. Cluster Menu
        3.3. Node Menu
        3.4. Cluster Settings
        3.5. Adapter Settings
    4. Interconnect Testing & Diagnosis
        4.1. Cable Test
        4.2. Traffic Test
C. Configuration Files
    1. Cluster Configuration
        1.1. dishosts.conf
            1.1.1. Basic settings
            1.1.2. SuperSockets settings
            1.1.3. Miscellaneous Notes
        1.2. networkmanager.conf
        1.3. cluster.conf
    2. SuperSockets Configuration
        2.1. supersockets_profiles.conf
        2.2. supersockets_ports.conf
    3. Driver Configuration
        3.1. dis_irm.conf
            3.1.1. Resource Limitations
            3.1.2. Memory Preallocation
            3.1.3. Logging and Messages
        3.2. dis_ssocks.conf
D. Platform Issues and Software Limitations
    1. Platforms with Known Problems
    2. IRM
    3. SuperSockets


Abstract

This document describes the installation of the Dolphin Interconnect Solutions (DIS) Dolphin Express interconnect hardware and the DIS software stack, including SuperSockets, on single machines or on a cluster of machines. This software stack is needed to use Dolphin's Dolphin Express high-performance interconnect products and consists of drivers (kernel modules), user space libraries and applications, an SDK, documentation and more. SuperSockets drastically accelerate generic socket communication as used by clustered applications.


Chapter 1. Introduction & Overview

1. Who needs Dolphin Express and SuperSockets?

Clustered applications running on multiple machines that communicate via an Ethernet-based network often suffer from the delays that occur when data needs to be exchanged between processes running on different machines. These delays, caused by the communication time, make processes wait for data when they could otherwise perform useful work. Dolphin Express is a combination of high-performance interconnect hardware that replaces the Ethernet network and a highly optimized software stack.

One part of this software stack is SuperSockets, which implements a bypass of the TCP/UDP/IP protocol stack for standard socket-based inter-process communication. This bypass moves data directly via the high-performance interconnect and thereby reduces the minimal latency, typically by a factor of 10 or more, with 100% binary application compatibility. Using this combined software/hardware approach with MySQL Cluster, throughput improvements of 300% and more for the TPC-C-like DBT2 benchmark have already been measured on small clusters. For larger clusters, this advantage continues to increase, as the communication fraction of the processing time grows.

2. How do Dolphin Express and SuperSockets work?

The Dolphin Express hardware provides the means for a process on one machine to write data directly into the address space of a process running on a remote machine. This can be done using either direct store operations of the CPU (for lowest latency) or the DMA engine of the Dolphin Express interconnect adapter (for lowest CPU utilization).

SuperSockets consists of both kernel modules and a user-space library. The kernel-level implementation makes sure that the SuperSockets socket implementation is fully compatible with the TCP/UDP/IP-based sockets provided by the operating system. By being explicitly preloaded, the user-space library operates between the unmodified application binary and the operating system and intercepts all socket-related function calls. Based on the system configuration and an optional user-provided configuration, the library makes a first decision whether a function call will be processed by SuperSockets or by the standard socket implementation, and redirects it accordingly. The SuperSockets kernel module then performs the operation on the Dolphin Express interconnect. If necessary, it can transparently fall back to Ethernet (and forward again) even while the socket is under load.
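To illustrate the preload mechanism only (the library file name below is an assumption and may differ from the actual installation; consult the installed software stack), an unmodified socket application could be started with the SuperSockets user-space library preloaded like this:

    # Hypothetical preload of the SuperSockets user-space library for one application run.
    # /opt/DIS is the default installation path; the library name is an assumption.
    LD_PRELOAD=/opt/DIS/lib64/libksupersockets.so ./my_socket_application

All socket-related calls made by my_socket_application are then intercepted by the preloaded library and redirected to SuperSockets or to the standard socket implementation, as described above.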

3. Terminology

We define some terms that will be used throughout this document.

adapter: A PCI-to-SCI (D33x series), PCI-Express-to-SCI (D35x series) or PCI-Express fabric (DXH series) adapter. This is the Dolphin Express hardware installed in the cluster nodes.

node: A computer which is part of the Dolphin Express interconnect, which means it has an adapter installed. All nodes together constitute the cluster.

CPU architecture: The CPU architecture relevant in this guide is characterized by the addressing width of the CPU (32 or 64 bit) and the instruction set (x86, Sparc, etc.). If these two characteristics are identical, the CPU architecture is identical for the scope of this guide.

link: A directed point-to-point connection in the SCI interconnect. Physically, a link is the cable leading from the output of one adapter to the input of another adapter.

ringlet: For an SCI interconnect configured in a torus topology, the links are connected as multiple closed rings. For a two-dimensional torus topology, like when using D352 adapters, these rings can be considered to be the columns and rows. These rings are called ringlets.

frontend: The single computer that is running software that monitors and controls the nodes in the cluster. For increased fault tolerance, the frontend should not be part of the Dolphin Express interconnect it controls, although this is possible. Instead, the frontend should communicate with the nodes out-of-band, which means via Ethernet.

installation machine: The installation script is typically executed on the frontend, but can also be executed on another machine that is neither a node nor the frontend, but has network (ssh) access to all nodes and the frontend. This machine is the installation machine.

kernel build machine: The interconnect drivers are kernel modules and thus need to be built for the exact kernel running on the nodes (otherwise, the kernel will refuse to load them). To build kernel modules on a machine, the kernel-specific include files and kernel configuration have to be installed; these are not installed by default on most distributions. You will need to have one kernel build machine available which has these files installed (contained in the kernel-devel RPM that matches the installed kernel version) and that runs the exact same kernel version as the nodes. Typically, the kernel build machine is one of the nodes itself, but you can choose to build the kernel modules on any other machine that fulfills the requirements listed above.

cluster: All nodes constitute the cluster.

network manager: The network manager is a daemon process named dis_networkmgr running on the frontend. It is part of the Dolphin software stack and manages and controls the cluster using the node managers running on all nodes. The network manager knows the interconnect status of all nodes.

node manager: The node manager is a daemon process running on each node that provides remote access to the interconnect driver and other node status for the network manager. It reports status and performs actions like configuring the installed adapter or changing the interconnect routing table if necessary.

self-installing archive (SIA): A self-installing archive (SIA) is a single executable shell command file (for Linux and Solaris) that is used to compile and install the Dolphin software stack in all required variants. It largely simplifies the deployment and management of a Dolphin Express-based cluster.

Scalable Coherent Interface (SCI): Scalable Coherent Interface is one of the interconnect implementations that can be used with Dolphin Express software, like SuperSockets and SISCI. SCI is an IEEE standard; the implementations offered by Dolphin are the D33x and D35x series of adapter cards.

SISCI: SISCI (Software Infrastructure for SCI) is the user-level API to create applications that make direct use of the Dolphin Express interconnect capabilities. Despite its inherited name, it also supports other interconnect implementations offered by Dolphin, like DSX.

4. Contact & Feedback: Dolphin Support

If you have any problems with the procedures described in this document, or have suggestions for improvement, please don't hesitate to contact Dolphin's support team via <[email protected]>. For updated versions of the software stack and this document, please check the download section at http://www.dolphinics.com.


Chapter 2. Requirements & Planning

Before you deploy a Dolphin Express solution, either by adding it to an existing system or by planning it into a new system, some considerations on the selection of products and the physical setup are necessary.

1. Supported Platforms

The Dolphin Express software stack is designed to run on all current cluster hardware and software platforms, and it also supports and adapts to platforms that are several years old to ensure long-term support. Generally, Dolphin strives to support every platform that can run any version of Windows, Linux or Solaris and offers a PCI (Express) slot. Next to this general approach, we qualify certain platforms with our partners, which are then guaranteed to run and perform optimally for the qualified application. We also test platforms internally and externally for general functionality and performance. For details, please see Appendix D, Platform Issues and Software Limitations.

1.1. Hardware

1.1.1. Supported Platforms

The Dolphin Express hardware (interconnect adapters) complies with the PCI industry standard (either PCI 2.2 64 bit/66 MHz or PCI-Express 1.0a) and will thus operate in any machine that offers compliant slots. Supported CPU architectures are x86 (32 and 64 bit), PowerPC and PowerPC64, Sparc and IA-64.

However, some combinations of CPU and chipset implementations offer sub-optimal performance, which should be considered when planning a new system. A few cases are documented in which bugs in the chipset have shown up with our interconnect, as it puts a lot of load onto the related components.

For the hardware platforms qualified or tested with Dolphin Express, please see Appendix D, Platform Issues and Software Limitations. If you have questions about your specific hardware platform, please contact Dolphin support.

1.1.2. Recommended Node Hardware

The hardware platform for the nodes should be chosen from the supported platforms described above. In addition to the Dolphin Express specific requirements, you need to consult your MySQL Cluster expert or consultant on the recommended configuration for your application.

You need to make sure that each node / machine has one full-height, half-length PCI/PCI-X/PCI-Express slot available. The power consumption of Dolphin Express adapters is between 5W and 15W (consult the separate data sheets for details).

Note

Half-height slots can be used with the Dolphin DXH series of adapters. A half-height version of the SCI adapters will be available soon; please contact Dolphin support for availability.

The Dolphin Express interconnect is fully inter-operable between all supported hardware platforms, even with different PCI or CPU architectures. As usual, care must be taken by the applications if data with different endianness is communicated.

1.1.3. Recommended Frontend Hardware

The frontend only runs a lightweight network manager service, which does not impose special hardware requirements. However, the frontend should not be fully loaded, to ensure fast operation of the network manager service. The frontend requires a reliable Ethernet connection to all nodes.

1.2. Software

Dolphin Express supports a variety of operating systems that are listed below.


1.2.1. Linux

The Dolphin Express software stack can be compiled for all 2.6 kernel versions and most 2.4 kernel versions. A few extra packages (like the kernel include files and configuration) need to be installed for the compilation. Dolphin only provides source-based distributions, which are compiled for the exact kernel and hardware version you are using. Software stacks operating on different kernel versions are of course fully inter-operable for inter-node communication.

Dolphin Express fully supports native 32-bit and 64-bit platforms. On 64-bit platforms offering both 32-bit and 64-bit runtime environments, SuperSockets will support 32-bit applications if the compilation environment for 32-bit is also installed. Otherwise, only the native 64-bit runtime environment is supported. For more information, please refer to the FAQ chapter, Q: 2.1.6.
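As a quick, hedged illustration (package and architecture naming varies between distributions), you can check on a 64-bit node whether the 32-bit build environment is present:

    # List installed glibc-devel packages together with their architecture;
    # both a 64-bit and a 32-bit (i386/i686) entry should appear if 32-bit support is wanted.
    rpm -q --queryformat '%{NAME}-%{VERSION}.%{ARCH}\n' glibc-devel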

Please refer to the release notes of the software stack version you are about to install for the current list of tested Linux distributions and kernel versions. Installation and operation on Linux distributions and kernel versions that are not in this list will usually work as well, but especially the most recent Linux versions may cause problems if they have not yet been qualified by Dolphin.

1.2.2. Windows

The Dolphin Express software stack operates on 32- and 64-bit versions of Windows NT 4.0, Windows 2000, Windows 2003 Server and Windows XP. We provide MSI binary installer packages for each Windows version.

TBA: More information on the available components under Windows.

1.2.3. Solaris

Solaris™ 2.6 through 9 on Sparc is supported (excluding SuperSockets).

MySQL Cluster can be used via the SCI Transporter™ provided by MySQL, using the SISCI interface provided by Dolphin.

Note

Support for Solaris 10™ on Sparc™ and AMD64 (x86_64) including SuperSockets is under development. Ask Dolphin support for the current status.

1.2.4. Others

The Dolphin Express software stack excluding SuperSockets also runs on VxWorks™, LynxOS™ and HP-UX™. Contact Dolphin support with your requirements.

2. Interconnect Planning

This section discusses the decisions that are necessary when planning to install a Dolphin Express interconnect.

2.1. Nodes to Equip with Dolphin Express Interconnect

Depending on the application that will run on the cluster, the choice of machines to equip with the Dolphin Express interconnect differs.

2.1.1. MySQL Cluster

For best performance, all machines that run either NDB or MySQL server processes should be interconnected with the Dolphin Express interconnect. Although it is possible to equip only a subset of the machines with Dolphin Express, doing so will introduce new bottlenecks.

Machines that serve as application servers (clients sending queries to MySQL server processes) typically have little benefit from being part of the interconnect. Please analyze the individual scenario for a definitive recommendation.


Machines that serve as MySQL frontends (for example, machines that run the MySQL Cluster management daemon ndb_mgmd) do not benefit from the Dolphin Express interconnect.

Note

The machine that runs the Dolphin network manager must not be equipped with the Dolphin Express interconnect, as this would reduce the level of fault tolerance.

2.2. Interconnect Topology

For small clusters of just two nodes, a number of possible approaches exist:

• Lowest cost: Connect the two nodes with one single-channel D351 adapter in each node.

• Highest Performance: Connect the two nodes with one dual-channel D350 adapter in each node. This increases the bandwidth and adds redundancy at the same time.

• Best Scalability: Use D352 adapters to connect the two nodes. The second dimension of this 2D adapter will not be used, but it is possible to expand this cluster in a fault-tolerant way.

When going to 4 nodes or more, the topology has to be a multi-dimensional torus, typically a 2D torus built with D352 PCI-Express-to-SCI adapters. With D352 adapters, you can build clusters of any size up to a recommended maximum of 256 nodes. This is the typical topology for database clusters. For large clusters, it is also possible to use a 3D torus topology.

For 3- and 4-node clusters, special topologies without any fail-over delays can be built.

It is possible to install more than one interconnect fabric in a cluster by installing two adapters into each node. These interconnect fabrics work independently of each other and very efficiently and transparently increase the bandwidth and throughput (a real factor of two for two fabrics) while adding redundancy at the same time.

Note

For some chipsets, PCI performance does not scale well, reducing the performance improvement of a second fabric. If this feature is important for you, contact Dolphin support to make sure that you choose a chipset for which the full performance will be delivered.

2.3. Physical Node Placement

Generally, nodes that are to be equipped with Dolphin Express interconnect adapters should be placed close to each other to keep cable lengths short. This reduces costs and allows for better routing of the cables.

For larger clusters, it makes sense to arrange the nodes in analogy to the regular 2D torus topology of the interconnect.

The maximum cable length is 10m for copper cables. For situations where nodes need to be placed at a significant distance, it is possible to use fiber instead of copper for the interconnect. Please ask Dolphin support for details.

The minimal cable bend radius is 25mm, which allows nodes to be placed 5cm apart. Connecting nodes or blades that are less than 5cm apart, like 1U nodes in a 19" rack that are a single rack unit apart, typically causes no problem as long as the effective bend radius is 25mm or more. Please contact Dolphin support for other cabling scenarios.

Once you have decided on the physical node placement, the cables should be ordered in the corresponding lengths. The Dolphin sales engineer will assist you in selecting the right cable lengths.


Chapter 3. Initial Installation

This chapter guides you through the initial hardware and software installation of Dolphin Express and SuperSockets. This means that no Dolphin software is installed on the nodes or the frontend prior to these instructions. To update an existing installation, please refer to Chapter 4, Update Installation; to add new nodes to a cluster, please refer to “Adding Nodes”.

The recommended installation procedure, which is described in this chapter, uses the Self-Installing Archive (SIA) distribution of the Dolphin Express software stack, which can be used for all Linux versions that support the RPM package format. Installation on other Linux platforms is covered in Section 4, “Unpackaged Installation”.

1. Overview

The initial installation of Dolphin Express hardware and software follows these steps, which are described in detail in the following sections:

1. Verification of installation requirements:

• Study Section 2, “Installation Requirements”.

• The setup script itself will also verify that these requirements are met and indicate what is missing.

2. Installation of interconnect adapters on the nodes (see Section 3, “Adapter Card Installation”).

Note

The cables should not be installed in this step!

3. Installation of software and cables. This step is refined in Section 4, “Software and Cable Installation”.

2. Installation Requirements

For the SIA-based installation of the full cluster and the frontend, the following requirements have to be met:

• Homogeneous cluster nodes: All nodes of the cluster are of the same CPU architecture and run the same kernel version. The frontend machine may be of a different CPU architecture and kernel version!

Note

The installation of the Dolphin Express software on a system that does not satisfy this requirement is described in Section 2, “Installation of a Heterogeneous Cluster”.

• RPM support: The Linux distribution on the nodes, the frontend and the installation machine needs to support RPM packages. Both major distributions from Red Hat and Novell (SuSE) use RPM packages.

Note

On platforms that do not support RPM packages, it is also possible to install the Dolphin Express software. Please see Section 4, “Unpackaged Installation” for instructions.

• Installed RPM packages: To build the Dolphin Express software stack, a few RPM packages that are often not installed by default are required:

qt and qt-devel (> version 3.0.5), glibc-devel and libgcc (32- and 64-bit, depending on which binary formats should be supported), rpm-build, and the kernel header files and configuration (typically a kernel-devel or kernel-source RPM that exactly(!) matches the version of the installed kernel). An illustrative example of checking and installing these packages is shown after this list.


Note

The SIA will check for these packages, report what packages might be missing and will offer to install them if the yum RPM management system is supported on the affected machine. All required RPM packages are within the standard set of RPM packages offered for your Linux distribution, but may not be installed by default.

If the qt RPMs are not available, the Dolphin Express software stack can be built nevertheless, but the GUI applications to configure and manage the cluster will not be available. Please see below (Section 2.2, “Non-GUI Installation”) on how to install the software stack in this case.

• GUI support: For the initial installation, the installation machine should be able to run GUI applications via X.

Note

If the required configuration files are already available prior to the installation, a GUI is not required (see Section 2.2, “Non-GUI Installation”).

• Disk space: To build the RPM packages, about 500MB of free disk space in the system's temporary directory (typically /tmp on Linux) is required on the kernel build machine and the frontend.

Note

It is possible to direct the SIA to use a specific temporary directory for building with the --build-root option.
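As a rough sketch of preparing a node for the build (the package list is taken from the requirements above; the yum command assumes a configured repository, and the exact --build-root syntax and SIA file name are assumptions):

    # Check which of the required packages are already installed
    rpm -q qt qt-devel glibc-devel libgcc rpm-build kernel-devel

    # Install anything that is missing (requires yum and a configured repository)
    yum install qt qt-devel glibc-devel libgcc rpm-build kernel-devel

    # Run the SIA with a build directory other than the default temporary directory
    sh DIS_install_3.3.0.sh --build-root /scratch/dis-build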

2.1. Live Installation

Dolphin Express™ can be installed into a cluster which is currently in operation, without stopping the cluster application. This requires that the application running on the cluster can cope with single nodes going down. It is only necessary to turn off each node once to install the adapter. The software installation can be performed under load, although minor performance impacts are possible. For a description of this installation type, please proceed as described in Section 1, “Installation under Load”.

2.2. Non-GUI Installation

The Dolphin software includes two GUI tools:

• dishostseditor is a tool that is used to create the interconnect configuration file /etc/dis/dishosts.conf and the network manager configuration file /etc/dis/networkmanager.conf. It is needed once for the initial cluster installation, and each time nodes are added to or removed from the cluster.

• sciadmin is used to monitor and control the cluster interconnect.

2.2.1. No X / GUI on Frontend

If the frontend does not support running GUI applications, but another machine in the network does, it is possible to run the installation on this machine. The only requirement is ssh access from the installation machine to the frontend and all nodes. This installation mode is chosen by executing the SIA on the installation machine and specifying the frontend name when asked for it.

In this scenario, the dishostseditor will be compiled, installed and executed on the installation machine, and the generated configuration files will be transferred to the frontend by the installer.

2.2.2. No X / GUI Anywhere

If no machine in the network has the capability to run GUI applications, you can still use the SIA-based installation. In this case, it is necessary to create the correct configuration files on another machine and store them in /etc/dis on the frontend before executing the SIA on the frontend (not on another machine).


In this scenario, no GUI application is run at all during the installation. To create the configuration files on another machine, you can either run the SIA with the --install-editor option if it is a Linux machine, or install a binary version of the dishostseditor if it is a Windows-based machine (a minimal sketch follows the list below). Alternatively, you can send the necessary information for creating the configuration files to Dolphin support, which will then provide you with the matching configuration files and the cabling instructions. This information includes:

• external hostnames (or IP addresses) of all nodes

• adapter type and number of fabrics (1 or 2)

• hostnames (or IP addresses/subnets) which should be accelerated with SuperSockets (default is the list of hostnames provided above)

• planned interconnect topology (default is derived from number of nodes and adapter type)

• description of how nodes are physically located (to avoid cabling problems)
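If a Linux machine with X support is available, a minimal sketch of the --install-editor route looks as follows (the SIA file name, the host name and the copy step are illustrative assumptions, not prescribed by this guide):

    # On the machine with GUI support: build and start the configuration file editor only
    sh DIS_install_3.3.0.sh --install-editor

    # Copy the generated configuration files to the frontend (here called 'frontend')
    scp /etc/dis/dishosts.conf /etc/dis/networkmanager.conf root@frontend:/etc/dis/

Afterwards, execute the SIA on the frontend itself, as described above.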

3. Adapter Card Installation

To install an adapter into a node, the node needs to be powered down. Insert the adapter into a free PCI slot that matches the adapter:

• any type of PCI slot for D33x adapters

• 4x, 8x or 16x PCI-Express slot for D351 and D352

• 8x or 16x PCI-Express slot for D350 adapters

Make sure you are properly grounded to avoid static discharges that may destroy the hardware. When the card is properly inserted and fixed, close the enclosure and power up the node again. The LED(s) on the adapter slot cover have to light orange.

Proceed this way with all nodes. It is recommended that all nodes are set up identically, so use the same slot for the adapter (if the node hardware is identical).

You do not yet need to connect the cables at this point: detailed cabling instructions customized for the specific cluster will be created during the software installation, which will guide you through the cabling. Generally, the cables can safely be connected and disconnected with the nodes powered up, as SCI is a fully hot-plug capable interconnect.

Note

If you know how to connect the cables, you can do so now. Please be advised to inspect and verify the cabling for correctness as described in the remainder of this chapter, though.

4. Software and Cable Installation

After the adapters are installed, the software has to be installed next. On the nodes, the hardware driver and additional kernel modules, user space libraries and the node manager have to be installed. On the frontend, the network manager and the cluster administration tool will be installed.

An additional RPM for SISCI development (SISCI-devel) will be created for both the frontend and the nodes, but will not be installed. It can be installed as needed in case SISCI-based applications or libraries (like NMPI) need to be compiled from source.

4.1. Overview

The integrated cluster and frontend installation is the default operation of the SIA, but can be specified explicitly with the --install-all option. It works as follows:


• The SIA is executed on the installation machine with root permissions. The installation machine is typically the machine that will serve as frontend, but can be any other machine if necessary (see Section 2.2.1, “No X / GUI on Frontend”). The SIA controls the building, installation and test operations on the remote nodes via ssh. Therefore, password-less ssh to all remote nodes is required.

If password-less ssh access is not set up between the installation machine, frontend and nodes, the SIA offers to set this up during the installation. The root passwords for all machines are required for this.

• The binary RPMs for the nodes and the frontend are built on the kernel build machine and the frontend, respectively. The kernel build machine needs to have the kernel headers and configuration installed, while the frontend and the installation machine only compile user-space applications.

• The node RPMs with the kernel modules are installed on all nodes, the kernel modules are loaded and the node manager is started. At this stage, the interconnect is not yet configured.

• On an initial installation, the dishostseditor is installed and executed on the installation machine to create the cluster configuration files. This requires user interaction.

• The cluster configuration files are transferred to the frontend, and the network manager is installed and started on the frontend. It will in turn configure all nodes according to the configuration files. The cluster is now ready to utilize the Dolphin Express interconnect.

• A number of tests are executed to verify that the cluster is functional and to get basic performance numbers.

For other operation modes, such as installing specific components on the local machine, please refer to Appendix A, Self-Installing Archive (SIA) Reference.

4.2. Starting the Software Installation

Log into the chosen installation machine, become root and make sure that the SIA file is stored in a directory with write access (/tmp is fine). Execute the script:

# sh DIS_install_<version>.sh

The script will ask questions to retrieve information for the installation. You will notice that all questions are yes/no questions, and that the default answer is marked by a capital letter, which can be chosen by just pressing Enter. A typical installation looks like this:

[root@scimple tmp]# sh DIS_install_3.3.0.sh
Verifying archive integrity... All good.
Uncompressing Dolphin DIS 3.3.0
#* Logfile is /tmp/DIS_install.log_140 on tiger-0

#*
#+ Dolphin ICS - Software installation (version: 1.52 $ of: 2007/11/09 16:31:32 $)
#+

#* Installing a full cluster (nodes and frontend) .

#* This script will install Dolphin Express drivers, tools and services
#+ on all nodes of the cluster and on the frontend node.
#+
#+ All available options of this script are shown with option '--help'

# >>> OK to proceed with cluster installation? [Y/n]y

# >>> Will the local machine <tiger-0> serve as frontend? [Y/n]y

The default choice is to use the local machine as frontend. If you answer n, the installer will ask you for the hostname of the designated frontend machine. Each cluster needs its own frontend machine.

Please note that the complete installation is logged to a file which is shown at the very top (here: /tmp/DIS_install.log_140). In case of installation problems, this file is very useful to Dolphin support.


#* NOTE: Cluster configuration files can be specified now, or be generated
#+ ..... during the installation.
# >>> Do you have a 'dishosts.conf' file that you want to use for installation? [y/N]n

Because this is the initial installation, no installed configuration files could be found. If you have prepared or received configuration files, they can be specified now by answering y. In this case, no GUI application needs to run during the installation, allowing for a shell-only installation.

For the default answer, the hostnames of the nodes need to be specified (see below), and the cluster configuration is created later on using the GUI application dishostseditor.

#* NOTE:
#+ No cluster configuration file (dishosts.conf) available.
#+ You can now specify the nodes that are attached to the Dolphin
#+ Express interconnect. The necessary configuration files can then
#+ be created based on this list of nodes.
#+
#+ Please enter hostname or IP addresses of the nodes one per line.
#* When done, enter a single colon ('.').
#+ (proposed hostname is given in [brackets])
# >>> node hostname/IP address <colon '.' when done> []tiger-1
# >>> node hostname/IP address <colon '.' when done> [tiger-2]-> tiger-2
# >>> node hostname/IP address <colon '.' when done> [tiger-3]-> tiger-3
# >>> node hostname/IP address <colon '.' when done> [tiger-4]-> tiger-4
# >>> node hostname/IP address <colon '.' when done> [tiger-5]-> tiger-5
# >>> node hostname/IP address <colon '.' when done> [tiger-6]-> tiger-6
# >>> node hostname/IP address <colon '.' when done> [tiger-7]-> tiger-7
# >>> node hostname/IP address <colon '.' when done> [tiger-8]-> tiger-8
# >>> node hostname/IP address <colon '.' when done> [tiger-9]-> tiger-9
# >>> node hostname/IP address <colon '.' when done> [tiger-10].

The hostnames or IP addresses of all nodes need to be entered. Where possible, the installer suggests hostnames in brackets. To accept a suggestion, just press Enter. Otherwise, enter the hostname or IP address. The data entered is verified to represent an accessible hostname. If a node has multiple IP addresses / hostnames, make sure you specify the one that is visible to the installation machine and the frontend.

When all hostnames have been entered, enter a single '.' to finish.

#* NOTE:
#+ The kernel modules need to be built on a machine with the same kernel
#* version and architecture of the interconnect node. By default, the first
#* given interconnect node is used for this. You can specify another build
#* machine now.
# >>> Build kernel modules on node tiger-1 ? [Y/n]y

If you answer n at this point, you can enter the hostname of another machine on which the kernel modules will be built. Make sure it matches the nodes in CPU architecture and kernel version.
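A hedged way to check this match manually (the host name tiger-1 is taken from the example transcript; adapt it to your cluster):

    # Kernel release and architecture must be identical on the build machine and the nodes
    uname -r -m
    ssh root@tiger-1 uname -r -m

    # The kernel headers matching the running kernel must be installed on the build machine
    rpm -q kernel-devel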

# >>> Can you access all machines (local and remote) via password-less ssh? [Y/n]y

The installer will later verify that the password-less ssh access actually works. If you answer n, the installer will set up password-less ssh for you on all nodes and the frontend. You will need to enter the root password once for each node and for the frontend.

The password-less ssh access remains active after the installation. To disable it again, remove the file /root/.ssh/authorized_keys from all nodes and the frontend.
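If you prefer to set up password-less ssh yourself before running the installer, a minimal sketch using standard OpenSSH tooling (not specific to the Dolphin installer; host names are from the example transcript) is:

    # Generate a key pair on the installation machine if none exists yet
    ssh-keygen -t rsa

    # Copy the public key to the frontend and to each node (repeat per host)
    ssh-copy-id root@tiger-0
    ssh-copy-id root@tiger-1

To revoke this access later, remove the corresponding key from /root/.ssh/authorized_keys on the nodes and the frontend.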

#* NOTE:


#+ It is recommnended that interconnect nodes are rebooted after the
#+ initial driver installation to ensure that large memory allocations will succeed.
#+ You can omitt this reboot, or do it anytime later if necesary.
# >>> Reboot all interconnect nodes (tiger-1 tiger-2 tiger-3 tiger-4 tiger-5 tiger-6 tiger-7 tiger-8 tiger-9)? [y/N]n

For optimal performance, the low-level driver needs to allocate some amount of kernel memory. This allocation can fail on a system that has been under load for a long time. If you are not installing on a live system, rebooting the nodes is therefore offered here. You can perform the reboot manually later on to achieve the same effect.

If chosen, the reboot will be performed by the installer without interrupting the installation procedure.

#* NOTE:
#+ About to INSTALL Dolphin Express interconnect drivers on these nodes:
... tiger-1
... tiger-2
... tiger-3
... tiger-4
... tiger-5
... tiger-6
... tiger-7
... tiger-8
... tiger-9
#+ About to BUILD Dolphin Express interconnect drivers on this node:
... tiger-1
#+ About to install management and control services on the frontend machine:
... tiger-0
#* Installing to default target path /opt/DIS on all machines.
#+ (or the current installation path if this is an update installation).
# >>> OK to proceed? [Y/n]y

The installer presents an installation summary and asks for confirmation. If you answer n at this point, the installer will exit and the installation needs to be restarted.

#* NOTE:
#+ Testing ssh-access to all cluster nodes and gathering configuration.
#+
#+ If you are asked for a password, the ssh access to this node without
#+ password is not working. In this case, you need to interrupt with CTRL-c
#+ and restart the script answering 'no' to the intial question about ssh.
... testing ssh to tiger-1
... testing ssh to tiger-2
... testing ssh to tiger-3
... testing ssh to tiger-4
... testing ssh to tiger-5
... testing ssh to tiger-6
... testing ssh to tiger-7
... testing ssh to tiger-8
... testing ssh to tiger-9
#+ OK: ssh access is working
#+ OK: nodes are homogenous
#* OK: found 1 interconnect fabric(s).

#* Testing ssh to other nodes
... testing ssh to tiger-1
... testing ssh to tiger-0
... testing ssh to tiger-0
#* OK.

The ssh access is tested, and some basic information is gathered from the nodes to verify that the nodes are homogeneous, are equipped with at least one Dolphin Express adapter, and meet the other requirements. If a required RPM package were missing, it would be indicated here with the option to install it (if yum can be used), or to fix the problem manually and retry.

If the test for homogeneous nodes fails, please refer to Section 2, “Installation of a Heterogeneous Cluster” for information on how to install the software stack.

#* Building node RPM packages on tiger-1 in /tmp/tmp.AEgiO27908


#+ This will take some minutes...
#* Logfile is /tmp/DIS_install.log_983 on tiger-1

#* OK, node RPMs have been built.

#* Building frontend RPM packages on scimple in /tmp/tmp.dQdwS17511
#+ This will take some minutes...
#* Logfile is /tmp/DIS_install.log_607 on scimple

#* OK, frontend RPMs have been built.

#* Copying RPMs that have been built:
/tmp/frontend_RPMS/Dolphin-NetworkAdmin-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-NetworkHosts-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-NetworkManager-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SISCI-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SCI-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SuperSockets-3.3.0-1.x86_64.rpm

The binary RPM packages matching the nodes and the frontend are built and copied to the directory from which the installer was invoked. They are placed into the subdirectories node_RPMS and frontend_RPMS for later use (see the SIA option --use-rpms).

#* To install/update the Dolphin Express services like SuperSockets, all running
#+ Dolphin Express services needs to be stopped. This requires that all user
#+ applications using SuperSockets (if any) need to be stopped NOW.
# >>> Stop all DolpinExpress services (SuperSockets) NOW? [Y/n]y

#* OK: all Dolphin Express services (if any) stopped for upgrade.

On an initial installation, there will be no user applications using SuperSockets yet, so you can simply answer y right away.

#* Installing node tiger-1
#* OK.
#* Installing node tiger-2
#* OK.
#* Installing node tiger-3
#* OK.
#* Installing node tiger-4
#* OK.
#* Installing node tiger-5
#* OK.
#* Installing node tiger-6
#* OK.
#* Installing node tiger-7
#* OK.
#* Installing node tiger-8
#* OK.
#* Installing node tiger-9
#* OK.
#* Installing machine scimple as frontend.
#* NOTE:
#+ You need to create the cluster configuration files 'dishosts.conf'
#+ and 'networkmanager.conf' using the graphical tool 'dishostseditor'
#+ which will be launched now.
#+
#+ If the interconnect cables are not yet installed, you can create
#+ detailed cabling instruction within this tool (File -> Get Cabling Instructions).
#+ Then install the cables while this script is waiting.

# >>> Are all cables connected, and do all LEDs on the SCI adapters light green? [Y/n]

The nodes get installed, and the drivers and the node manager are started. Then, the basic packages are installed on the frontend, and the dishostseditor application is launched to create the required configuration files /etc/dis/dishosts.conf and /etc/dis/networkmanager.conf if they do not already exist. The script will wait at this point until the configuration files have been created with dishostseditor, and until you confirm that all cables have been connected according to the cabling instructions. This is described in the next section.

For typical problems at this point of the installation, please refer to Chapter 9, FAQ.

4.3. Working with the dishostseditor

dishostseditor is a GUI tool that helps gather the cluster configuration (it is used to create the cluster configuration file /etc/dis/dishosts.conf and the network manager configuration file /etc/dis/networkmanager.conf). A few global interconnect properties need to be set, and the position of each node within the interconnect topology needs to be specified.

4.3.1. Cluster Edit

When dishostseditor is launched, it first displays a dialog box where the global interconnect properties need to be specified (see Figure 3.1, “Cluster Edit dialog of dishostseditor”).

Figure 3.1. Cluster Edit dialog of dishostseditor

4.3.1.1. Topology

The dialog lets you enter the selected topology information (number of nodes in X-, Y- and Z-dimension) according to the topology type you selected. The number of nodes in the cluster needs to be equal to the product of these dimensions (for regular topologies) or less (for irregular topology variants). The number of fabrics needs to be set to the minimum number of adapters installed in any node.

The topology settings should already be correct by default if dishostseditor is launched by the installation script. If the cables are not yet mounted (which is the recommended way of doing it), simply choose the settings that match the way you plan to install.

However, if the cables are already in place, it is critical to verify that the actual cable installation matches the dimensions shown here if you install a cluster with a 2D- or 3D-torus interconnect topology. For example, a 12 node cluster can be set up as 3 by 4, 4 by 3 or even 2 by 6; the setup script cannot verify that the cabling matches the dimensions that you selected. Remember that link 0 on the adapter boards (the one where the plug is right on the PCB of the adapter board) is mapped to the X-dimension, and link 1 on the adapter board (the one where the plug is on the piggy-back board) is mapped to the Y-dimension.

4.3.1.2. SuperSockets Network Address

If your cluster operates within its own subnet and you want all nodes within this subnet to use SuperSockets (having Dolphin Express installed), you can simplify the configuration by specifying the address of this subnet in this dialog. To do so, activate the Network Address field and enter the cluster IP subnet address including the mask. For example, if all your nodes communicate via an IP interface with an address of the form 192.168.4.*, you would enter 192.168.4.0/24 here. If the cluster has its own subnet, this option is recommended.

SuperSockets will try to use the Dolphin Express interconnect for any node in this subnet when it connects to another node of this subnet. If using Dolphin Express is not possible, e.g. because one or both nodes are only equipped with an Ethernet interface, SuperSockets will automatically fall back to Ethernet. Also, if a node gets assigned a new IP address within this subnet, you don't need to change the SuperSockets configuration. Assigning more than one subnet to SuperSockets is also possible, but this type of configuration is not yet supported by dishostseditor. See Section 1.1, “dishosts.conf” on how to edit dishosts.conf accordingly.

If this type of configuration is not possible in your environment, you need to configure SuperSockets for each node as described in the following section.

4.3.1.3. Status Notification

In case you want to be informed about any change of the interconnect status (e.g. an interconnect link was disabled due to errors, or a node has gone down and the interconnect traffic was rerouted), activate the checkbox Alert target and enter the alert target and the alert script to be executed. The default alert script is alert.sh and will send an e-mail to the address specified as alert target.

Other alert scripts can be created and used, which may require another type of alert target (e.g. a cell phone number to send an SMS to). For more information on using status notification, please refer to Section 1, “Notification on Interconnect Status Changes”.

4.3.2. Node Arrangement

In the next step, the main pane of the dishostseditor will present the nodes in the cluster arranged in the topology that was selected in the previous dialog. To change this topology and other general interconnect settings, you can always click Edit in the Cluster Configuration area, which will bring up the Cluster Edit dialog again.

If the font settings of your X server cause dishostseditor to print unreadable characters, you can change the font size and type with the drop-down box at the top of the window, next to the floppy disk icon.


Figure 3.2. Main dialog of dishostseditor

At this point, you need to arrange the nodes (marked by their hostnames) such that the placement of each node in the torus as shown by dishostseditor matches its placement in the physical torus. You do this by assigning the correct hostname for each node by double-clicking its node icon, which will open the configuration dialog of this node. In this dialog, select the correct machine name, which is the hostname as seen from the frontend, from the drop-down list. You can also type a hostname if a hostname that you specified during the installation was wrong.


Figure 3.3. Node dialog of dishostseditor

After you have assigned the correct hostname to this machine, you may need to configure SuperSockets on this node. If you selected the Network Address in the cluster configuration dialog (see above), then SuperSockets will use this subnet address and will not allow for editing this property on the nodes. Otherwise, you can choose between 3 different options for each of the currently supported 2 SuperSockets-accelerated IP interfaces per node:

disable Do not use SuperSockets. If you set this option for both fields, SuperSockets can not be used with this node, although the related kernel modules will still be loaded.

static Enter the hostname or IP address for which SuperSockets should be used. This hostname or IP address will be statically assigned to this physical node (its Dolphin Express interconnect adapter).

Choosing a static socket means that the mapping between the node (its adapters) and the specified hostname/IP address is static and will be specified within the configuration file dishosts.conf. All nodes will use this identical file (which is automatically distributed from the frontend to the nodes by the network manager) to perform this mapping.

This option works fine if the nodes in your cluster don't change their IP addresses over time.

dynamic Enter the hostname or IP address for which SuperSockets should be used. This hostname or IP address will be dynamically resolved to the Dolphin Express interconnect adapter that is installed in the machine with this hostname/IP address. SuperSockets will therefore resolve the mapping between adapters and hostnames/IP addresses dynamically. This incurs a certain overhead when the first connection is set up, but as the mapping is cached, this overhead is negligible.

This option is similar to using a subnet, but resolves only the explicitly specified IP addresses and not all IP addresses of a subnet. Use this option if nodes change their IP addresses or node identities move between physical machines, e.g. in a fail-over setup.

4.3.3. Cabling Instructions

You should now generate the cabling instructions for your cluster. Please do this even when the cables are already installed: you really want to verify that the actual cable setup matches the topology you just specified. To create the cabling instructions, choose the menu item File -> Create Cabling Instructions. You can save and/or print the instructions. It is a good idea to print the instructions so you can take them with you to the cluster.


4.4. Cluster Cabling

If the cables are already connected, please proceed with Section 4.4.2, “Verifying the Cabling”.

Note

In order to achieve trouble-free operation of your cluster, setting up the cables correctly is critical. Please take your time to perform this task properly.

The cables can be installed while nodes are powered up. The setup script will wait with a question for you to continue:

# >>> Are all cables connected, and do all LEDs on the SCI adapters light green? [Y/n]

4.4.1. Connecting the cables

Please proceed by connecting the nodes as described by the cabling instructions generated by the dishostseditor. The cabling instructions refer to link 0 and link 1 if you are using D352 adapters (for 2D-torus topology), and channel A and channel B in case of D350 adapters being used (for dual-channel operation). Each of the two links/channels will form an independent ring with its adjacent adapters, and thus has an IN and OUT connector to connect to these adjacent adapters. It is critical that you correctly locate the different links/channels on the back of the card, and the IN and OUT connectors.

For D352 (D350), link 0 (channel A) is formed by the connectors that are directly connected to the PCB (printed circuit board) of the adapter, while the connectors for link 1 (channel B) are located on the piggy-back board. This is illustrated in Figure 3.4, “Location of link 0 and link 1 on D352 adapter” and Figure 3.5, “Location of channel A and channel B on D350 adapter”, respectively. For both links (channels), the IN connectors are located at the lower end of the adapter, and the OUT connectors at the top of the adapter. The D351 adapter has only a single link (0), and the same location of IN and OUT connectors.

Figure 3.4. Location of link 0 and link 1 on D352 adapter

Figure 3.5. Location of channel A and channel B on D350 adapter

Please consider the hints below for connecting the cables:

• Never apply force:

• The plugs of the cable will move into the sockets easily. Make sure the orientation is correct.

• The cables have a minimum bend diameter of 5cm.


Note

This specification applies to black All-Best cables (part number D706), but not to the grey CHH cables (part number D707). With the CHH cables, the minimum bend diameter is 10cm.

• Fasten evenly. When fastening the screws of the plugs, make sure you fasten both lightly before tightening them. Do not tighten only one screw of the plug, and then the other one, as this is likely to tilt the plug within the connector.

• Fasten gently. Use a torque screwdriver if possible, and apply a maximum of 0.4 Nm. As a rule of thumb: do not apply more torque with the screwdriver than you possibly could using only your fingers (if there was enough space to grip the screw).

• Observe LEDs: When an adapter has both input and output of a link connected to its neighboring adapter, the LED should turn green and emit a steady light (not blinking).

• Don't mix up links: When using a 2D-torus topology, it is important not to connect link 0 of one adapter with link 1 of another adapter. As described above, link 0 is the left pair of connectors on the Dolphin Express SCI interconnect adapter when the adapter is placed in a vertical position. To determine which side is the left one, hold the Dolphin Express interconnect adapter in a vertical position:

• the blue "O" (indicating the OUT port) should be located at the top.

• LEDs are also placed on the top of the adapter

• the PCI/PCI-X/PCI-Express bus connector is mounted on the lower side of the adapter

The left pair of connectors on the Dolphin Express interconnect adapter is what we refer to as link 0. Link 1 is the right pair of connectors on the Dolphin Express interconnect adapter when the adapter is placed in a vertical position.

Note

If the links have been mixed up, the LED will still turn green, but packet routing will fail. The cabling test of sciadmin will reveal such cabling errors.

4.4.2. Verifying the Cabling

Important

A green link LED indicates that the link between the output plug and input plug could be established and synchronized. It does not assure that the cable is actually placed correctly! It is therefore important to verify once more that the cables are plugged according to the cabling instructions generated by the dishostseditor!

If a pair of LEDs does not turn green, please perform the following steps:

• Disconnect the cables. Make sure you connect an Output with an Input plug. Re-insert and fasten the plug according to the guidelines above.

• If the LEDs still do not turn green, use a different cable.

• If the LEDs still do not turn green, swap the cable of the problematic connection with a working one and observe if the problem moves with the cable.

• Power-cycle the nodes with the orange LEDs according to Q: 1.1.1.

• Contact Dolphin support if you cannot make the LEDs turn green after trying all proposed measures.


When you are done connecting the cables, all LEDs have turned green and you have verified the connections, you can answer "Yes" to the question "Are all cables connected, and do all LEDs on the SCI adapters light green?" and proceed with the next section to finalize the software installation.

4.5. Finalising the Software Installation

Once the cables are connected, no more user interaction is required. Please confirm that all cables are connected and all LEDs shine green, and the installation will proceed. The network manager will be started on the frontend, configuring all cluster nodes according to the configuration specified in dishosts.conf. After this, a number of tests are run on the cluster to verify that the SCI interconnect was set up correctly and delivers the expected performance. You will see output like this:

#* NOTE: checking for cluster configuration to take effect:
... node tiger-1:
... node tiger-2:
... node tiger-3:
... node tiger-4:
... node tiger-5:
... node tiger-6:
... node tiger-7:
... node tiger-8:
... node tiger-9:
#* OK.

#* Installing remaining frontend packages

#* NOTE:
#+ To compile SISCI applications (like NMPI), the SISCI-devel RPM needs to be
#+ installed. It is located in the frontend_RPMS and node_RPMS directories.

#* OK.

If no problems are reported (like in the example above), you are done with the installation and can start to use your Dolphin Express accelerated cluster. Otherwise, refer to the next subsections and Section 4.7, “Interconnect Validation with sciadmin” to learn about the individual tests and how to fix problems reported by each test.

4.5.1. Static SCI Connectivity Test

The Static SCI Connectivity Test verifies that links are up and all nodes can see each other via the SCI interconnect. Success in this test means that all adapters have been configured correctly, and that the cables are inserted properly. It should report TEST RESULT: *PASSED* for all nodes:

#* NOTE: Testing static interconnect connectivity between nodes.
... node tiger-1: TEST RESULT: *PASSED*
... node tiger-2: TEST RESULT: *PASSED*
... node tiger-3: TEST RESULT: *PASSED*
... node tiger-4: TEST RESULT: *PASSED*
... node tiger-5: TEST RESULT: *PASSED*
... node tiger-6: TEST RESULT: *PASSED*
... node tiger-7: TEST RESULT: *PASSED*
... node tiger-8: TEST RESULT: *PASSED*
... node tiger-9: TEST RESULT: *PASSED*

If this test reports errors or warnings, you are offered the option to re-run dishostseditor to validate and possibly fix the interconnect configuration. If the problems persist, you should let the installer continue and analyse the problems using sciadmin after the installation finishes (see Section 4.7, “Interconnect Validation with sciadmin”).


4.5.2. SuperSockets Configuration Test

The SuperSockets Configuration Test verifies that all nodes have the same valid SuperSockets configuration (as shown by /proc/net/af_sci/socket_maps).

#* NOTE: Verifying SuperSockets configuration on all nodes.
#+ No SuperSocket configuration problems found.

Success in this test means that the SuperSockets service dis_supersockets is running and is configured identically on all nodes. If a failure is reported, it means that the interconnect configuration did not propagate correctly to this node. You should check if the dis_nodemgr service is running on this node. If not, start it, wait for a minute, and then configure SuperSockets by calling dis_ssocks_cfg.
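A minimal recovery sketch, assuming the Red Hat service command and the default installation path /opt/DIS (use /etc/init.d/dis_nodemgr on other Linux variants, and adjust the path to dis_ssocks_cfg if you installed elsewhere):

# service dis_nodemgr status
# service dis_nodemgr start
# sleep 60
# /opt/DIS/sbin/dis_ssocks_cfg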

4.5.3. SuperSockets Performance Test

The SuperSockets Performance Test runs a simple socket benchmark between two of the nodes. The benchmark is run once via Ethernet and once via SuperSockets, and performance is reported for both cases.

#* NOTE:
#+ Verifying SuperSockets performance for tiger-2 (testing via tiger-1).
#+ Checking Ethernet performance
... single-byte latency: 56.63 us
#+ Checking Dolphin Express SuperSockets performance
... single-byte latency: 3.00 us
... Latency rating: Very good. SuperSockets are working well.
#+ SuperSockets performance tests done.

The SuperSockets latency is rated based on our platform validation experience. If the rating indicates that SuperSockets are not performing as expected, or if it shows that a fallback to Ethernet has occurred, please contact Dolphin Support. In this case, it is important that you supply the installation log (see above).

The installation finishes with the option to start the administration GUI tool sciadmin, a hint to use LD_PRELOAD to make use of SuperSockets and a pointer to the binary RPMs that have been used for the installation.

#* OK: Cluster installation completed.

#+ Remember to use LD_PRELOAD=libksupersockets.so for all applications that
#+ should use Dolphin Express SuperSockets.
# >>> Do you want to start the GUI tool for interconnect adminstration (sciadmin)? [y/N]n

#* RPM packages that were used for installation are stored in
#+ /tmp/node_RPMS and /tmp/frontend_RPMS.

4.6. Handling Installation Problems

If for some reason the installation was not successful, you can easily and safely repeat it by simply invoking the SIA again. Please consider:

• By default, existing RPM packages of the same or even more recent version will not be replaced. To enforce re-installation with the version provided by the SIA, you need to specify --enforce.

• To avoid that the binary RPMs are built again, use the option --use-rpms or simply run the SIA in the same directory as before, where it can find the RPMs in the node_RPMS and frontend_RPMS subdirectories.

• To start an installation from scratch, you can run the SIA on each node and the frontend using the option --wipe to remove all traces of the Dolphin Express software stack and start again (see the example below).
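A minimal sketch of this procedure; the exact SIA file name depends on the version you downloaded:

# sh DIS_install_<version>.sh --wipe          (run on each node and on the frontend)
# sh DIS_install_<version>.sh --install-all   (then run on the frontend to reinstall)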

If you still fail to install the software successfully, you should contact Dolphin support. Please provide all installation logfiles. Every installation attempt creates a differently named logfile; its name is printed at the very beginning of the installation. Please also include the configuration files that can be found in /etc/dis on the frontend.

4.7. Interconnect Validation with sciadmin

Dolphin provides a graphical tool named sciadmin. sciadmin serves as a single point of control to manage the Dolphin Express interconnect in your cluster. It shows an overview of the status of all adapters and links of a cluster and allows you to perform detailed status queries. It also provides means to manually control the interconnect, inspect and set options and perform interconnect tests. For a complete description of sciadmin, please refer to Appendix B, sciadmin Reference. Here, we will only describe how to use sciadmin to verify the newly installed Dolphin Express interconnect.

4.7.1. Installing sciadmin

sciadmin has been installed on the frontend machine by the SIA if this machine is capable of running X applications and has the Qt toolkit installed. If the frontend does not have these capabilities, you can install it on any other machine that has them, using the SIA with the --install-frontend option, or use the Dolphin-NetworkAdmin RPM package from the frontend_RPMS directory (this RPM will only be there if it could be built for the frontend).

It is also possible to download a binary version for Windows that runs without the need for extra compilation or installation.

You can use sciadmin on any machine that can connect to the network manager on the frontend via a standard TCP/IP socket. You have to make sure that connections towards the frontend using the ports 3445 (sciadmin), 3444 (network manager) and 3443 (node manager) are possible (potentially firewall settings need to be changed).
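On machines that run an iptables-based firewall, a minimal sketch for opening these ports could look as follows; rule placement and persistence depend on your distribution and existing firewall policy:

# iptables -A INPUT -p tcp --dport 3443:3445 -j ACCEPT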

4.7.2. Starting sciadmin

sciadmin will be installed in the sbin directory of the installation path (default: /opt/DIS/sbin/sciadmin). It will be within the PATH after you log in as root, but can also be run by non-root users. After it has been started, you will need to connect to the network manager controlling your cluster. Click the Connect button in the tool bar and enter the appropriate hostname or IP address of the network manager. sciadmin will present you with a graphical representation of the cluster nodes and the interconnect links between them.
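For example, to launch it from the default installation path on the frontend (assuming an X display is available):

# /opt/DIS/sbin/sciadmin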

4.7.3. Cluster Overview

Normally, all nodes and interconnect links should be shown green, meaning that their status is OK. This is a requirement for a correctly installed and configured cluster and you may proceed to Section 4.7.4, “Cabling Correctness Test”.

If a node is plotted red, it means that the network manager cannot connect to the node manager on this node. To solve this problem:

1. Make sure that the node is powered and has booted the operating system.

2. Verify that the node manager service is running:

On Red Hat:

# service dis_nodemgr status

On other Linux variants:

# /etc/init.d/dis_nodemgr status

should tell you that the node manager is running. If this is not the case:

a. Try to start the node manager:

On Red Hat:

# service dis_nodemgr start

On other Linux variants:

# /etc/init.d/dis_nodemgr start

should tell you that the node manager has started successfully.

b. If the node manager fails to start, please see /var/log/dis_nodemgr.log


c. Make sure that the service is configured to start in the correct runlevel (Dolphin installation makes sure this is the case).

On Red Hat:

# chkconfig --level 2345 dis_nodemgr on

On other Linux variants, please refer to the system documentation to determine the required steps; a sketch for Debian-style systems follows below.
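As an illustration only, on a Debian-style system the equivalent step is typically performed with update-rc.d (the exact runlevels and mechanism depend on the distribution):

# update-rc.d dis_nodemgr defaults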

4.7.4. Cabling Correctness Test

sciadmin can validate that all cables are connected according to the configuration that was specified in the dishostseditor, and which is now stored in /etc/dis/dishosts.conf on all nodes and the frontend. To perform the cable test, select Cluster -> Test Cable Connections. This Cabling Correctness Test runs for only a few seconds and will verify that the nodes are cabled according to the configuration provided by the dishostseditor.

Warning

Running this test will stop the normal traffic over the interconnect as the routing needs to be changed. If you run this test while your cluster is in production, you might experience communication timeouts. SuperSockets in operation will fall back to Ethernet during this test, which also leads to increased communication delays.

If the test detects a problem, it will inform you that node A cannot communicate with node B although they are supposed to be within the same ringlet. You will typically get more than one error message in case of a cabling problem, as such a problem does in most cases affect more than one pair of nodes. Please proceed as follows:

1. Try to fix the first reported problem by tracing the cable connections from node A to node B:

a. Verify that the cable connections are placed within one ringlet:

i. Look up the path of cable connections between node A and node B in the Cabling Instructions that you created (or still can create at this point) using dishostseditor.

ii. When you arrive at node B, do the same check for the path back from node B to node A.

b. Along the path, make sure:

i. That each cable plug is securely fitted into the socket of the adapter.

ii. Each cable plug is connected to the right link (0 or 1) as indicated by the cabling instructions.

2. If you can't find a problem for the first error reported, verify the cable connections for all other node pairs reported as bad.

3. After the first change, re-run the cable test to verify if this change solves all problems. If this is not the case, start over with this verification loop.

4.7.5. Fabric Quality Test

The Cabling Correctness Test performs only minimal communication between two nodes to determine the functionality of the fabric between them. To verify the actual signal quality of the interconnect fabric, a more intense test is required. Such a Fabric Quality Test can be started for each installed interconnect fabric (0 or 1) from within sciadmin via Cluster -> Fabric * Test Traffic.

Warning

Running this test will stop the normal traffic over the interconnect as the routing needs to be changed. If you run this test while your cluster is in production, you might experience communication timeouts. SuperSockets in operation will fall back to a second fabric (if installed) or to Ethernet during this test, which also leads to increased communication delays.

This test will run for a few minutes, depending on the size of your cluster, as it tests communication for about 20 seconds between each pair of nodes within the same ring. This means, for a 4 by 4 2D-torus cluster which features 8 rings with 4 nodes each, it will take 8 * (3 + 2 + 1) * 20 seconds = 16 minutes. It will then report if any CRC errors or other problems have occurred between any pairs of nodes.

Note

Any communication errors reported here are either corrected automatically by retrying a data transfer (as for CRC errors), or are reported. Thus, a communication error does not mean data might get lost. However, every communication error reduces performance, and an optimally set up Dolphin Express interconnect should not show any communication errors. A small number of communication errors is acceptable, though. Please contact Dolphin support if in doubt.

If the test reports communication errors, please proceed as follows:

1. If errors are reported between multiple pairs of nodes, locate the pair of nodes which is located closest together (has the smallest number of cable connections between them). Normally, if any errors are reported, a pair of nodes located next to each other will show up.

2. Check the cable connection on the shortest path between these two nodes (a single cable, if the nodes are located next to each other) for being properly mounted:

a. No excessive stress on the cable, like bending it too sharply or too much force on the plugs.

b. Cable plugs need to be placed in the connectors on the adapters evenly (not tilted) and securely fastened. If in doubt, unplug the cable and re-fasten it.

3. Perform the previous check for all other node pairs; then re-run the test.

4. If communication errors persist, change cables to locate a possibly damaged cable:

a. Exchange the cables between the closest pair of nodes one-by-one with a cable of a connection for which no errors have been reported. Remember (note down) which cables you exchanged.

b. Run the Fabric Quality Test after each cable exchange.

i. If the communication errors move with the cable you just exchanged, then this cable might be damaged. Please contact your sales representative for an exchange.

ii. If the communication error remains unchanged, the problem might be with one of the adapters. Please contact Dolphin support for further analysis.

4.8. Making Cluster Application use Dolphin Express

After the Dolphin Express hard- and software has been installed and tested, you will want your cluster application to make use of the increased performance.

4.8.1. Generic Socket Applications

All applications that use generic BSD sockets for communication will be accelerated by SuperSockets. No configuration change is required for the application as the same host names/IPv4 addresses can be used. All relevant socket types are supported by SuperSockets: TCP stream sockets as well as UDP and RDS datagram sockets.

SuperSockets will use the Dolphin Express interconnect for low-latency, high-bandwidth communication inside the cluster, and will transparently fall back to Ethernet when connecting to nodes outside the cluster.

To make an application use SuperSockets, you need to preload a dynamic library on application start. This can be achieved in two ways as described in the next two sections.


4.8.1.1. Launch via Wrapper Script

To let generic socket applications use SuperSockets™, you just need to run them via the wrapper script dis_ssocks_run that sets the LD_PRELOAD environment variable accordingly. This script is installed to the bin directory of the installation (default is /opt/DIS/bin) which is added to the default PATH environment variable.

To have e.g. the socket benchmark netperf run via SuperSockets™, start the server process on node server_name like

dis_ssocks_run netperf

and the client process on any other node in the cluster like

dis_ssocks_run netperf -h server_name

4.8.1.2. Launch with LD_PRELOAD

As an alternative to using this wrapper script, you can also make sure to set LD_PRELOAD correctly to preload the SuperSockets library, e.g. for sh-style shells such as bash:

export LD_PRELOAD=libksupersockets.so
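Instead of exporting the variable for the whole shell session, it can also be set just for a single command. A sketch, re-using the netperf example from the previous section (server_name stands for the host running the server process):

LD_PRELOAD=libksupersockets.so netperf -h server_name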

4.8.1.3. Troubleshooting

If the applications you are using do not show increased performance, please verify that they use SuperSockets as follows:

1. To verify that the preloading works, use the ldd command on any executable, e.g. the netperf binary mentioned above:

$ export LD_PRELOAD=libksupersockets.so
$ ldd netperf
        libksupersockets.so => /opt/DIS/lib64/libksupersockets.so (0x0000002a95577000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x00000033ed300000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x00000033ec800000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00000033ecb00000)
        /lib64/ld-linux-x86-64.so.2 (0x00000033ec600000)

The library libksupersockets.so has to be listed at the top position. If this is not the case, make sure the library file actually exists. The default locations are /opt/DIS/lib/libksupersockets.so and /opt/DIS/lib64/libksupersockets.so on 64-bit platforms, and libksupersockets.so actually is a symbolic link to a library with the same name and a version suffix:

$ ls -lR /opt/DIS/lib*/*ksupersockets*
-rw-r--r-- 1 root root 29498 Nov 14 12:43 /opt/DIS/lib64/libksupersockets.a
-rw-r--r-- 1 root root   901 Nov 14 12:43 /opt/DIS/lib64/libksupersockets.la
lrwxrwxrwx 1 root root    25 Nov 14 12:50 /opt/DIS/lib64/libksupersockets.so -> libksupersockets.so.3.3.0
lrwxrwxrwx 1 root root    25 Nov 14 12:50 /opt/DIS/lib64/libksupersockets.so.3 -> libksupersockets.so.3.3.0
-rw-r--r-- 1 root root 65160 Nov 14 12:43 /opt/DIS/lib64/libksupersockets.so.3.3.0
-rw-r--r-- 1 root root 19746 Nov 14 12:43 /opt/DIS/lib/libksupersockets.a
-rw-r--r-- 1 root root   899 Nov 14 12:43 /opt/DIS/lib/libksupersockets.la
lrwxrwxrwx 1 root root    25 Nov 14 12:50 /opt/DIS/lib/libksupersockets.so -> libksupersockets.so.3.3.0
lrwxrwxrwx 1 root root    25 Nov 14 12:50 /opt/DIS/lib/libksupersockets.so.3 -> libksupersockets.so.3.3.0
-rw-r--r-- 1 root root 48731 Nov 14 12:43 /opt/DIS/lib/libksupersockets.so.3.3.0

Also, make sure that the dynamic linker is configured to find it in this place. The dynamic linker is configured accordingly on installation of the RPM; if you did not install via RPM, you need to configure the dynamic linker manually. To verify that the dynamic linking is the problem, set LD_LIBRARY_PATH to include the path to libksupersockets.so and verify again with ldd:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/DIS/lib:/opt/DIS/lib64
$ echo $LD_PRELOAD
libksupersockets.so
$ ldd netperf
....

A better solution than setting LD_LIBRARY_PATH is to configure the dynamic linker ld to include these directories in its search path. Use man ldconfig to learn how to achieve this.
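A minimal sketch of such a configuration, assuming the default installation paths and a distribution that reads drop-in files from /etc/ld.so.conf.d; the file name dis.conf is an arbitrary choice, and the RPM installation normally performs an equivalent step for you:

# echo /opt/DIS/lib > /etc/ld.so.conf.d/dis.conf
# echo /opt/DIS/lib64 >> /etc/ld.so.conf.d/dis.conf
# ldconfig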


2. You need to make sure that the preloading of the SuperSockets library described above is effective on both nodes, for both applications that should communicate via SuperSockets.

3. Make sure that the SuperSockets kernel module (and the kernel modules it depends on) are loaded and configured correctly on both nodes.

1. Check the status of all Dolphin kernel modules via the dis_services script (default location /opt/DIS/sbin):

# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) running.

At least the services dis_irm and dis_supersockets need to be running, and you should not see a message about SuperSockets not being configured.

2. Verify the configuration of the SuperSockets to make sure that all cluster nodes will connect and communicate via SuperSockets. The active configuration is shown in /proc/net/af_sci/socket_maps:

# cat /proc/net/af_sci/socket_maps
IP/net           Adapter   NodeId List
-----------------------------------------------
172.16.5.1/32    0x0000    4 0 0
172.16.5.2/32    0x0000    8 0 0
172.16.5.3/32    0x0000    68 0 0
172.16.5.4/32    0x0000    72 0 0

Depending on the configuration variant you used to set up SuperSockets, the content of this file may look different, but it must never be empty and should be identical on all nodes. The example above shows a four-node cluster with a single fabric and a static SuperSockets configuration, which will accelerate one socket interface per node.

For more information on the configuration of SuperSockets, please refer to Section 1.1, “dishosts.conf”.

3. Make sure that the host names/IP addresses used effectively by the application are the ones that are configured for SuperSockets, especially if the nodes have multiple Ethernet interfaces configured.

4. Check the system log for messages of the SuperSockets kernel module. It will report all problems, e.g. when running out of resources.

# cat /var/log/messages | grep dis_ssocks

It is a good idea to monitor the system log while you try to connect to a remote node if you suspect problems being reported there:

# tail -f /var/log/messages

For an explanation of typical error messages, please refer to Section 2, “Software”.

5. Don't forget to check if the port numbers used by this application, or the application itself, have been explicitly excluded from using SuperSockets. By default, only the system port numbers below 1024 are excluded from using SuperSockets, but you should verify the current configuration (see Section 2, “SuperSockets Configuration”).

6. If you can't solve the problem, please contact Dolphin Support.

4.8.2. Native SCI Applications

Native SCI applications use the SISCI API to access the Dolphin Express hardware features like transparent remote memory access, DMA transfers or remote interrupts. The SISCI library libsisci.so is installed on the nodes by default.


Note

The SISCI library is only available in the native bit width of a machine. This implies that on 64-bit machines, only 64-bit SISCI applications can be created and executed as there is no 32-bit version of the SISCI library on 64-bit machines.

To compile and link SISCI applications like the MPI implementation NMPI, the SISCI-devel RPM needs to be installed on the respective machine. This RPM is built during installation and placed in the node_RPMS and frontend_RPMS directories, respectively.
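For example, to install it manually from the node_RPMS directory created by the installer; the file name follows the 3.3.0 x86_64 packages shown earlier and must be adjusted to your version and architecture:

# rpm -Uhv node_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm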

4.8.3. Kernel Socket Services

SuperSockets can also be used to accelerate kernel services that communicate via sockets. However, such services need to be adapted to actually use SuperSockets (a minor modification to make them use a different address family when opening new sockets).

If you are interested in accelerated kernel services like iSCSI, GNBD or others, please contact Dolphin Support.


Chapter 4. Update Installation

This chapter describes how an existing Dolphin Express software stack is to be updated to a new version using the SIA. Dolphin Express software supports "rolling upgrades" between all releases unless explicitly noted otherwise in the release notes.

1. Complete Update

As opposed to the initial installation, the update installation can be performed in a fully automatic manner without manual intervention. Therefore, this convenient update method is recommended if you can afford some downtime of the whole cluster. Typically, the update of a 16-node cluster takes about 30 minutes.

A complete update is also required in case of protocol incompatibilities between the installed version and the version to be installed. Such incompatibilities are rare and will be described in the release notes. If this applies, a rolling update is not possible; instead, you need to update the system completely in one operation. This will make Dolphin Express functionality unavailable for the duration of this update.

Proceed as follows to perform the complete update installation:

1. Stop the applications using Dolphin Express on all nodes. This step can be omitted if you choose the --reboot option below.

2. Become superuser on the frontend.

3. Run the SIA on the frontend with any combination of the following options:

--install-all This is the default installation variant and will update all nodes and the frontend.

You can specify --install-node or --install-frontend here to update only the current node or the frontend (you need to execute the SIA on the respective node in these cases!)

--batch Using this option, the script will run without any user interaction, assuming the default answers to all questions which would otherwise be posed to the user. This option can safely be used if no configuration changes are needed, and if you know that all services/applications using Dolphin Express are stopped on the nodes.

--reboot Rebooting the nodes in the course of the installation will avoid any problems when loading the updated drivers. Such problems can occur because the drivers are currently in use, or due to resource problems. This option is recommended.

--enforce By default, packages on a node or the frontend will only be updated if the new package has a more recent version than the installed package. This option will enforce the uninstallation of the installed package, followed by the installation of the new package. This option is recommended if you are unsure about the state of the installation.

As an example, the complete, non-interactive and enforced installation of a specific driver version (provided via the SIA) with a reboot of all nodes will be invoked as follows:

# sh DIS_install_<version>.sh --install-all --batch --reboot --enforce

4. Wait for the SIA to complete. The updated Dolphin Express services will be running on the nodes and the frontend.

2. Rolling Update

A rolling update will keep your cluster and all its services available on all but one node. This kind of update needs to be performed node by node. It requires that you stop all applications which use the Dolphin Express software stack (like a database server using SuperSockets) on the node you intend to update. This means your system needs to tolerate applications going down on a single node.

Before performing a rolling update, please check the release notes of the new version to be installed to see whether it supports a rolling update from the version currently installed. If this is not the case, you need to perform a complete update (see previous section).

Note

It is possible to install the updated files while the applications are still using Dolphin Express services. However, in this case the updated Dolphin Express services will not become active until you restart them (or reboot the machine).

Perform the following steps on each node:

1. Log into the node and become superuser (root).

2. Build the new binary RPM packages for this node:

# sh DIS_install_<version>.sh --build-rpm

The created binary RPM packages will be stored in the subdirectories node_RPMS and frontend_RPMS which will be created in the current working directory.

Tip

To save a lot of time, you can use the binary RPM packages built on the first node that is updated on all other nodes (if they have the same CPU architecture and Linux version). Please see Section 2.3, “Installing from Binary RPMs” for more information.

3. Stop all applications on this node that use Dolphin Express services, like a MySQL server or NDB process.

4. Stop all Dolphin Express services on this node using the dis_services command:

# dis_services stop
Stopping Dolphin SuperSockets drivers [ OK ]
Stopping Dolphin SISCI driver [ OK ]
Stopping Dolphin Node Manager [ OK ]
Stopping Dolphin IRM driver [ OK ]

If you run sciadmin, you will notice that this node will show up as disabled (not active).

Note

The SIA will also try to stop all services when doing an update installation. Performing this step explicitly will just assure that the services can be stopped, and that the applications are shut down properly.

If the services cannot be stopped for some reason, you can still update the node, but you have to reboot it to enable the updated services. See the --reboot option in the next step.

5. Run the SIA with the --install-node --use-rpms <path> options to install the updated RPM packages and start the updated drivers and services. The <path> parameter to the --use-rpms option has to point to the directory where the binary RPM packages have been built (see step 2). If you had run the SIA in /tmp in step 2, you would issue the following command:

# sh DIS_install_<version>.sh --install-node --use-rpms /tmp

Adding the option --reboot will reboot the node after the installation has been successful. A reboot is not required if the services were shut down successfully in step 4, but it is recommended to allow the low-level driver the allocation of sufficient memory resources for remote-memory access communication.


Important

If the services could not be stopped in step 4, a reboot is required to allow the updated drivers to be loaded. Otherwise, the new drivers will only be installed on disk, but will not be loaded and used.

If for some reason you want to re-install the same version, or even an older version of the Dolphin Express software stack than is currently installed, you need to use the --enforce option.

6. The updated services will be started by the installation and are available for use by the applications. Make sure that the node has shown up as active (green) in sciadmin again before updating the next node.

If the services failed to start, a reboot of the node will fix the problem. This can be caused by situations where the memory is too fragmented for the low-level driver (see above).


Chapter 5. Manual Installation

This chapter explains how to manually install the different software packages on the nodes and the frontend, and how to install the software if the native package format of a platform is not supported by the Dolphin installer SIA.

1. Installation under Load

This section describes how to perform the initial Dolphin Express™ installation on a cluster in operation without the requirement to stop the whole cluster from operating.

This type of installation does not require more than one node at a time being offline. Because MySQL Cluster™ is fault-tolerant by default, this will not stop your cluster application. However, performance may suffer to some degree. It will be necessary to power off single nodes in the course of this installation (unless your machines support PCI hotplug - in this case, please contact Dolphin support).

1. Installing the drivers on the nodes

On all nodes, run the SIA with the option --install-node. This is a local operation which will build and install the drivers on the local machine only. As the Dolphin Express hardware is not yet installed, this operation will report errors which can be ignored. Do not reboot the nodes now!

Tip

You can speed up this node installation by re-using binary RPMs that have been built on another node with the same kernel version and the same CPU architecture. To do so, proceed as follows:

1. After the first installation on a node, the binary RPMs are located in the directories node_RPMS and frontend_RPMS, located in the directory where you launched the SIA. Copy these sub-directories to a path that is accessible from the other nodes.

2. When installing on another node with the same Linux kernel version and CPU architecture, use the --use-rpms option to tell SIA where it can find matching RPMs for this node, so it does not have to build them once more.

2. Installing the Dolphin Express hardware

For an installation under load, perform the following steps for each node one by one:

1. Shut down your application processes on the current node.

2. Power off the node, and install the Dolphin Express adapter (see Section 3, “Adapter Card Installation”). Do not yet connect any cables!

3. Power on the node and boot it up. The Dolphin Express drivers should load successfully now, although the SuperSockets™ service will not be configured. Verify this via dis_services:

# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) loaded, but not configured.

4. Stop the SuperSockets™ service:

# service dis_supersockets stop
Stopping Dolphin SuperSockets drivers [ OK ]

5. Start all your own applications on the current node and make sure the whole cluster operates normally.

6. Proceed with the next node until all nodes have the Dolphin Express hardware and software installed.


3. Creating the cluster configuration files

If you have a Linux machine with X available which can run GUI applications, run the SIA with the --install-editor option to install the tool dishostseditor. Ideally, this step is performed on the frontend. If this is the case, you should create the directory /etc/dis and make it writable for root:

# mkdir /etc/dis
# chmod 755 /etc/dis

After the SIA has completed the installation, start the tool dishostseditor (default installation location is /opt/DIS/sbin):

# /opt/DIS/sbin/dishostseditor

Information on how to work with this tool can be found in Section 4.3, “Working with the dishostseditor”. Make sure you create the cabling instructions needed in the next step.

If the dishostseditor was run as root on the frontend, proceed with the next step. Otherwise, copy the configuration files dishosts.conf and networkmanager.conf which you have just created to the frontend and place them there under /etc/dis (you may need to create this directory, see above).

4. Cable Installation

Using the cabling instructions created by dishostseditor in the previous step, the interconnect cables should now be connected (see Section 4.4, “Cluster Cabling”).

5. On the frontend machine, run the SIA with the --install-frontend option. This will start the network manager, which will then configure the whole cluster according to the configuration files created in the previous steps.

6. Start all services on all the nodes:

# dis_services start
Starting Dolphin IRM 3.3.0 ( November 13th 2007 ) [ OK ]
Starting Dolphin Node Manager [ OK ]
Starting Dolphin SISCI 3.3.0 ( November 13th 2007 ) [ OK ]
Starting Dolphin SuperSockets drivers [ OK ]

7. Verify the functionality and performance according to Section 1, “Verifying Functionality and Performance”.

8. At this point, Dolphin Express™ and SuperSockets™ are ready to use, but your application is still running on Ethernet. To make your application use SuperSockets™, you need to perform the following steps on each node one-by-one:

1. Shut down your application processes on the current node.

2. Refer to Section 4.8, “Making Cluster Application use Dolphin Express” to determine the best way to have your application use SuperSockets™. Typically, this can be achieved by simply starting the process via the dis_ssocks_run wrapper script (located in /opt/DIS/bin by default), like:

$ dis_ssocks_run mysqld_safe

3. Start all your own applications on the current node and make sure the whole cluster operates normally. Because SuperSockets™ fall back to Ethernet transparently, your applications will start up normally regardless of whether applications on the other nodes are already using SuperSockets™ or not.

After you have performed these steps on all nodes, all applications that have been started accordingly will now communicate via SuperSockets™.

Note

This single-node installation mode will not adapt the driver configuration dis_irm.conf to optimally fit your cluster. This might be necessary for clusters with more than 4 nodes. Please refer to Section 3.1, “dis_irm.conf” to perform recommended changes, or contact Dolphin support.


2. Installation of a Heterogeneous Cluster

This section describes how to perform the initial Dolphin Express™ installation on a cluster with heterogeneous nodes (different CPU architecture, different operating system version (e.g. Linux kernel version), or even different operating systems).

Note

This single-node installation mode will not adapt the driver configuration dis_irm.conf to optimally fit your cluster. This might be necessary for clusters with more than 4 nodes. Please refer to Section 3.1, “dis_irm.conf” to perform recommended changes, or contact Dolphin support.

1. Installing the Dolphin Express hardware

Power off all nodes, and install the Dolphin Express adapter (see Section 3, “Adapter Card Installation”). Do not yet connect any cables!

Then, power up all nodes again.

2. Installing the drivers on the nodes

1. On all nodes, run the SIA with the option --install-node. This is a local operation which will build and install the drivers on the local machine only.

Tip

You can speed up this node installation by re-using binary RPMs that have been built on another node with the same kernel version and the same CPU architecture. To do so, proceed as follows:

1. After the first installation on a node, the binary RPMs are located in the directories node_RPMS and frontend_RPMS, located in the directory where you launched the SIA. Copy these sub-directories to a path that is accessible from the other nodes.

2. When installing on another node with the same Linux kernel version and CPU architecture, use the --use-rpms option to tell SIA where it can find matching RPMs for this node, so it does not have to build them once more.

2. The Dolphin Express drivers should load successfully now, although the SuperSockets™ service will not be configured. Verify this via dis_services:

# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) loaded, but not configured.

3. Stop the SuperSockets™ service:

# service dis_supersockets stop
Stopping Dolphin SuperSockets drivers [ OK ]

3. Creating the cluster configuration files

If you have a Linux machine with X available which can run GUI applications, run the SIA with the --install-editor option to install the tool dishostseditor. Ideally, this step is performed on the frontend. If this is the case, you should create the directory /etc/dis and make it writable for root:

# mkdir /etc/dis
# chmod 755 /etc/dis


After the SIA has completed the installation, start the tool dishostseditor (default installation location is /opt/DIS/sbin):

# /opt/DIS/sbin/dishostseditor

Information on how to work with this tool can be found in Section 4.3, “Working with the dishostseditor”. Make sure you create the cabling instructions needed in the next step.

If the dishostseditor was run as root on the frontend, proceed with the next step. Otherwise, copy the configuration files dishosts.conf and networkmanager.conf which you have just created to the frontend and place them there under /etc/dis (you may need to create this directory).

4. Cable Installation

Using the cabling instructions created by dishostseditor in the previous step, the interconnect cables should now be connected (see Section 4.4, “Cluster Cabling”).

5. On the frontend machine, run the SIA with the --install-frontend option. This will start the network manager, which will then configure the whole cluster according to the configuration files created in the previous steps.

6. Start all services on all the nodes:

# dis_services start
Starting Dolphin IRM 3.3.0 ( November 13th 2007 ) [ OK ]
Starting Dolphin Node Manager [ OK ]
Starting Dolphin SISCI 3.3.0 ( November 13th 2007 ) [ OK ]
Starting Dolphin SuperSockets drivers [ OK ]

7. Verify the functionality and performance according to Section 1, “Verifying Functionality and Performance”.

3. Manual RPM Installation

It is of course possible to manually install the RPM packages on the nodes and the frontend. This section describes how to do this if it should be necessary.

3.1. RPM Package Structure

The Dolphin Express™ software stack is organized into a number of RPM packages. Some of these packages have inter-dependencies.

Dolphin-SCI Low-level hardware driver for the adapter. Installs the dis_irm kernel module, the node manager daemon and the dis_irm and dis_nodemgr services on a node.

To be installed on all nodes.

Dolphin-SISCI User-level access to the adapter capabilities via the SISCI API. Installs the dis_sisci kernel module and the dis_sisci service, plus the run-time library and header files for the SISCI API on a node.

To be installed on all nodes. Depends on Dolphin-SCI.

Dolphin-SuperSockets Standard Berkeley sockets with low latency and high bandwidth. Installs the dis_mbox, dis_msq and dis_ssocks kernel modules and the dis_supersockets service, and the redirection library for preloading on a node.

To be installed on all nodes. Depends on Dolphin-SCI.

Dolphin-NetworkHosts Installs the GUI application dishostseditor for creating the cluster configuration files on the frontend. Also installs some template configuration files for manual editing.


To be installed on the frontend (and additionally other machines that should run dishostseditor).

Dolphin-NetworkManager Contains the network manager on the frontend, which talks to all node managers on the nodes. Installs the service dis_networkmgr.

To be installed on the frontend. Depends on Dolphin-NetworkHosts.

Dolphin-NetworkAdmin Contains the GUI application sciadmin for managing and monitoring the interconnect. sciadmin talks to the network manager and can be installed on any machine that has a connection to the frontend.

To be installed on the frontend (or any other machine).

Dolphin-SISCI-devel To compile and link applications that use the SISCI API on other machines than the nodes, this RPM installs the header files and library plus examples and documentation on any machine.

To be installed on the frontend, or any other machine on which SISCI applications should be compiled and linked.

3.2. RPM Build and Installation

On each machine, the matching binary RPM packages need to be built by calling the SIA with the --build-rpm option (an example invocation follows the directory list below). This will take some minutes, and the resulting RPMs are stored in three directories:

node_RPMS Contains the binary RPM packages for the driver and kernel modules to be installed on each node. These RPM packages can be installed on every node with the same kernel version.

frontend_RPMS Contains the binary RPM packages for the user-level managing software to be installed on the frontend. The Dolphin-SISCI-devel and Dolphin-NetworkAdmin packages can also be installed on other machines that are neither the frontend nor one of the nodes, for development and administration, respectively.

source_RPMS The source RPM packages contained in this directory can be used to build binary RPMs on other machines using the standard rpmbuild command.
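An example SIA invocation to build the packages, using the SIA file name pattern shown elsewhere in this guide:

# sh DIS_install_<version>.sh --build-rpm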

To install the packages from one directory, just enter the directory and install them all with a single call of the rpm command, like:

# cd node_RPMS
# rpm -Uhv *.rpm

4. Unpackaged Installation

Not all target operating systems are supported with native software packages. In this case, a non-package based installation via a tar archive is supported. This type of installation will build all software for both node and frontend, and install it to a path that you specify. From there, you have to perform the actual driver and service installation using scripts provided with the installation.

This type of installation installs the complete software into a directory on the local machine. Depending on whether this machine will be a node or the frontend, you have to install different drivers or services from there. To install using this method, please proceed as follows:

1. Become superuser:

$ su -
#

2. Create the tar archive from the SIA, and unpack it:


# sh DIS_install_<version>.sh --get-tarball
#* Logfile is /tmp/DIS_install.log_260 on node1

#*
#+ Dolphin ICS - Software installation (version: 1.31 $ of: 2007/09/27 15:05:05 $)
#+

#* Generating tarball distribution of the source code

#* NOTE: source tarball is /tmp/DIS.tar.gz

# tar xzf DIS.tar.gz

3. Enter the created directory and configure the build system, specifying the target path <install_path> for the installation. We recommend that you use the standard path /opt/DIS, but you can use any other path. The installation procedure will create subdirectories (like bin, sbin, lib, lib64, doc, man, etc) relative to this path and install into them.

# cd DIS
# ./configure --prefix=/opt/DIS

4. Build the software stack using make. Check the output when the command returns to see if the build operation was successful.

# make
...
# make supersockets
...

5. If the build operations were successful, install the software:

# make install
...
# make supersockets-install
...

Tip

You can speed up the installation on multiple nodes if you copy over the installation directory to the other nodes, provided they feature the same Linux kernel version and CPU architecture. The best way is to create a tar archive:

# cd /opt
# tar czf DIS_binary.tar.gz DIS

Transfer this file to /opt on all nodes and unpack it there:

# cd /opt
# tar xzf DIS_binary.tar.gz

6. Install the drivers and services depending on whether the local machine should be a node or the frontend. It is recommended to first install all nodes, then the frontend, then configure and test the cluster from the frontend.

For a node, install the necessary drivers and services as follows:

1. Change to the sbin directory in your installation path:

# cd /opt/DIS/sbin

2. Invoke the scripts for driver installation using the option -i. The option --start will start the service after a successful installation:

# ./irm_setup -i --start
# ./nodemgr_setup -i --start
# ./sisci_setup -i --start
# ./ssocks_setup -i


Note

Please make sure that SuperSockets are not started yet (do not provide option --start to the setup script).

You can remove the driver from the system by calling the script with the option -e. Help is available via -h.

Repeat this procedure for each node.

For the frontend, install the necessary services and perform the cluster configuration and test as follows:

1. Change to the sbin directory in your installation path:

# cd /opt/DIS/sbin

2. Configure the cluster via the GUI tool dishostseditor:

# ./dishostseditor

For more information on using dishostseditor, please refer to Section 4.3, “Working with the dishostseditor”.

3. Invoke the script for service installation using the option -i:

# ./networkmgr_setup -i --start

You can remove the service from the system by calling the script with the option -e.

4. Test the cluster via the GUI tool sciadmin:

# ./sciadmin

For more information on using sciadmin to test your cluster installation, please refer to Appendix B, sciadmin Reference and Section 1, “Verifying Functionality and Performance”.

5. Enable all services, including SuperSockets, on all nodes.

# dis_services start
Starting Dolphin IRM 3.3.0 ( November 13th 2007 )      [ OK ]
Starting Dolphin Node Manager                          [ OK ]
Starting Dolphin SISCI 3.3.0 ( November 13th 2007 )    [ OK ]
Starting Dolphin SuperSockets drivers                  [ OK ]

Note

This command has to be executed on the nodes, not only on the frontend!


Chapter 6. Interconnect and Software Maintenance

This chapter describes how to perform a number of typical tasks related to the Dolphin Express interconnect.

1. Verifying Functionality and Performance

When installing the Dolphin Express™ software stack (which includes SuperSockets™) via the SIA, the basic functionality and performance is verified at the end of the installation process by some of the same tests that are described in the following sections. This means that if the tests performed by the SIA did not report any errors, it is very likely that both the software and the hardware work correctly.

Nevertheless, the following sections describe the tests that allow you to verify the functionality and performance of your Dolphin Express™ interconnect and software stack. The tests go from the most low-level functionality up to running socket applications via SuperSockets™.

1.1. Low-level Functionality and Performance

The following sections describe how to verify that the interconnect is set up correctly, which means that all nodes can communicate with all other nodes via the Dolphin Express interconnect by sending low-level control packets and performing remote memory access.

1.1.1. Availability of Drivers and Services

Without the required drivers and services running on all nodes and the frontend, the cluster will fail to operate. On the nodes, the kernel services dis_irm (low-level hardware driver), dis_sisci (upper-level hardware services) and dis_ssocks (SuperSockets) need to be running. In addition to these kernel drivers, the user-space service dis_nodemgr (node manager, which talks to the central network manager) needs to be active for configuration and monitoring. On the frontend, only the user-space service dis_networkmgr (the central network manager) needs to be running.

Because the drivers also appear as services, you can query their status with the usual tools of the installed operating system distribution. For example, for Red Hat-based Linux distributions, you can do

# service dis_irm status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.

Dolphin provides a script dis_services that performs this task for all Dolphin services installed on a machine. It is used in the same way as the individual service command provided by the distribution:

# dis_services status
Dolphin IRM 3.3.0 ( November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 ( November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) running.

If any of the required services is not running, you will find more information on the problem that may have occurred in the system log facilities. Call dmesg to inspect the kernel messages, or check /var/log/messages for related messages.

1.1.2. Cable Connection Test

To ensure that the cluster is cabled correctly, please perform the cable connection test as described in Section 4.7.4, “Cabling Correctness Test”.

1.1.3. Static Interconnect Test

The static interconnect test makes sure that all adapters are working correctly by performing a self-test, and determines if the setup of the routing in the adapters is correct (matches the actual hardware topology). It will also check if all cables are plugged in to the adapters, but this has already been done in the Cable Connection Test. The tool to perform this test is scidiag (default location /opt/DIS/sbin/scidiag).


Running scidiag on a node will perform a self test on the local adapter(s) and list all remote adapters that this adapter can see via the Dolphin Express interconnect. This means, to perform the static interconnect test on a full cluster, you will basically need to run scidiag on each node and see if any problems with the adapter are reported, and if the adapters in each node can see all remote adapters installed in the other nodes. An example output of scidiag for a node which is part of a 9-node cluster configured in a 3 by 3 2D torus, and using one adapter per node, looks like this:

===========================================================================
 SCI diagnostic tool -- SciDiag version 3.2.6d ( September 6th 2007 )
===========================================================================

******************** VARIOUS INFORMATION ********************

Scidiag compiled in 64 bit mode
Driver : Dolphin IRM 3.2.6d ( September 6th 2007 )
Date   : Thu Oct 4 14:20:45 CEST 2007
System : Linux tiger-9 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Number of configured local adapters found: 1

Hostbridge : NVIDIA nForce 570 - MCP55 , 0x37610de

Local adapter 0 >
  Type                : D352
  NodeId(log)         : 140
  NodeId(phys)        : 0x2204
  SerialNum           : 200284
  PSB Version         : 0x0d66706d
  LC Version          : 0x1066606d
  PLD Firmware        : 0x0001
  SCI Link frequency  : 166 MHz
  B-Link frequency    : 80 MHz
  Card Revision       : CD
  Switch Type         : not present
  Topology Type       : 2D Torus
  Topology Autodetect : No

OK: Psb chip alive in adapter 0.
SCI Link 0 - uptime 11356 seconds
SCI Link 0 - downtime 0 seconds
SCI Link 1 - uptime 11356 seconds
SCI Link 1 - downtime 0 seconds
OK: Cable insertion ok.
OK: Probe of local node ok.
OK: Link alive in adapter 0.
OK: SRAM test ok for Adapter 0
OK: LC-3 chip accessible from blink in adapter 0.
==> Local adapter 0 ok.

******************** TOPOLOGY SEEN FROM ADAPTER 0 ********************

Adapters found: 9   Switch ports found: 0
----- List of all ranges (rings) found:
In range 0: 0004 0008 0012
In range 1: 0068 0072 0076
In range 2: 0132 0136 0140

REMOTE NODE INFO SEEN FROM ADAPTER 0
 Log    | Phys   | resp     | resp    | resp | resp | req     |
 nodeId | nodeId | conflict | address | type | data | timeout | TOTAL
      4 | 0x0004 |        0 |       0 |    0 |    0 |       4 |     4
      8 | 0x0104 |        0 |       0 |    0 |    0 |       1 |     1
     12 | 0x0204 |        0 |       0 |    0 |    0 |       0 |     0
     68 | 0x1004 |        0 |       0 |    0 |    0 |       2 |     2
     72 | 0x1104 |        0 |       0 |    0 |    0 |       0 |     0
     76 | 0x1204 |        0 |       0 |    0 |    0 |       0 |     0
    132 | 0x2004 |        0 |       0 |    0 |    0 |       1 |     1
    136 | 0x2104 |        0 |       0 |    0 |    0 |       1 |     1
    140 | 0x2200 |        0 |       0 |    0 |    0 |       0 |     0
----------------------------------


scidiag discovered 0 note(s).
scidiag discovered 0 warning(s).
scidiag discovered 0 error(s).
TEST RESULT: *PASSED*

The static interconnect test passes if scidiag delivers TEST RESULT: *PASSED* and reports the same topology (remote adapters) on all nodes. More information on running scidiag is provided in ???, where you will also find hints on what to do if scidiag reports warnings or errors, or reports a different topology on different nodes.

1.1.4. Interconnect Load Test

While the static interconnect test sends only a few packets over the links to probe remote adapters, the Interconnect Load Test puts significant stress on the interconnect and observes if any data transmissions have to be retried due to link errors. This can happen if cables are not correctly connected, e.g. plugged in without the screws being tightened. Before running this test, make sure your cluster is cabled and configured correctly by running the tests described in the previous sections.

1.1.4.1. Test Execution from sciadmin GUI

This test can be performed from within the sciadmin GUI tool. Please refer to Appendix B, sciadmin Reference for details.

1.1.4.2. Test Execution from Command Line

To run this test from the command line, simply invoke sciconntest (default location /opt/DIS/bin/sciconntest) on all nodes.

Note

It is recommended to run this test from the sciadmin GUI (see previous section) because it will perform a more controlled variant of this test and give more helpful results.

All instances of sciconntest will connect and start to exchange data, which can take up to 30 seconds. The output of sciconntest on one node which is part of a 9-node cluster looks like this:

/opt/DIS/bin/sciconntest compiled Oct 2 2007 : 22:29:09
----------------------------
Local node-id     : 76
Local adapter no. : 0
Segment size      : 8192
MinSize           : 4
Time to run (sec) : 10
Idelay            : 0
No Write          : 0
Loopdelay         : 0
Delay             : 0
Bad               : 0
Check             : 0
Mcheck            : 0
Max nodes         : 256
rnl               : 0
Callbacks         : Yes
----------------------------
Probing all nodes
Response from remote node 4
Response from remote node 8
Response from remote node 12
Response from remote node 68
Response from remote node 72
Response from remote node 132
Response from remote node 136
Response from remote node 140
Local segment (id=4, size=8192) is created.
Local segment (id=4, size=8192) is shared.
Local segment (id=8, size=8192) is created.
Local segment (id=8, size=8192) is shared.


Local segment (id=12, size=8192) is created.
Local segment (id=12, size=8192) is shared.
Local segment (id=68, size=8192) is created.
Local segment (id=68, size=8192) is shared.
Local segment (id=72, size=8192) is created.
Local segment (id=72, size=8192) is shared.
Local segment (id=132, size=8192) is created.
Local segment (id=132, size=8192) is shared.
Local segment (id=136, size=8192) is created.
Local segment (id=136, size=8192) is shared.
Local segment (id=140, size=8192) is created.
Local segment (id=140, size=8192) is shared.
Connecting to 8 nodes
Connect to remote segment, node 4
Remote segment on node 4 is connected.
Connect to remote segment, node 8
Remote segment on node 8 is connected.
Connect to remote segment, node 12
Remote segment on node 12 is connected.
Connect to remote segment, node 68
Remote segment on node 68 is connected.
Connect to remote segment, node 72
Remote segment on node 72 is connected.
Connect to remote segment, node 132
Remote segment on node 132 is connected.
Connect to remote segment, node 136
Remote segment on node 136 is connected.
Connect to remote segment, node 140
Remote segment on node 140 is connected.
SCICONNTEST_REPORT
NUM_TESTLOOPS_EXECUTED 1
NUM_NODES_FOUND 8
NUM_ERRORS_DETECTED 0
node 4 : Found
node 4 : Number of failiures : 0
node 4 : Longest failiure : 0.00 (ms)
node 8 : Found
node 8 : Number of failiures : 0
node 8 : Longest failiure : 0.00 (ms)
node 12 : Found
node 12 : Number of failiures : 0
node 12 : Longest failiure : 0.00 (ms)
node 68 : Found
node 68 : Number of failiures : 0
node 68 : Longest failiure : 0.00 (ms)
node 72 : Found
node 72 : Number of failiures : 0
node 72 : Longest failiure : 0.00 (ms)
node 132 : Found
node 132 : Number of failiures : 0
node 132 : Longest failiure : 0.00 (ms)
node 136 : Found
node 136 : Number of failiures : 0
node 136 : Longest failiure : 0.00 (ms)
node 140 : Found
node 140 : Number of failiures : 0
node 140 : Longest failiure : 0.00 (ms)
SCICONNTEST_REPORT_END

SCI_CB_DISCONNECT:Segment removed on the other node disconnecting.....

The test passes if all nodes report 0 failures for all remote nodes. If the test reports failures, you can determine the closest pair(s) of nodes for which these failures are reported and check the cabled connection between them. The numerical node identifiers shown in this output are the node ID numbers of the adapters (which identify an adapter in the Dolphin Express™ interconnect).

This test can be run while a system is in production, but you have to take into account that the performance of the production applications will be reduced significantly while the test is running. If links actually show problems, they might be temporarily disabled, stopping all communication until rerouting takes place.


1.1.5. Interconnect Performance Test

Once the correct installation and setup and the basic functionality of the interconnect have been verified, it is possible to perform a set of low-level benchmarks to determine the baseline performance of the interconnect without any additional software layers. The tests that are relevant for this are scibench2 (streaming remote memory PIO access performance), scipp (request-response remote memory PIO write performance), dma_bench (streaming remote memory DMA access performance) and intr_bench (remote interrupt performance).

All these tests need to run on two nodes (A and B) and are started in the same manner:

1. Determine the Dolphin Express node id of both nodes using the query command (default path /opt/DIS/bin/query). The Dolphin Express node id is reported as "Local node-id".
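For example, a simple way to pick out the relevant line, assuming the default installation path:

$ /opt/DIS/bin/query | grep "Local node-id"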

2. On node A, start the server-side benchmark with the options -server and -rn <node id of B>, like:

$ scibench2 -server -rn 8

3. On node B, start the client-side benchmark with the options -client and -rn <node id of A>, like:

$ scibench2 -client -rn 4

4. The test results are reported by the client.

To simply gather all relevant low-level performance data, the script sisci_benchmarks.sh can be called in the same way. It will run all of the described tests.
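For example, following the same server/client pattern and the example node ids used above (adjust the ids to match your nodes):

On node A: $ sisci_benchmarks.sh -server -rn 8
On node B: $ sisci_benchmarks.sh -client -rn 4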

For the D33x and D35x series of Dolphin Express adapters, the following results can be expected for each test using a single adapter:

scibench2 minimal latency to write 4 bytes to remote memory: 0.2µs

maximal bandwidth for streaming writes to remote memory: 340 MB/s

---------------------------------------------------------------
Segment Size:      Average Send Latency:      Throughput:
---------------------------------------------------------------
        4                 0.20 us               19.72 MBytes/s
        8                 0.20 us               40.44 MBytes/s
       16                 0.20 us               80.89 MBytes/s
       32                 0.39 us               81.09 MBytes/s
       64                 0.25 us              254.22 MBytes/s
      128                 0.37 us              348.17 MBytes/s
      256                 0.74 us              344.89 MBytes/s
      512                 1.49 us              343.05 MBytes/s
     1024                 3.00 us              341.90 MBytes/s
     2048                 6.00 us              341.39 MBytes/s
     4096                12.00 us              341.45 MBytes/s
     8192                24.00 us              341.32 MBytes/s
    16384                48.04 us              341.03 MBytes/s
    32768                96.03 us              341.24 MBytes/s
    65536               192.56 us              340.33 MBytes/s

scipp The minimal round-trip latency for writing to remote memory should be below 4µs. The average number of retries is not a performance metric and can vary from run to run.

Ping Pong round trip latency for 0 bytes, average retries= 1292     3.69 us
Ping Pong round trip latency for 4 bytes, average retries= 365      3.94 us
Ping Pong round trip latency for 8 bytes, average retries= 359      3.98 us
Ping Pong round trip latency for 16 bytes, average retries= 357     4.01 us
Ping Pong round trip latency for 32 bytes, average retries= 4       4.58 us
Ping Pong round trip latency for 64 bytes, average retries= 346     4.30 us
Ping Pong round trip latency for 128 bytes, average retries= 871    6.26 us
Ping Pong round trip latency for 256 bytes, average retries= 832    6.49 us
Ping Pong round trip latency for 512 bytes, average retries= 1072   7.99 us
Ping Pong round trip latency for 1024 bytes, average retries= 1643  10.99 us
Ping Pong round trip latency for 2048 bytes, average retries= 2738  17.00 us


Ping Pong round trip latency for 4096 bytes, average retries= 4974  29.00 us
Ping Pong round trip latency for 8192 bytes, average retries= 9401  53.06 us

intr_bench The interrupt latency is the only performance metric of these tests that is affected by the operating system, which always handles the interrupts, and can therefore vary. The following numbers have been measured with RHEL 4 (Linux Kernel 2.6.9):

Average unidirectional interrupt time : 7.665 us.
Average round trip interrupt time     : 15.330 us.

dma_bench The typical DMA bandwidth achieved for 64kB transfers is 240MB/s, while the maximum bandwidth (for larger blocks) is at about 250MB/s:

   64     19.63 us      3.26 MBytes/s
  128     19.69 us      6.50 MBytes/s
  256     20.36 us     12.57 MBytes/s
  512     21.08 us     24.29 MBytes/s
 1024     23.25 us     44.05 MBytes/s
 2048     26.80 us     76.42 MBytes/s
 4096     34.60 us    118.40 MBytes/s
 8192     50.30 us    162.85 MBytes/s
16384     81.74 us    200.43 MBytes/s
32768    144.73 us    226.41 MBytes/s
65536    270.82 us    241.99 MBytes/s

1.2. SuperSockets Functionality and Performance

This section describes how to verify that SuperSockets are working correctly on a cluster.


1.2.1. SuperSockets Status

The general status of SuperSockets can be retrieved via the SuperSockets init script that controls the service dis_supersockets. On Red Hat systems, this can be done like

# service dis_supersockets status

which should show a status of running. If the status shown here is loaded, but not configured, it means that the SuperSockets configuration failed for some reason. Typically, it means that a configuration file could not be parsed correctly. The configuration can be performed manually like

# /opt/DIS/sbin/dis_ssocks_cfg

If this indicates that a configuration file is corrupted, you can verify the files according to the reference in Section 2, “SuperSockets Configuration”. At any time, you can re-create dishosts.conf using the dishostseditor and restore modified SuperSockets configuration files (supersockets_ports.conf and supersockets_profiles.conf) from the default versions that have been installed in /opt/DIS/etc/dis.
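For example, restoring the shipped defaults could look like this; the assumption that the active configuration lives in /etc/dis matches the rest of this guide, but adjust the paths if your installation differs:

# cp /opt/DIS/etc/dis/supersockets_ports.conf /etc/dis/
# cp /opt/DIS/etc/dis/supersockets_profiles.conf /etc/dis/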

Once the status of SuperSockets is running, you can verify their actual configuration via the files in /proc/net/af_sci. Here, the file socket_maps shows you which IP addresses (or network masks) the local node's SuperSockets know about. This file should be non-empty and identical on all nodes in the cluster.
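For example, to inspect this file on a node:

$ cat /proc/net/af_sci/socket_maps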

1.2.2. SuperSockets Functionality

A benchmark that can be used to validate the functionality and performance of SuperSockets is installed as /opt/DIS/bin/socket/sockperf. The basic usage requires two machines (n1 and n2). Start the server process on node n1 without any parameters:

$ sockperf

On node n2, run the client side of the benchmark like:

$ sockperf -h n1


The output for a working setup should look like this:

# sockperf 1.35 - test stream socket performance and system impact
# LD_PRELOAD: libksupersockets.so
# address family: 2
# client node: n2  server nodes: n1
# sockets per process: 1 - pattern: sequential
# wait for data: blocking recv()
# send mode: blocking
# client/server pairs: 1 (running on 2 cores)
# socket options: nodelay 1
# communication pattern: PINGPONG (back-and-forth)
# bytes   loops  avg_RTT/2[us]  min_RTT/2[us]  max_RTT/2[us]    msg/s    MB/s
      1    1000        4.26           3.67          18.67       117247    0.12
      4    1000        4.16           3.87          11.32       120177    0.48
      8    1000        4.31           4.17          11.81       115889    0.93
     12    1000        4.29           4.17           9.08       116537    1.40
     16    1000        4.29           4.16          10.17       116468    1.86
     24    1000        4.30           4.18           7.16       116251    2.79
     32    1000        4.38           4.21          44.20       114233    3.66
     48    1000        4.50           4.24         102.91       111112    5.33
     64    1000        5.28           5.16           7.54        94687    6.06
     80    1000        5.37           5.20          11.08        93170    7.45
     96    1000        5.41           5.20          11.29        92473    8.88
    112    1000        5.53           5.27          11.04        90400   10.12
    128    1000        5.74           5.59          11.96        87033   11.14
    160    1000        5.85           5.68          10.65        85411   13.67
    192    1000        6.30           6.01          11.24        79383   15.24
    224    1000        6.47           6.20          80.47        77291   17.31
    256    1000        6.82           6.55          17.41        73314   18.77
    512    1000        8.37           8.05          14.52        59766   30.60
   1024    1000       11.69          11.38          17.66        42764   43.79
   2048    1000       15.25          14.90          59.72        32792   67.16
   4096    1000       22.40          22.03          33.08        22318   91.41
   8192     512       47.19          46.39          52.45        10596   86.80
  16384     256       72.87          72.20          78.05         6862  112.43
  32768     128      124.56         123.52         132.97         4014  131.54
  65536      64      225.73         224.68         230.26         2215  145.17

The latency in this example starts around 4µs. Recent machines deliver latencies below 3µs, and on older machines, the latency may be higher. Latencies above 10µs indicate a problem; typical Ethernet latencies start at 20µs and more.

In case of latencies being too high, please verify if SuperSockets are running and configured as described in the previous section. Also, verify that the environment variable LD_PRELOAD is set to libksupersockets.so. This is reported for the client in the second line of the output (see above), but LD_PRELOAD also needs to be set correctly on the server side. See Section 4.8, “Making Cluster Application use Dolphin Express” for more information on how to make generic socket applications (like sockperf) use SuperSockets.
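For example, both sides of the benchmark can be started with the preload library set explicitly for just this test (the library name is the one reported in the sockperf output above):

On node n1: $ LD_PRELOAD=libksupersockets.so sockperf
On node n2: $ LD_PRELOAD=libksupersockets.so sockperf -h n1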

1.3. SuperSockets Utilization

To verify if and how SuperSockets are used on a node in operation, the file /proc/net/af_sci/stats can be used:

$ cat /proc/net/af_sci/stats
STREAM sockets: 0
DGRAM sockets: 0
TX connections: 0
RX connections: 0
Extended statistics are disabled.

The first line shows the number of open TCP (STREAM) and UDP (DGRAM) sockets that are using SuperSockets.

For more detailed information, the extended statistics need to be enabled. Only the root user can do this:

# echo enable >/proc/net/af_sci/stats


With enabled statistics, /proc/net/af_sci/stats will display a message size histogram (next to some internal information). When looking at this histogram, please keep in mind that the listed receive sizes (RX) may be incorrect, as they refer to the maximal number of bytes that a process wanted to recv when calling the related socket function. Many applications use larger buffers than actually required. Thus, only the send (TX) values are reliable.

To observe the current throughput on all SuperSockets-driven sockets, the tool dis_ssocks_stat can be used. Supported options are:

-d Delay in seconds between measurements. This will cause dis_ssocks_stat to loop until interrupted.

-t Print time stamp next to measurement point.

-w Print all output to a single line.

-h Show available options.

Example:

# dis_ssocks_stat -d 1 -t
(1 s) RX: 162.82 MB/s TX: 165.43 MB/s ( 0 B/s 0 B/s ) Mon Nov 12 17:59:33 CET 2007
(1 s) RX: 149.83 MB/s TX: 168.65 MB/s ( 0 B/s 0 B/s ) Mon Nov 12 17:59:34 CET 2007
...

The first two pairs show the receive (RX) and send (TX) throughput via Dolphin Express of all sockets. The number pair in parentheses shows the throughput of sockets that are operated by SuperSockets, but are currently in fallback (Ethernet) mode. Typically, there will be no fallback traffic.

2. Replacing SCI Cables

The Dolphin Express interconnect is fully hot-pluggable. If one or more cables need to be replaced (or just need to be disconnected and reconnected again for some reason), you can do this while all nodes are up and running. However, if the cluster is in production, you should proceed cable by cable, and not disconnect all affected cables at once, to ensure continued operation without significant performance degradation.

To replace a single SCI cable, proceed as follows:

1. Disconnect the cable at both ends. The LEDs on the affected adapters will turn yellow for this link, and the link will show up as disabled in the sciadmin GUI.

2. Properly (re)connect the (replacement) cable. Observe that the LEDs on the adapters within the ringlet light green again after the cable is connected. The link will show up as enabled in the sciadmin GUI.

3. To verify that the new cable is working properly, two alternative procedures can be performed using the sciadmin GUI:

• Run the Cluster Test from within sciadmin. Note that this test will stop all other communication on the interconnect while it is running.

• If running the Cluster Test is not an option, you can check for potential transfer errors as follows:

a. Perform Scidiag Clear to reset the statistics of both nodes connected to the cable that has been replaced.

b. Operate the nodes under normal load for some minutes.

c. Perform Scidiag -V 1 on both nodes and verify if any error counters have increased, especially the CRC error counter.

If any of the verifications did report errors, make sure that the cable is plugged in cleanly, and that the screws are secured. If the error should persist, swap the position of the cable with another one that is known to be working, and observe if the problem moves with the cable. If it does, the cable is likely to be bad. Otherwise, one of the adapters might have a problem.


3. Replacing a PCI-SCI Adapter

In case an adapter needs to be replaced, proceed as follows:

1. Power down the node. The network manager will automatically reroute the interconnect traffic. When you run sciadmin on the frontend, you will see the icon of the node turn red within the GUI representation of the cluster nodes.

2. Unplug all cables from the adapter. Remember (or mark the cables) into which plug on the adapter each cable belongs.

3. Remove the old adapter that is to be replaced, and insert the new one into the node.

4. Power up the node; then connect the SCI cables in the same way they had been connected before. Make sure that all LEDs on all adapters in the affected ringlets light green again.

5. To verify the installation of the new adapter:

1. The icon of the node in the sciadmin GUI must have turned green again.

2. The output of the dis_services script should list all services as running.

3. Run Scidiag -V 1 for this adapter from sciadmin. No errors should be reported, and all nodes should be visible to this adapter, as illustrated in this example for a 4-node cluster:

Remote

6. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, “Cabling Correctness Test”).

Warning

Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the command line to verify the functionality of the interconnect (see Section 1.1.3, “Static Interconnect Test”).

Communication between all other nodes will continue uninterrupted during this procedure.

4. Physically Moving Nodes

If you move a node physically without changing the cabling, it is obvious that no configuration changes are necessary.

If however you have to exchange two nodes, or for other reasons place a node which has been part of the cluster at another position within the interconnect that requires the node to be connected to other nodes than before, then this change has to be reflected in the cluster configuration as well. This can easily be done using either dishostseditor or a plain text editor.

Please proceed as follows:

1. Power down all nodes that are to be moved. The network manager will automatically reroute the interconnect traffic. When you run sciadmin on the frontend, you will see the icon of the node turn red within the GUI representation of the cluster nodes.

Warning

Powering down more than one node will make other nodes inaccessible via SCI if the powered-down nodes are not located within one ringlet.

2. Move the nodes to the new location and connect the cables to the adapters. Do not yet power them up!

Page 53: DIS Install Guide Book

Interconnect and Soft-ware Maintenance

46

3. Update the cluster configuration file /etc/dis/dishosts.conf on the frontend by either using dishostseditor or a plain text editor:

• If using dishostseditor, load the original configuration (by running dishostseditor on the frontend, it will be loaded automatically) and change the positions of the nodes within the torus. Save the new configuration.

• When using a plain text editor, exchange the hostnames of the nodes in this file. You can also change the adapter and socket names accordingly (which typically contain the hostnames), but this will not affect functionality.

4. Restart the network manager on the frontend to make the changed configuration effective.

5. Power up the nodes. Once they come up, their configuration will be changed by the network manager to reflect their new position within the interconnect.

5. Replacing a Node

In case a node needs to be replaced, proceed as follows concerning the SCI interconnect:

1. Power down the node. The network manager will automatically reroute the interconnect traffic. When you run sciadmin on the frontend, you will see the icon of the node turn red within the GUI representation of the cluster nodes.

2. Unplug all cables from the adapter. Remember (or mark the cables) into which plug on the adapter each cable belongs.

3. Unmount the adapter from the node to be replaced, and insert it into the new node.

4. Power up the node; then connect the SCI cables in the same way they had been connected before. Make sure that all LEDs on all adapters in the affected ringlets light green again.

5. Run the SIA with the option --install-node. To verify the installation after the SIA has finished:

1. The icon of the node in the sciadmin GUI must have turned green again.

2. The output of the dis_services script should list all services as running.

6. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, “Cabling Correctness Test”).

Warning

Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the command line to verify the functionality of the interconnect (see Section 1.1.3, “Static Interconnect Test”).

In case that more than one node needs to be replaced, please consider the following advice:

To ensure that all nodes that are not being replaced can continue to communicate via the SCI interconnect while other nodes are replaced, you should replace nodes in a ring-by-ring manner: power down nodes within one ringlet only. Bring this group of nodes back to operation before powering down the next group of nodes.

Communication between all other nodes will continue uninterrupted during this procedure.

6. Adding Nodes

To add new nodes to the cluster, please proceed as follows:

1. Install the adapter in the nodes to be added and power the nodes up.


Important

Do not yet connect the SCI cables, but make sure that Ethernet communication towards the frontend is working.

2. Install the DIS software stack on all nodes via the --install-node option of the SIA.
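For example, using the SIA file name pattern shown elsewhere in this guide, run on each new node:

# sh DIS_install_<version>.sh --install-node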

3. Change the cluster configuration using dishostseditor:

1. Load the existing configuration.

2. In the cluster settings, change the topology to match the topology with all new nodes added.

3. Change the hostnames of the newly added nodes via the node settings of each node. Also make sure that the socket configuration matches that of the existing nodes.

4. Save the new cluster configuration. If desired, create and save or print the cabling instructions for the extended cluster.

5. If you are not running dishostseditor on the frontend, transfer the saved files dishosts.conf and cluster.conf to the directory /etc/dis on the frontend.

4. Restart the network manager on the frontend. If you run sciadmin, the new nodes should show up as red icons. All other nodes should continue to stay green.

5. Cable the new nodes:

1. First create the cable connections between the new nodes.

2. Then connect the new nodes to the cluster. Proceed cable-by-cable, which means you disconnect a cable at an "old" node and immediately connect it to a new node (without disconnecting another cable first). This will ensure continued operation of all "old" nodes.

3. When you are done, all LEDs on all adapters should light green, and all node icons in sciadmin should also light green.

6. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, “Cabling Correctness Test”).

Warning

Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the command line to verify the functionality of the interconnect (see Section 1.1.3, “Static Interconnect Test”).

7. Removing Nodes

To permanently remove nodes from the cluster, please proceed as follows:

1. Change the cluster configuration using dishostseditor:

1. Load the existing configuration.

2. In the cluster settings, change the topology to match the topology after the nodes have been removed.

3. Because the topology change might cut off nodes from the cluster at the "wrong" end, you have to make sure that the hostnames and the placement within the new topology for the remaining nodes are correct. To do this, change the hostnames of nodes by double-clicking their icon and changing the hostname in the displayed dialog box. If the SuperSockets™ configuration is based on the hostnames (not on the subnet addresses), make sure that the name of the socket interface matches a modified hostname.


4. Save the new cluster configuration. If desired, create and save or print the cabling instructions for the reduced cluster.

5. If you are not running dishostseditor on the frontend, transfer the saved files dishosts.conf and cluster.conf to the directory /etc/dis on the frontend.

2. Restart the network manager on the frontend. If you run sciadmin, the removed nodes should no longer show up. All other nodes should continue to stay green.

3. Uncable the nodes to be removed one by one, making sure that the remaining nodes are cabled according to the cabling instructions generated above.

4. On the nodes that have been removed from the cluster, the Dolphin Express software can easily be removed using the SIA option --wipe, like:

# sh DIS_install_<version>.sh --wipe

This will remove all Dolphin software packages, services and configuration data from the node.

If no SIA is available, the same effect can be achieved by manually uninstalling all packages that start with Dolphin-, removing potentially remaining installation directories (like /opt/DIS), and removing the configuration directory /etc/dis.
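A minimal sketch of this manual cleanup on an RPM-based node (package names as listed in Section 3.1 above; verify the list that rpm -qa reports before removing anything):

# rpm -qa | grep '^Dolphin-' | xargs rpm -e
# rm -rf /opt/DIS /etc/dis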

5. Perform the cable test from within sciadmin to ensure that the cabling is correct (see Section 4.7.4, “Cabling Correctness Test”).

Warning

Running the cable test will stop other traffic on the interconnect for the time the test is running, which can be up to a minute. If this is not an option, please use scidiag from the command line to verify the functionality of the interconnect (see Section 1.1.3, “Static Interconnect Test”).


Chapter 7. MySQL Operation

This chapter covers topics that are specific to operating MySQL and MySQL Cluster with SuperSockets.

1. MySQL Cluster

Using MySQL Cluster with Dolphin Express and SuperSockets will significantly increase throughput and reduce the response time. Generally, SuperSockets operate fully transparently and no change in the MySQL Cluster configuration is necessary for a working setup. Please read below for some hints on performance tuning and troubleshooting specific to SuperSockets with MySQL Cluster.

1.1. SuperSockets Poll Optimization

SuperSockets offer an optimization of the functions used by processes to detect new data or other status changes on sockets (poll() and select()). This optimization typically helps to increase performance. It has however been shown that for certain MySQL Cluster setups and loads, this optimization can have a negative impact on performance. It is therefore recommended that you evaluate which setting of this optimization delivers the best performance for your setup and load.

This optimization can be controlled on a per-process basis and with a node-global default. Explicit per-process settings override the global default. For a per-process setting, you need to set the environment variable SSOCKS_SYSTEM_POLL to 0 to enable the optimization and to 1 to disable the optimization and use the functions provided by the operating system. To set the node-global default, the setting of the variable system_poll needs to be changed in /etc/dis/supersockets_profiles.conf. Please refer to Section 2, “SuperSockets Configuration” for more details.
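For example, to disable the optimization for all processes started from one shell (a sketch; the value semantics are as described above, and setting the variable to 0 enables the optimization again):

$ export SSOCKS_SYSTEM_POLL=1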

1.2. NDBD Deadlock Timeout

In case you experience timeout errors between the ndbd processes on the cluster nodes, you will see error messages like

ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction

Such timeout problems can have different reasons like dead or overloaded nodes. One other possible reason could be that the time for a socket failover between Dolphin Express and Ethernet exceeded the current timeout, which by default is 1200ms. This should rarely happen, but to solve this problem, please proceed as follows:

1. Increase the value of the NDBD configuration parameter TransactionDeadlockDetectionTimeout (see MySQL reference manual, section 16.4.4.5). As the default value is 1200ms, increasing it to 5000ms might be a good start. You will need to add this line to the NDBD default section in your cluster configuration file:

[NDBD DEFAULT]
TransactionDeadlockDetectionTimeout: 5000

2. Verify the state of the Dolphin Express interconnect by checking the logfile of the network manager (/var/log/scinetworkmanager.log) to see if any interconnect events have been logged. If there are repeated logged error events for which no clear reason (such as manual intervention or node shutdown) can be determined, you should test the interconnect using sciadmin (see Section 4.2, “Traffic Test”).

3. If no events have been logged, it is very unlikely that the interconnect or SuperSockets are the cause of the problem. Instead, you should try to verify that no node is overloaded or stuck.

1.3. SCI Transporter

Prior to SuperSockets, MySQL Cluster could use SCI for communication by means of the SCI Transporter, offered by MySQL. The SCI Transporter is a SISCI-based data communication channel. However, performance with SuperSockets is significantly better than with the SCI Transporter, which is no longer maintained. Platforms that have SuperSockets available should use those instead of the SCI Transporter.


2. MySQL Replication

SuperSockets also significantly increase the performance in replication setups (speedups up to a factor of 3 have been reported).

All that is necessary is to make sure that all MySQL server processes involved run with the LD_PRELOAD variable set accordingly. No MySQL configuration changes are necessary.
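A minimal sketch, assuming the MySQL server is started manually from a root shell (how LD_PRELOAD is injected into an init script differs per distribution):

# export LD_PRELOAD=libksupersockets.so
# mysqld_safe &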


Chapter 8. Advanced Topics

This chapter deals with techniques like performance analysis and tuning, irregular topologies and debug tools. For most installations, the content of this chapter is not relevant.

1. Notification on Interconnect Status Changes

The network manager provides a mechanism to trigger actions when the state of the interconnect changes. The action to be triggered is a user-definable script or executable that is run by the network manager when the interconnect status changes.

1.1. Interconnect Status

The interconnect can be in any of these externally visible states:

UP All nodes and interconnect links are functional.

DEGRADED All nodes are up, but one or more interconnect links have been disabled. Disabling links can either happen manually via sciadmin, or through the network manager because of problems reported by the node managers. In status DEGRADED, all nodes can still communicate via Dolphin Express, but the overall performance of the interconnect may be reduced.

FAILED One or more nodes are down (the node manager is not reachable via Ethernet), and/or a high number of links has been disabled which isolates one or more nodes from the interconnect. These nodes can not communicate via Dolphin Express, but e.g. SuperSockets will fall back to communicate via Ethernet if it is available.

UNSTABLE UNSTABLE is a state which is only visible externally. If the interconnect is changing states frequently (e.g. because nodes are rebooted one after the other), the interconnect will enter the state UNSTABLE. After a certain period of less frequent internal status changes (which are continuously recorded by the network manager), the external state will again be set to either UP, DEGRADED or FAILED.

While in status UNSTABLE, the network manager will enable verbose logging (to /var/log/scinetworkmanager.log) to make sure that no internal events are lost.

1.2. Notification Interface

When the network manager invokes the specified script or executable, it hands over a number of parameters by setting environment variables. The content of these variables can be evaluated by the script or executable. The following variables are set:

DIS_FABRIC The number of the fabric for which this notification is generated. Can be 0, 1 or 2.

DIS_STATE The new state of the fabric. Can be either UP, DEGRADED, FAILED or UNSTABLE.

DIS_OLDSTATE The previous state of the fabric. Can be either UP, DEGRADED, FAILED or UNSTABLE.

DIS_ALERT_TARGET This variable contains the target address for the notification. This target address is provided by the user when the notification is enabled (see below), and the user needs to make sure that the content of this variable is useful for the chosen alert script. For example, if the alert script should send an email, the content of this variable needs to be an email address.

DIS_ALERT_VERSION The version number of this interface (currently 1). It will be increased if incompatible changes to the interface need to be introduced, which could be a change in the possible content of an existing environment variable, or the removal of an environment variable. This is unlikely and does not necessarily make an alert script fail, but a script that relies on this interface in a way where this matters needs to verify the content of this variable.
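A minimal sketch of an alert script that evaluates these variables (Dolphin ships its own alert.sh, see the next section; this sketch simply mails the status change to the configured target):

#!/bin/sh
# Example alert script: invoked by the network manager with the
# DIS_* environment variables described above already set.
echo "Fabric $DIS_FABRIC changed from $DIS_OLDSTATE to $DIS_STATE" | \
    mail -s "Dolphin interconnect status change" "$DIS_ALERT_TARGET"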

1.3. Setting Up and Controlling Notification

1.3.1. Configure Notification via the dishostseditor

Notification on interconnect status changes is configured via the dishostseditor. In the Cluster Edit dialog, tick the check box above Alert target as shown in the screenshot below.

Then enter the alert target and choose the alert script by pressing the button and selecting the script in the file dialog. Dolphin provides an alert script /opt/DIS/etc/dis/alert.sh (for the default installation path) which sends out an email to the specified alert target. Any other executable can be specified here. Please consider that this script will be executed in the context of the user running the network manager (typically root), so the permissions to change this file should be set accordingly.

To make the changes done in this dialog effective, you need to save the configuration files (to /etc/dis on the frontend) and then restart the network manager:

# service dis_networkmgr restart

1.3.2. Configure Notification Manually

If the dishostseditor can not be used, it is also possible to configure the notification by editing /etc/dis/networkmanager.conf. Notification is controlled by two options in this file:

-alert_script <file> This parameter specifies the alert script <file> to be executed.

-alert_target <target> This parameter specifies the alert target <target> which is passed to the chosen alert script.
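For example, the corresponding lines in /etc/dis/networkmanager.conf could look like this (the alert script path matches the default installation; the email address is a placeholder):

-alert_script /opt/DIS/etc/dis/alert.sh
-alert_target admin@example.com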

To disable notification, these lines can be commented out (precede them with a #).

After the file has been edited, the network manager needs to be restarted to make the changes effective:


# service dis_networkmgr restart

1.3.3. Verifying Notification

To verify that notification is actually working, you should provoke an interconnect status change manually. This can easily be done from sciadmin by disabling any link via the Node Settings dialog of any node.

1.3.4. Disabling and Enabling Notification Temporarily

Once the notification has been configured, it can be controlled via sciadmin. This is useful if the alerts should be stopped for some time. To disable alerts, open the Cluster Settings dialog and switch the setting next to Alert script as needed.

This is a per-session setting and will be lost if the network manager is restarted.

Warning

Make sure that the messages are enabled again before you quit sciadmin. Otherwise, interconnect status changes will not trigger notifications until the network manager is restarted.

2. Managing IRM Resources

A number of resources in the low-level driver IRM (service dis_irm) are run-time limited by parameters in the driver configuration file /opt/DIS/lib/modules/<kernel version>/dis_irm.conf (for the default installation path). This file contains numerous parameter settings; for those parameters that are relevant for changes by the user, please refer to Section 3.1, “dis_irm.conf”.

Generally, to change (increase) default limits, the dis_irm.conf file needs to be changed on each node. Typically, you should edit and test the changes on one node, and then copy the file over to all other nodes. To make changes in the configuration file effective, you need to restart the dis_irm driver. Because all other drivers depend on it, it is necessary to restart the complete software stack on the nodes:

# dis_services restart

2.1. Updates with Modified IRM Configuration

You need to be careful when updating RPMs on the nodes with a modified dis_irm.conf. If you directly use RPMs to update the existing Dolphin-SCI RPM, e.g. using rpm -U, the existing and modified dis_irm.conf will be moved to dis_irm.conf.rpmsave, and the default dis_irm.conf will replace the previously modified version. You will need to undo this file renaming.
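A sketch of undoing the renaming after such an update (default installation path, with <kernel version> as in your installation):

# cd /opt/DIS/lib/modules/<kernel version>
# mv dis_irm.conf dis_irm.conf.default
# mv dis_irm.conf.rpmsave dis_irm.conf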

If you update your system with the SIA as described in Chapter 4, Update Installation, the SIA will take care that the existing dis_irm.conf is preserved and stays effective.


Chapter 9. FAQ

This chapter lists problems that have been reported by customers and the proposed solutions.

1. Hardware

1.1.1. Although I have properly installed the adapter in a node and its LEDs light orange, I am told (e.g. during the installation) that this node does not contain an SCI adapter!

The SCI adapter might not have been recognized by the node during the power-up initialization after power was applied again. The specification requires that a node needs to be powered down for at least 5 seconds before being powered up again. To make the adapter be recognized again, you will need to power down the node (restarting or resetting is not sufficient!), wait for at least 5 seconds, and power it up again.

If this does not fix the problem, please contact Dolphin support.

1.1.2. All cables are connected, all LEDs shine green on the adapter boards, and all required services and drivers are running on all nodes. However, some nodes can not see some other nodes via the SCI interconnect. Between some other pairs of nodes, the communication works fine.

These symptoms indicate that the cabling is not correct, e.g. the links 0 and 1 (x- and y-direction in a 2D torus) have been exchanged. To resolve the problem, proceed as follows:

1. Run the cable test from sciadmin (Server Test Cable Connections). If no problem is reported, please contact Dolphin Support.

2. To fix the cable problem, create a cabling description via dishostseditor (File Get Cabling Instructions) and fix the cabling between the nodes that have been reported in the cable test.

3. Repeat steps 1 and 2 until no more problems are reported.

1.1.3. The SCI driver dis_irm refuses to load, or the driver installation never completes. Running dmesg shows that the syslog contains the line Out of vmalloc space. What's wrong?

The problem is that the SCI adapter requires more virtual PCI address space than supported by the installed kernel. This problem has so far only been observed on 32 bit operating systems. There are two alternative solutions:

1. If you are building a small cluster you may be able to run your application with less SCI address space. You can change the SCI address space size for the adapter card by using sciconfig with the command set-prefetch-mem-size. A value of 64 or 16 will most likely overcome the problem. This operation can also be performed from the command line using the options -c to specify the card number (1 or 2) and -spms to specify the prefetch memory size in Megabytes:

# sciconfig -c 1 -spms 64
Card 1 - Prefetch space memory size is set to 64 MB
A reboot of the machine is required to make the changes take effect.

After rebooting the machine, the problem should be solved.

2. If reducing the prefetch memory size is not desired, the related resources in the kernel have to be increased. For x86-based machines, this is achieved by passing the kernel option vmalloc=256m and the parameter uppermem=524288 at boot time.

This is done by editing /boot/grub/grub.conf as shown in the following example:

title CentOS-4 i386 (2.6.9-11.ELsmp)
    root (hd0,0)
    uppermem 524288
    kernel /i386/vmlinuz-2.6.9-11.ELsmp ro root=/dev/sda6 rhgb quiet vmalloc=256m
    initrd /i386/initrd-2.6.9-11.ELsmp.img


2. Software

2.1.1. The service dis_irm (for the low-level driver of the same name) fails to load after it has been installed for the first time.

Please follow the procedure below to determine the cause of the problem.

1. Verify that the adapter card has been recognized by the machine. This can be done as follows:

[root@n1 ~]# lspci -v | grep Dolphin
03:0c.0 Bridge: Dolphin Interconnect Solutions AS PSB66 SCI-Adapter D33x
        Subsystem: Dolphin Interconnect Solutions AS: Unknown device 2200

If this command does not show any output similar to the example above, the adapter card has not been recognized. Please try to power-cycle the system according to FAQ Q: 1.1.1. If this does not solve the issue, a hardware failure is possible. Please contact Dolphin support in such a case.

2. Check the syslog for relevant messages. This can be done as follows:

# dmesg | grep SCI

Depending on which messages you see, proceed as described below:

SCI Driver : Preallocation failed The driver failed to preallocate memory which will be used to export memory to remote nodes. Rebooting the node is the simplest solution to defragment the physical memory space. If this is not possible, or if the message appears even after a reboot, you need to adapt the preallocation settings (see Section 3.1, “dis_irm.conf”).

SCI Driver: Out of vmalloc space    See FAQ Q: 1.1.3.

3. If the driver still fails to load, please contact Dolphin support and provide the driver's syslog messages:

# dmesg > /tmp/syslog_messages.txt

2.1.2. Although the Network Manager is running on the frontend, and all nodes run the Node Manager, configuration changes are not applied to the adapters. For example, the node ID is not changed according to what is specified in /etc/dis/dishosts.conf on the frontend.

The adapters in a node can only be re-configured when they are not in use. This means that no adapter resources may be allocated via the dis_irm kernel module. To achieve this, make sure that upper layer services that use dis_irm (like dis_sisci and dis_supersockets) are stopped.

On most Linux installations, this can be achieved like this (dis_services is a convenience script that comes with the Dolphin software stack):

# dis_services stop
...
# service dis_irm start
...
# service dis_nodemgr start
...

2.1.3. The Network Manager on the frontend refuses to start.

In most cases, the interconnect configuration /etc/dis/dishosts.conf is corrupted. This can be verified with the command testdishosts. It will report problems in this configuration file, as in the example below:

# testdishosts

socket member node-1_0 does not represent a physical adapter in dishosts.conf

Page 64: DIS Install Guide Book

FAQ

57

DISHOSTS: signed32 dishostsAdapternameExists() failed

In this case, the adapter name in the socket definition was misspelled. If testdishosts reports a problem, you can either try to fix /etc/dis/dishosts.conf manually, or re-create it with dishostseditor.

If this does not solve the problem, please check /var/log/scinetworkmanager.log for error messages. If you cannot fix the problem reported in this logfile, please contact Dolphin support, providing the content of the logfile.

2.1.4. After a node has booted, or after I restarted the Dolphin drivers on a node, the first connection to a remote node using SuperSockets only delivers Ethernet performance. Retrying the connection then delivers the expected SuperSockets performance. Why does this happen?

Make sure you run the node manager on all nodes of the cluster, and the network manager on the frontend, which must be correctly set up to include all nodes in the configuration (/etc/dis/dishosts.conf). The option Automatic Create Session must be enabled.

This will ensure that the low-level "sessions" (Dolphin-internal) are set up between all nodes of the cluster, and a SuperSockets connection will immediately succeed. Otherwise, the set-up of the sessions will not be done until the first connection between two nodes is tried, but this is too late for the first connection to be established via SuperSockets.

2.1.5. Socket benchmarks show that SuperSockets are not active as the minimal latency is much more than 10us.

The half roundtrip latency (ping-pong latency) with SuperSockets typically starts between 3 and 4us for very small messages. Any value above 7us for the minimal latency indicates a problem with the SuperSockets configuration, the benchmark methodology, or something else. Please proceed as follows to determine the reason:

1. Is the SuperSockets service running on both nodes? /etc/init.d/dis_supersockets status should report the status running. If the status is stopped, try to start the SuperSockets service with /etc/init.d/dis_supersockets start.

2. Is LD_PRELOAD=libksupersockets.so set on both nodes? You can check using the ldd command. Assuming the benchmark you want to run is named sockperf, do ldd sockperf. The libksupersockets.so should appear at the very top of the listing (a combined sketch of this and the next step is shown after this list).

3. Are the SuperSockets configured for the interface you are using? This is a possible problem if you have multiple Ethernet interfaces in your nodes, with the nodes having different hostnames for each interface. SuperSockets may not be configured to accelerate all of the available interfaces.

To verify this, check which IP addresses (or subnet mask) are accelerated by SuperSockets by looking at /proc/net/af_sci/socket_maps (Linux) and use those IP addresses (or related hostnames) that are listed in this file.

4. If the SuperSockets service refuses to start, or only starts into the mode running, but not configured, you probably have a corrupted configuration file /etc/dis/dishosts.conf: verify that this file is identical to the same file on the frontend. If not, make sure that the Network Manager is running on the frontend (/etc/init.d/dis_networkmgr start).

5. If the dishosts.conf files are identical on frontend and node, they could still be corrupted. Please run the dishostseditor on the frontend to have it load /etc/dis/dishosts.conf; then save it again (dishostseditor will always create syntactically correct files).

6. Please check the system log using the dmesg command. Any output there from either dis_ssocks or af_sci should be noted and reported to <[email protected]>.
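The following sketch illustrates steps 2 and 3 above (sockperf stands in for your benchmark binary; the resolved library path and the contents of socket_maps depend on your installation):

$ export LD_PRELOAD=libksupersockets.so
$ ldd sockperf | head -2                  # libksupersockets.so should be listed first
$ cat /proc/net/af_sci/socket_maps        # lists the IP addresses/subnets accelerated by SuperSockets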

2.1.6. I am running a mixed 32/64-bit platform, and while the benchmarks latency_bench and sockperf from the DIS installation show good performance of SuperSockets™, other applications only show Ethernet performance for socket communication.


Please use the file command to verify if the applications that fail to use SuperSockets are 32-bit applications. If they are, please verify if the 32-bit SuperSockets™ library can be found as /opt/DIS/lib/libksupersockets.so (this is a link). If this file is not found, then it could not be built due to a missing or incomplete 32-bit compilation environment on your build machine. This problem is indicated by the SIA message #* WARNING: 32-bit applications may not be able to use SuperSockets.

If 32-bit libraries cannot be built on a 64-bit platform, the RPM packages will still be built successfully (without 32-bit libraries included), as many users of 64-bit platforms only run 64-bit applications. To fix this problem, make sure that the 32-bit versions of the glibc and libgcc-devel packages are installed on your build machine, and re-build the binary RPM packages using the SIA option --build-rpm, making sure that the warning message shown above does not appear. Then, replace the existing RPM package Dolphin-SuperSockets with the one you have just built. Alternatively, you can perform a complete re-installation.

2.1.7. I have added the statement export LD_PRELOAD=libksupersockets.so to my shell profile to enable the use of SuperSockets™. This works well on some machines, but on other machines, I get the error message ERROR: ld.so object 'libksupersockets.so' from LD_PRELOAD cannot be preloaded: ignored whenever I log in. How can this be fixed?

This error message is generated on machines that do not have SuperSockets installed. On these machines, the linker cannot find the libksupersockets.so library.

This can be fixed by setting the LD_PRELOAD environment variable only if SuperSockets™ are running. For an sh-type shell such as bash, use the following statement in the shell profile ($HOME/.bashrc):

[ -d /proc/net/af_sci ] && export LD_PRELOAD=libksupersockets.so

2.1.8. How can I permanently enable the use of SuperSockets™ for a user?

This can be achieved by setting the LD_PRELOAD environment variable in the user's shell profile (e.g. $HOME/.bashrc for the bash shell). This should be done conditionally by checking if SuperSockets™ are running on this machine:

[ -d /proc/net/af_sci ] && export LD_PRELOAD=libksupersockets.so

Of course, it is also possible to perform this setting globally (in /etc/profile).

2.1.9. I cannot build SISCI applications that are able to run on my cluster because the frontend (where the SISCI-devel package was installed by the SIA) is a 32-bit machine, while my cluster nodes are 64-bit machines (or vice versa). I fail to build the SISCI applications on the nodes as the SISCI header files are missing. How can this deadlock be solved?

When the SIA installed the cluster, it has stored the binary RPM packages in different directories node_RPMS, frontend_RPMS and source_RPMS. You will find a SISCI-devel RPM that can be installed on the nodes in the node_RPMS directory. If you cannot find this RPM file, you can recreate it on one of the nodes using the SIA with the --build-rpm option.
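For example, on one of the nodes (the archive name is a placeholder, as elsewhere in this guide):

# ./DIS_install_<version> --build-rpm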

Once you have the Dolphin-SISCI-devel binary RPM, you need to install it on the nodes using the --force option of rpm because the library files conflict between the installed SISCI and the SISCI-devel RPM:

# rpm -i --force Dolphin-SISCI-devel.<arch>.<version>.rpm


Appendix A. Self-Installing Archive (SIA) Reference

Dolphin provides the complete software stack as a self-installing archive (SIA). This is a single file that contains the complete source code as well as a setup script that can perform various operations, e.g. compiling, installing and testing the required software on all nodes and the frontend. A short usage summary will be displayed when calling the SIA archive with the --help option:
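For example (the archive name depends on the release you downloaded):

# ./DIS_install_<version> --help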

1. SIA Operating Modes

This section explains the different operations that can be performed by the SIA.

1.1. Full Cluster Installation

The full cluster installation mode will install and test the full cluster in a wizard-like guided installation. All required information will be asked for interactively, and it will be tested if the requirements to perform the installation are met. This mode is the default mode, but can also be selected explicitly.

Option: --install-all

1.2. Node Installation

Only build and install the kernel modules and Node Manager service needed to run an interconnect node. Kernel header files and the kernel configuration are required (package kernel-devel), but no GUI packages (like qt, qt-devel, X packages).

Option: --install-node

1.3. Frontend Installation

The frontend installation will only build and install those RPM packages on the local host that are needed to have this host run as a frontend. However, due to limitations of the current build system, the kernel headers and configuration are still required for Linux systems.

Option: --install-frontend

1.4. Installation of Configuration File Editor

Build and install the GUI-based cluster configuration tool dishostseditor. This tool is used to define the topology of the interconnect and the placement of the nodes within this topology. With this information, it can create the detailed cabling instructions (useful to cable non-trivial cluster setups) and the cluster configuration files dishosts.conf and networkmanager.conf needed on the frontend by the Network Manager.

Option: --install-editor

1.5. Building RPM Packages Only

Build all source and binary RPMs on the local machine. Both the kernel headers and configuration (kernel-devel) as well as the GUI development package (qt-devel) are needed on the local machine.

Option: --build-rpm

1.6. Extraction of Source Archive

It is possible to extract the sources from the SIA as a tar archive DIS.tar.gz in the current directory. This is required to build and install on non-RPM platforms, or when you want source-code access in general.


Option: --get-tarball

2. SIA Options

In addition to the different operating modes, a number of options are available that influence the operation. Not all options have an impact on all operating modes.

2.1. Node Specification

In case you want to specify the list of nodes not interactively but on the command line, you can use the option --nodes together with a comma-separated list of hostnames and/or IP addresses to do so.

Example:

--nodes n01,n02,n03,n04

If this option is provided, existing configuration files like /etc/dis/dishosts.conf will not be considered.

2.2. Installation Path Specification

By default, the complete software stack will be installed to /opt/DIS. To change the installation path, use the --prefix option.

Example:

--prefix /usr/dolphin

This will install into /usr/dolphin. It is recommended to install into a dedicated directory that is located on a local storage device (not mounted via the network). When doing a full cluster install (--install-all, or default operation), the same installation path will be used on all nodes, the frontend and potentially the installation machine (if different from the frontend).
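For example, a full cluster installation into /usr/dolphin could be started like this (a sketch; the archive name is a placeholder):

# ./DIS_install_<version> --install-all --prefix /usr/dolphin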

2.3. Installing from Binary RPMs

If you are re-running an installation for which the binary RPM packages have already been built, you can save time by not building these packages again, but using the existing ones. The packages have to be placed in two subdirectories node_RPMS and frontend_RPMS, just as the SIA does. Then, provide the name of the directory containing these two subdirectories to the installer using the --use-rpms option.

Example:

--use-rpms $HOME/dolphin

The installer does not verify if the provided packages match the installation target, but the RPM installation itself will fail in this case.

2.4. Preallocation of SCI Memory

It is possible to specify the number of Megabytes per node that the low-level interconnect driver dis_irm should allocate on startup for exportable memory segments. The amount of this memory determines e.g. how many SuperSocket™-based sockets can be opened. The default setting was chosen to work well with clustered databases. Change this setting if you know you will need more memory to be exported, or will use a very high number of stream sockets per node (datagram sockets are multiplexed and thus need less resources).

By default, the driver allocates 8 + N*MB Megabytes of memory, with N being the number of nodes in the cluster and MB = 4 by default. A maximum of 256MB will be allocated. The factor MB can be specified on installation using the --prealloc option.

Example:


--prealloc 8

On a 16-node cluster, this will make dis_irm allocate 8 + 16*8 = 136MB on each node.

Note

The operating system cannot use preallocated memory for other purposes - it is effectively invisible.

Setting MB to -1 will disable all modifications to this configuration option, and the fixed default of 16MB will be preallocated independently from the number of nodes. Setting MB to 0 is also valid (8 MB will be allocated).

This option changes a value in the module configuration file dis_irm.conf. It is only effective on an initial installation. An existing configuration file dis_irm.conf will never be changed, e.g. when upgrading an existing installation.

2.5. Enforce Installation

If the installed packages should be replaced with the packages built from the SIA you are currently using, even if the installed packages are more recent (have a higher version number), use the option --enforce. This will enforce the installation of the same software version (the one delivered within this SIA) on all nodes and the frontend, no matter what might be installed on any of these machines. Example:

--enforce

2.6. Configuration File Specification

When doing a full cluster install, the installation script will automatically look for the cluster configuration files dishosts.conf and networkmanager.conf in the default path /etc/dis on the installation machine. If these files are not stored in the default path (e.g. because you have created them on another machine or received them from Dolphin and stored them someplace else), you can specify this path using the --config-dir option.

Example:

--config-dir /tmp

The script will look for both configuration files in /tmp.

If you need to specify the two configuration files being stored in different locations, use the options --dishosts-conf <filename> and --networkmgr-conf <filename>, respectively, to specify where each of the configuration files can be found.
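For example, if the two files are stored in different locations (the paths are hypothetical):

# ./DIS_install_<version> --dishosts-conf /tmp/dishosts.conf --networkmgr-conf /root/networkmanager.conf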

2.7. Batch Mode

In case you want to run the installation unattended, you can use the --batch option to have the script assume the default answer for every question that is asked. Additionally, you can avoid most of the console output (but still have the full logfile) by providing the option --quiet. This option can be very useful if you are upgrading an already installed cluster. For example, to enforce the installation of newly compiled RPM packages and reboot the nodes after the installation, you could issue the following command on the frontend:

# ./DIS_install_<version> --batch --reboot --enforce >install.log

After this command returns, your cluster is guaranteed to be freshly installed unless any error messages can be found in the file install.log.
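A quick way to check the logfile for problems after such an unattended run:

# grep -i error install.log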

2.8. Non-GUI Build Mode

When building RPMs only (using the --build-rpm option), it is possible to specify that no GUI applications (sciadmin and dishostseditor) should be built. This is done by providing the --disable-gui option. This removes the dependency on the QT libraries and header files for the build process. Example:

--disable-gui


2.9. Software Removal

To remove all software that has been installed via SIA, simply use the --uninstall option:

--uninstall

This will remove all packages from the node, and stop all drivers (if they are not in use). A more thorough cleanup, including all configuration data and possible remnants of non-SIA installations, can be achieved with the --wipe option:

--wipe

This option is a superset of --uninstall.


Appendix B. sciadmin Reference

1. Startup

Check out sciadmin -h for startup options. In order to connect to the network manager, you may either start sciadmin with the -cluster option, or choose the Connect button after the startup is complete. Type the hostname or IP-address of the machine running the Dolphin Express network manager.

Note

Only one sciadmin process can connect to the network manager at any time. If you should ever need to connect to the network manager while another sciadmin process is blocking the connection, you can restart the network manager to terminate this connection. Afterwards, you can connect to the network manager from your sciadmin process (which needs to be running on a different machine than the other sciadmin process).

2. Interconnect Status View

2.1. Icons

As a visual tool, sciadmin uses icons to let the user trigger actions and to display information by changing the icon shape or color. The icons with all possible states are listed in the tables below.

Table B.1. Node or Adapter State

Dolphin Network Manager has a valid connection to the Dolphin Node Manager on the node.

Dolphin Network Manager cannot reach the node using TCP/IP, the adapter is wrongly configured, broken, or the driver is in an invalid state.

Adapter has gone into a faulty state where it cannot read system interrupts and has been isolated by the Dolphin IRM driver.


Table B.2. Link State

Green pencil strokes indicate that the links are up.

Red pencil strokes indicate that a link is broken. Typically, a cable is unplugged, or not seated well.

Yellow pencil strokes indicate that links have been disabled. Links are typically disabled when there are broken cables somewhere else in the ringlet and automatic fail over recovery has been enabled.

A red dot (cranberry) indicates that the node has lost connectivity to other nodes in the cluster.

Blue pencil strokes indicate that an administrator has chosen to disable this link in sciadmin. This may be done if you want to debug the cluster.

2.2. Operation

2.2.1. Cluster Status

The area at the top right informs about the current cluster status and shows settings of sciadmin and the connected network manager. A number of settings can be changed in the Cluster Settings dialog that is shown when pressing the Settings button.

• Fabric status shows the current status of the fabric, UP, DEGRADED, FAILED or UNSTABLE (see below).

• Check Interval SCIAdmin shows the number of seconds between each time the Network Manager sends updates to the Dolphin Admin GUI.

• Check Interval Network Manager shows the number of seconds between each time the Network Manager receives updates from the Node Managers.

• Topology shows the current topology of the fabric.

• Auto Rerouting shows the current automatic fail over recovery settings (On, Off or default).

The fabric is UP when all nodes are operational and all links are OK and therefore plotted in green.


Figure B.1. Fabric is UP

The fabric is DEGRADED when all nodes are operational and some links are broken, but we still have full connectivity. In the snapshot below, the input cable of link 0 of node tiger-1, which is the output cable at node tiger-3, is defunct (this typically means unplugged) and therefore the link is plotted in red. The other links on this ring have been disabled by the network manager and are therefore plotted in yellow. All other links are functional and plotted in green.

To get more information on the interconnect status for node tiger-1, get its diagnostics via Node Diag -V 1.


Figure B.2. Fabric is DEGRADED

The fabric is in status FAILED if several links are broken in a way that breaks full connectivity. In this case, the input cable of link 0 and the output cable of link 1 are defunct. Node tiger-1 cannot communicate via SCI in this situation, and SuperSockets-driven sockets will have fallen back to Ethernet.

Because the cluster is cabled in an interleaved pattern, the link 1 output cable of tiger-1 is the link 1 input cable of tiger-4, and not tiger-7 as it would be the case for non-interleaved cabling.


Figure B.3. Fabric has FAILED due to loss of connectivity

The fabric status is also set to FAILED if one or more nodes are dead, as such a node cannot be reached via SCI. The reasons for a node being dead can be:

• Node is not powered up. Solution: power up the node.

• Node has crashed. Solution: reboot the node.

• The IRM low-level driver is not running. Solution: start the IRM driver like

# service dis_irm start

• The node manager is not running. Solution: start the node manager like

# service dis_nodemgr start

• The adapter is in an invalid state or is missing. Please check the node, and also consider the related topic in the FAQ (Q: 1.1.1).


Figure B.4. Fabric has FAILED due to dead nodes

2.2.2. Node Status

The status of a node icon tells if a node is up/dead or if a link is broken, disabled (either by the network manager or manually) or up. When selecting a node you will see details in the Node Status area:

• Serial number. A unique serial number given in production.

• Adapter Type: The Dolphin part number of the adapter

• Adapter number: The number of the adapter selected

• SCI Link 0: current status of link 0 (enabled or disabled)

• SCI Link 1: current status of link 1 (enabled or disabled)

3. Node and Interconnect Control

3.1. Admin Menu

The items in the Admin menu specify information that is relevant for the Dolphin Admin GUI.


Figure B.5. Options in the Admin menu

• Connect to the network manager running on the local or a remote machine.

• Disconnect from the network manager.

• Refresh Status of the node and interconnect (instead of waiting for the update interval to expire).

• Switch to Debug Statistics View will show the value of selected counters of each adapter instead of the node icons, which is useful for debugging fabric problems.

3.2. Cluster Menu

The commands in the cluster menu are executed on all nodes in parallel and the results are displayed by sciadmin. When choosing one of the fabric options, the command will be executed on all nodes in that fabric.

Figure B.6. Options in the Cluster menu

Each fabric in the cluster has a sub-menu Fabric <X>. Within this sub-menu, Diag (-V 0), Diag (-V 1) and Diag (-V 9) are diagnostics functions that can be used to get more detailed information about a fabric that shows problem symptoms.

• Diag (-V 0) prints only errors that have been found.

• Diag (-V 1) prints more verbose status information (verbosity level 1).

• Diag (-V 9) prints the full diagnostic information including all error counters (verbosity level 9).


• Diag -clear clears all the error counters in the Dolphin Express interconnect adapters. This helps to observe if error counters are changing.

• Diag -prod prints production information about the Dolphin Express interconnect adapters (serial number, card type, firmware revision etc.).

• The Test option is described in Section 4.2, “Traffic Test”

The other commands in the Cluster menu are:

• Settings displays the Cluster Settings dialog (see below).

• Reboot cluster nodes reboots all cluster nodes after a confirmation.

• Power down cluster nodes powers down all cluster nodes after a confirmation.

• Toggle Network Manager Verbose Settings to increase/decrease the amount of logging from the Dolphin Network Manager.

• Select the Arrange Fabrics option to make sure that the different adapters in your hosts are connected to the same fabric. This option is only displayed for clusters with more than one fabric.

• Test Cable Connections is described in Section 4.1, “Cable Test”.

3.3. Node Menu

The options in the Node menu are identical to the options in the Cluster and Cluster Fabrics <X> menus, only that commands are executed on the selected node only. The only additional option is Settings, which is described in Section 3.5, “Adapter Settings”.

Figure B.7. Options in the Node menu

3.4. Cluster Settings

The Dolphin Interconnect Manager provides you with several options on how to run the cluster.


Figure B.8. Cluster configuration in sciadmin

• Check Interval Admin alters the number of seconds between each time the Network Manager sends updates to the SCIAdmin GUI.

• Check Interval Network Manager alters the number of seconds between each time the Network Manager receives updates from the Node Managers.

• Topology lets you select the topology of the cluster, while Topology found displays the auto-determined topology. Changes to the topology setting can be performed with dishostseditor.

• Auto Rerouting lets you decide to enable automatic fail over recovery (On), choose to freeze the routing to a current state (Off), or use the default routing tables in the driver (Default); the latter also means that no automatic rerouting will take place.

• Nodes in X,Y,Z dimension shows how the interconnect is currently dimensioned. Changes to the dimension settings can be performed with dishostseditor.

• Remove Session to dead nodes lets you decide whether to remove the session to nodes that are unavailable.

• Wait before removing session defines the number of seconds to wait until removing sessions to a node that has died or become inaccessible by other means.

• Automatic Create Sessions to new nodes lets you decide if the Network Manager shall create sessions to all available nodes.

• Alert script lets you choose to enable/disable the use of a script that may alert the cluster status to an administrator.


3.5. Adapter Settings

The Advanced Settings button in the node menu allows you to retrieve more detailed information about an adapter and to disable/enable links of this adapter.

Figure B.9. Advanced settings for a node

• Link Frequency sets the frequency of a link. It is not recommended to change the default setting.

• Prefetch Memsize shows the maximum amount of remote memory that can be accessed by this node.

A changed value will not become effective until the IRM driver is restarted on the node, which has to be done outside of sciadmin. Setting this value too high (> 512MB) can cause problems with some machines, especially on 32-bit platforms.

• SCI LINK 0 / 1 / 2 allows you to set the way a link is controlled:

• Automatic lets the network manager control the link, enabling and disabling it as required by the link and the interconnect status.

• Disabled forces a link down. This is a per-session setting (the link will be under control of the network manager if it is restarted), and only required as a temporary measure for troubleshooting.

The disable link option can also be used as a temporary measure to disable an unstable adapter or ringlet so that it does not impose unnecessary noise on the adapters. If such an unlikely event occurs, please contact Dolphin support.

A manually disabled link is marked blue in the sciadmin interconnect display, as shown in the screenshot below.


Warning

Please note that when Auto Rerouting is enabled (default setting), disabling a link within a ringlet will disable the complete ringlet. Disabling too many links can thus isolate nodes from access to the Dolphin Express interconnect.

Figure B.10. Link disabled by administrator (Disabling the links on the machine with hostname tiger-5 takes down the corresponding links on the other machines that share the same ringlet.)

4. Interconnect Testing & Diagnosis

4.1. Cable Test

Test Cable Connections tests the cluster for faulty cabling by reading serial numbers and adapter numbers from the other nodes on individual rings only. Using this test, you can verify that the cables are connecting the right nodes via the right ports, which means it serves to ensure that the physical cabling matches the interconnect description in the dishosts.conf configuration file.

This test is very useful after a fresh installation, but also every time you have worked on the cabling. It will only take a few seconds to complete and displays its results in an editor. This allows you to copy or print the test result to fix the described problems right at the cluster.


Warning

Please note that while this test is running, all traffic over the Dolphin Express interconnect will be blocked. Although this will not introduce any communication errors, only a delay, it is recommended to run the test on an idle cluster.

SuperSockets will fall back to Ethernet while this test is running.

Figure B.11. Result of running cable test on a good cluster

Figure B.12. Result of cable test on a problematic cluster


4.2. Traffic Test

The Test option for each fabric of a cluster verifies the connection quality of the links that make up the fabric. It will search for bad connections by imposing the maximum amount of traffic on individual rings and observing the internal error counters of all adapters involved.

A typical problem that can be found with this test is a not well-fitted cable, as this will cause CRC errors on the related link. Such CRC errors do not cause data corruption, as the corrupted packet will be detected and retransmitted, but the performance will decrease to some degree. A fabric should show no errors on this test. If errors are displayed, the cable connections between the affected nodes should be verified. For more information, please see Section 4.7.5, “Fabric Quality Test”.

Note

To perform this test, the SISCI RPM has to be installed on all nodes. This is the case if the installation was performed via SIA. If SISCI is not installed on a node, an error will be logged and displayed as shown below.

Warning

Please note that while this test is running, all traffic over the Dolphin Express interconnect will be blocked. Although this will not introduce any communication errors, only a delay, it is recommended to run the test on an idle cluster.

SuperSockets will fall back to Ethernet while this test is running.


Figure B.13. Result of fabric test without installing all the necessary rpms


Figure B.14. Result of fabric test on a proper fabric


Appendix C. Configuration Files

1. Cluster Configuration

A cluster with Dolphin Express interconnect requires one combined configuration file dishosts.conf for the interconnect topology and the SuperSockets acceleration of existing Ethernet networks, and another file networkmanager.conf that contains the basic options for the mandatory Network Manager. Both of these files should be created using the GUI tool dishostseditor. Using this tool vastly reduces the risk of creating an incorrect configuration file.

1.1. dishosts.conf

The file dishosts.conf is used as a specification of the Dolphin Express interconnect (in a way, just like /etc/hosts specifies nodes on a plain IP-based network). It is a system-wide configuration file and should be located with its full path on all nodes at /etc/dis/dishosts.conf. Templates of this file can be found in /opt/DIS/etc/dis/. A syntactical and semantic validation of dishosts.conf can be done with the tool testdishosts.

The Dolphin network manager and diagnostic tools will always assume that the current file dishosts.conf is valid. If dynamic information read from the network contradicts the information read from the dishosts.conf file, the Dolphin network manager and diagnostic tools will assume that components are misconfigured, faulty or removed for repair.

dishosts.conf is by default automatically distributed to all nodes in the cluster when the Dolphin network management software is started. Therefore, edit and maintain this file on the frontend only. You should create and maintain dishosts.conf by using the dishostseditor GUI (Unix: /opt/DIS/sbin/dishostseditor). Normally, there is no reason to edit this file manually. To make changes in dishosts.conf effective, the network manager needs to be restarted like

# service dis_networkmgr restart

In case SuperSockets settings have been changed, dis_ssocks_cfg needs to be run on every node, as SuperSockets are not controlled by the network manager.
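A minimal sketch for doing this from the frontend (assuming password-less ssh, that dis_ssocks_cfg is installed under the default path /opt/DIS/sbin, and hypothetical node names; adjust to your setup):

# for n in n01 n02 n03 n04; do ssh $n /opt/DIS/sbin/dis_ssocks_cfg; done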

The following sections describe the keywords used.

1.1.1. Basic settings

DISHOSTVERSION [ 0 | 1 | 2 ]    The version number of the dishosts.conf is specified after the keyword DISHOSTVERSION. The DISHOSTVERSION should be put on the first line of the dishosts file that is not a comment. DISHOSTVERSION 0 designates a very simple configuration (see Section 1.1.3, “Miscellaneous Notes”). DISHOSTVERSION > 0 maps hosts/IPs to adapters, which in turn are mapped to nodeIds and physical adapter numbers by means of the ADAPTER entries. DISHOSTVERSION 1 or higher is required for running with multiple adapter cards and transparent fail over etc. DISHOSTVERSION 2 provides support for dynamic IP-to-nodeId mappings (sometimes also referred to as virtual IPs).

HOSTNAME: <hostname/IP>    Each cluster node is assigned a unique dishostname, which has to be equal to its hostname. The hostname is typically the network name (as specified in /etc/hosts), or the node's IP-address.

HOSTNAME: host1.dolphinics.no
HOSTNAME: 193.69.165.21

ADAPTER: <physical adapter-name> <nodeId> <adapterNo>

A Dolphin network node may hold several physical adapters. Information about a node's physical adapters is listed right below the hostname. All nodes specified by a HOSTNAME need at least one physical adapter. This physical adapter has to be specified on the next line after the HOSTNAME. The physical adapters are associated with the keyword ADAPTER.

#Keyword  name      nodeid  adapter
ADAPTER:  host1_a0  4       0
ADAPTER:  host1_a1  4       1

STRIPE: <virtual adaptername> <physical adaptername 1> <physical adaptername 2>

Defines a virtual striping adapter comprising two physical adapters, which will be used for automatic data striping (also referred to as channel bonding). Striping adapters will also be used as redundant adapters in the case of network failure.

STRIPE: host1_s host1_a0 host1_a1

REDUNDANT: <virtual adaptername> <physical adaptername 1> <physical adaptername 2>

Defines a virtual redundant adapter comprising two physical adapters, which will be used for automatic fail over in case one of the fabrics fails.

REDUNDANT: host1_r host1_a0 host1_a1

1.1.2. SuperSockets settings

The SuperSockets configuration is responsible for mapping certain IP addresses to Dolphin Express adapters. It defines which network interfaces are enabled for Dolphin SuperSockets.

SOCKET: <host/IP> <physical or virtual adaptername>

Enables the given <host/IP> for SuperSockets using the specified adapter.

In the following example we assume host1 and host2 have two network interfaces each, designated host1pub, host1prv, host2pub, host2prv, but only the 'private' interfaces hostXprv are enabled for SuperSockets using a striping adapter:

SOCKET: host1prv host1_s
SOCKET: host2prv host2_s
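Putting the keywords above together, a complete minimal dishosts.conf for a two-node cluster with two adapters per node might look like this (a sketch only; hostnames, adapter names and node IDs are examples):

DISHOSTVERSION 1

HOSTNAME: host1prv
ADAPTER:  host1_a0 4 0
ADAPTER:  host1_a1 4 1
STRIPE:   host1_s  host1_a0 host1_a1

HOSTNAME: host2prv
ADAPTER:  host2_a0 8 0
ADAPTER:  host2_a1 8 1
STRIPE:   host2_s  host2_a0 host2_a1

SOCKET: host1prv host1_s
SOCKET: host2prv host2_s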

Starting with DISHOSTVERSION 2, SuperSockets can handle dynamic IP-to-nodeId mappings, i.e. a certain IP address does not need to be bound to a fixed machine but can roam in a pool of machines. The address resolution is done at runtime. For such a configuration, a new type of adapter must be specified:

SOCKETADAPTER: <socketadaptername> [ SINGLE | STRIPE | REDUNDANT ] <adapterNo> [ <adapterNo> ... ]

This keyword basically only defines an adapter number, which is not associated to any nodeId.

Example:

SOCKETADAPTER: sockad_s STRIPE 0 1

Defines socket adapter "sockad_s" in striping mode using physical adapters 0 and 1. The resulting internal adapter number is 0x2003.

Such socket adapters can now be used in order to define dynamic mappings; and, in extension to DISHOSTVERSION 1, whole networks can be specified for dynamic mappings:

SOCKET: [ <ip> | <hostname> | <network/mask_bits> ] <socketadapter>

Enables the given address/network for SuperSockets and associates it with a socket adapter.

It is possible to mix dynamic and static mappings, but there must be no conflicting entries.

Example:

SOCKET: host1 sockad_s
SOCKET: host2 sockad_s
SOCKET: host3 host3_s
SOCKET: 192.168.10.0/24 sockad_s


1.1.3. Miscellaneous Notes

• Using multiple nodeIds per node is supported. This can be used for some advanced high-availability switch configurations.

• A short version of dishosts.conf is supported for compatibility reasons, which corresponds to DISHOSTVERSION 0. Please note that neither virtual adapters nor dynamic socket mappings are supported by this format.

Example:

#host/IP   nodeid
host1prv   4
host2prv   8

1.2. networkmanager.conf

The networkmanager.conf specifies the startup parameters for the Dolphin Network Manager. It is created by the dishostseditor.

1.3. cluster.conf

This file must not be edited by the user. It is a configuration file of the Network Manager that consists of the user-specified settings from networkmanager.conf and derived settings of the cluster (nodes). It is created by the Network Manager.

2. SuperSockets Configuration

The following sections describe the configuration files that specifically control the behaviour of Dolphin SuperSockets™. In addition to these files, SuperSockets retrieve important configuration information from dishosts.conf as well.

To make changes in any of these files effective, you need to run dis_ssocks_cfg on every node. Changes do not apply to sockets that are already open.

2.1. supersockets_profiles.conf

This file defines system-wide settings for all SuperSockets™ applications using LD_PRELOAD. All settings can be overridden by environment variables named SSOCKS_<option> (like export SSOCKS_DISABLE_FALLBACK=1).

SYSTEM_POLL [ 0 | 1 ]    Usage of poll/select optimization. Default is 0, which means that the SuperSockets optimization for the poll() and select() system calls is used. This optimization typically reduces the latency without increasing the CPU load. To only use the native system methods for poll() and select(), set this value to 1.

RX_POLL_TIME <int>    Receive poll time [µs]. Default is 30. Increasing this value may reduce the latency as the CPU will spin longer to wait for new data until it blocks sleeping. Reducing this value will send the CPU to sleep earlier, but this may increase message latency.

TX_POLL_TIME <int>    Transmit poll time [µs]. Default is 0, which means that the CPU does not spin at all when no buffers are available at the receiving side. Instead, it will immediately block until the receiver reads data from these buffers (which makes buffer space available again for sending). The situation of no available receive buffers rarely occurs, and increasing this value is not recommended.

MSQ_BUF_SIZE <int>    Message buffer size [byte]. Default is 128KB. This value determines how much data can be sent without the receiver reading it. It has no significant impact on bandwidth.


MIN_DMA_SIZE <int>    Minimum message size for DMA [byte]. Default is 0 (DMA disabled).

MAX_DMA_GATHER <int>    Maximum number of messages gathered into a single DMA transfer. Default is 1.

MIN_SHORT_SIZE <int>    Switch point [byte] from INLINE to SHORT protocol. Default depends on driver.

MIN_LONG_SIZE <int>    Switch point [byte] from SHORT to LONG protocol. Default depends on driver.

FAST_GTOD [ 0 | 1 ]    Usage of accelerated gettimeofday(). Default is 0, which disables this optimization. Set to 1 to enable it.

DISABLE_FALLBACK [ 0 | 1 ]    Control fallback from SuperSockets™ to native sockets. Default is 0, which means fallback (and fallforward) is enabled. To ensure that only SuperSockets™ are used (e.g. for benchmarking), set it to 1 (see the example after this list).

ASYNC_PIO [ 0 | 1 ]    Usage of fully asynchronous transfers. Default is 1, which means that the SHORT and LONG protocols are processed by a dedicated kernel thread. By this, the sending process is available immediately, and the actual data transfer is performed asynchronously. This generally increases throughput and reduces CPU load without affecting small message latency. To disable asynchronous transfers, set this option to 0; in this case, all data transfers are performed by the CPU that runs the process that called the send function.
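For example, to make sure a latency benchmark uses only SuperSockets™ and never falls back to Ethernet, the preload library can be combined with the environment override for DISABLE_FALLBACK described above (latency_bench stands in for your benchmark binary):

$ export LD_PRELOAD=libksupersockets.so
$ export SSOCKS_DISABLE_FALLBACK=1    # ensure only SuperSockets are used, no Ethernet fallback
$ ./latency_bench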

2.2. supersockets_ports.conf

This file is used to configure the port filter for SuperSockets. If no such file exists, all ports will be enabled by default. It is, however, recommended to exclude all system ports. A suitable port configuration file is part of the SuperSockets software package. You can adjust it to your specific needs.

# Default port configuration for Dolphin SuperSockets
# Ports specifically enabled or disabled to run over SuperSockets.
# Any socket not specifically covered, is handled by the default:

EnablePortsByDefault yes

# Recommended settings:

# Disable the privileged ports used by system services.
DisablePortRange tcp 1 1023
DisablePortRange udp 1 1023

# Disable Dolphin Interconnect Manager service ports.
DisablePortRange tcp 3443 3445

The following keywords are valid:

EnablePortsByDefault [ yes | no ] Determines the policy for unspecified ports.

DisablePortRange [ tcp | udp ] <from> <to>

Explicitly disables the given port range for the given socket type.

EnablePortRange [ tcp | udp ] <from> <to>

Explicitly enables the given port range for the given socket type.

3. Driver Configuration

The Dolphin drivers are designed to adapt to the environment they are operating in; therefore, manual configuration is rarely required. The upper limit for memory allocation of the low-level driver is the only setting that may need to be adapted for a cluster, but this is also done automatically during the installation.


Warning

Changing parameters in these files can affect reliability and performance of the Dolphin Express™ interconnect.

3.1. dis_irm.conf

dis_irm.conf is located in the lib/modules directory of the DIS installation (default /opt/DIS) and contains options for the hardware driver (dis_irm kernel module). Only a few options are to be modified by the user. These options deal with the memory pre-allocation in the driver:

Warning

Changing values in dis_irm.conf other than those described below may cause the interconnect to malfunction. Only do so if instructed by Dolphin support.

Whenever a setting in this file is changed, the driver needs to be reloaded to make the new settings effective.
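A sketch of reloading the driver after editing dis_irm.conf, based on the service commands used elsewhere in this guide (the exact actions supported by dis_services and the init scripts may differ on your installation):

# dis_services stop          # stop SuperSockets, SISCI and other services using dis_irm
# service dis_irm restart    # reload the low-level driver with the new settings
# dis_services start         # bring the upper-layer services back up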

3.1.1. Resource Limitations

These parameters control memory allocations that are only performed on driver initialization.

dis_max_segment_size_megabytes    Sets the maximum size of a memory segment that can be allocated for remote access. Some systems may lock up if too much memory is requested. Unit: MiB. Valid values: integers > 0. Default value: 4.

max-vc-number    Maximum number of virtual channels (one virtual channel is needed per remote memory connection; i.e. 2 per SuperSocket connection). Valid values: integers > 0; the upper limit is the consumed memory, values > 16384 are typically not necessary. Default value: 1024.

3.1.2. Memory Preallocation

Preallocation of memory is recommended on systems without IOMMU (like x86 and x86_64). The problem is memory fragmentation over time, which can make it difficult to allocate large segments of contiguous physical memory after the system has been running for some time. To overcome this situation, options have been added to let the IRM driver allocate blocks of memory upon initialization and to provide memory from this pool under certain conditions for allocation of remotely accessible memory segments.

number-of-megabytes-preallocated    Defines the number of MiB of memory the IRM shall try to allocate upon initialization. Unit: MiB. Valid values: 0: disable preallocation; >0: MiB to preallocate in as few blocks as possible. Default value: 16 (may be increased by the installer script).

use-sub-pools-for-preallocation    If the IRM fails to allocate the amount of memory specified by number-of-megabytes-preallocated, it will by default repetitively decrease the amount and retry until success. By enabling use-sub-pools-for-preallocation, the IRM will continue to allocate memory (possibly in small chunks) until the amount specified by number-of-megabytes-preallocated is reached. Valid values: 0: disable sub-pools; 1: enable sub-pools. Default value: 1.

block-size-of-preallocated-blocks    To allocate not a single large block, but multiple blocks of the same size, this parameter has to be set to a value > 0. Pre-allocating memory this way is useful if the application to be run on the cluster uses many memory segments of the same (relatively small) size. Unit: bytes. Valid values: 0: don't preallocate memory in this manner; > 0: size in bytes (will be aligned upwards to a page size boundary) of each memory block. Default value: 0.

number-of-preallocated-blocks    The number of blocks to be preallocated (see previous parameter). Valid values: 0: don't preallocate memory in this manner; > 0: number of blocks. Default value: 0.

minimum-size-to-allocate-from-preallocated-pool    This sets a lower limit on the size of memory segments the IRM may try to allocate from the preallocated pool. The IRM will always request additional memory from the system when resolving memory requests smaller than this size. Due to the nature of the preallocation mechanism, there is a "hard" lower limit of one SCI_PAGE (currently 8K). The minimum size is defined in 1K blocks. Unit: KiB. Valid values: 0: always allocate from pre-allocated memory; > 0: try to allocate memory that is smaller than this value from non-preallocated memory. Default value: 0.

try-first-to-allocate-from-preallocated-pool    Directs the IRM when to try to use memory from the preallocated pool. Valid values: 0: the preallocated memory pool becomes a backup solution, only to be used when the system can't honor a request for additional memory; 1: the IRM prefers to allocate memory from the preallocated pool when possible. Default value: 1.

3.1.3. Logging and Messages

link-messages-enabled    Control logging of non-critical link messages during operation. Valid values: 0: no link messages; 1: show link messages. Default value: 0.

notes-disabled    Control logging of non-critical notices during operation. Valid values: 0: show notice messages; 1: no notice messages. Default value: 0.

warn-disabled    Control logging of general warnings during operation. Valid values: 0: show warning messages; 1: no warning messages. Default value: 0.

dis_report_resource_outtages    Control logging of out-of-resource messages during operation. Valid values: 0: no messages; 1: show messages. Default value: 0.

notes-on-log-file-only    Control printing of driver messages to the system console. Valid values: 0: also print to console; 1: only print to syslog. Default value: 0.

3.2. dis_ssocks.conf

Configuration file for the SuperSockets™ (dis_ssocks) kernel module. If a value different from the default is required, edit and uncomment the appropriate line.

#tx_poll_time=0;
#rx_poll_time=30;
#min_dma_size=0;
#min_short_size=1009;
#min_long_size=8192;
#address_family=27;
#rds_compat=0;

The following keywords are valid:

tx_poll_time    Transmit poll time [µs]. Default is 0, which means that the CPU does not spin at all when no buffers are available at the receiving side. Instead, it will immediately block until the receiver reads data from these buffers (which makes buffer space available again for sending). The situation of no available receive buffers rarely occurs, and increasing this value is not recommended.

rx_poll_time    Receive poll time [µs]. Default is 30. Increasing this value may reduce the latency as the CPU will spin longer to wait for new data until it blocks sleeping. Reducing this value will send the CPU to sleep earlier, but this may increase message latency.


min_dma_size Minimum message size for using DMA (0 means no DMA). Default is 0.

min_short_size Minimum message size for using SHORT protocol. Default and maximum is 1009.

min_long_size Minimum message size for using LONG protocol. Default is 8192.

address_family    AF_SCI address family index. Default value is 27. If not set, the driver will automatically choose another index between 27 and 32 until it finds an unused index. The chosen index can be retrieved via the /proc file system like cat /proc/net/af_sci/family. If this value is set explicitly, this value will be chosen, and no search for unused values is performed.

Generally, this value is only required if SuperSockets should be used explicitly without the preload library.

rds_compat RDS compatibility level. Default is 0.


Appendix D. Platform Issues and Software Limitations

This chapter lists known issues of Dolphin Express with certain hardware platforms and limitations of the software stack. Some of these limitations can be overcome by changing default settings of runtime parameters to match your requirements.

1. Platforms with Known Problems

Intel Chipset 5000V    Due to a PCI-related bug in the Intel 5000V chipset, limitations on how this chipset can be used with SCI have to be considered. This only applies to very specific configurations; one example of such a configuration is the motherboard SuperMicro X7DVA-8 when the onboard SCSI HBA is used and the SCI adapter is placed in a specific slot.

If you plan to use such a platform, please contact Dolphin support in advance.

Intel IA-64    The write performance to remote memory within the kernel is low on IA-64 machines, limiting the SuperSockets performance. We recommend to use x86_64 platforms instead.

2. IRM

Resource Limitations    The IRM (Interconnect Resource Manager) manages the hardware and related software resources of the Dolphin Express interconnect. Some resources are allocated once when the IRM is loaded. The default settings are sufficient for typical cluster sizes and usage scenarios. However, if you hit a resource limit, it is possible to increase the following limits:

Maximum Number of Nodes

3. SuperSockets

UDP Broadcasts    SuperSockets do not support UDP broadcast operations.

Heterogeneous Cluster Operation (Endianness)    By default, SuperSockets are configured to operate in clusters where all nodes use the same endian representation (either little endian or big endian). This avoids costly endian conversions and works fine in typical clusters where all nodes use the same CPU architecture.

Only if you mix nodes with Intel or AMD (x86 or x86_64) CPUs with, e.g., nodes using PowerPC- or Sparc-based CPUs will this default setting not work. In this case, an internal flag needs to be set. Please contact Dolphin support if this situation applies to you.

Sending and Receiving Vectors    The vector length for the writev() and sendmsg() functions is limited to 16. For readv() and recvmsg(), the vector length is not limited.

Socket Options    The following socket options are supported by SuperSockets™ for communication over Dolphin Express:

• SO_DONTROUTE (implicit, as SuperSockets don't use IP packets for data transport and thus are never routable)

• TCP_NODELAY


• SO_REUSEADDR

• SO_TYPE

The following socket options are passed to the native (fallback) socket:

• SO_SENDBUF and SO_RECVBUF (the buffer size for the SuperSockets is fixed).

All other socket options are not supported (ignored).

Fallforward for Stream Sockets    While SuperSockets offer fully transparent fall-back and fall-forward between Dolphin Express-based communication and native (Ethernet) communication for any socket (TCP or UDP) while it is open and used, there is currently a limitation on sockets when they connect: a socket that has been created via SuperSockets and connected to a remote socket while the Dolphin Express interconnect was not operational will not fall forward to Dolphin Express communication when the interconnect comes up again. Instead, it will continue to communicate via the native network (Ethernet).

This is a rare condition that typically will not affect operation. If you suspect that one node does not perform up to expectations, you can either contact Dolphin support to help you diagnose the problem, or restart the application, making sure that the Dolphin Express interconnect is up.

Removal of this limitation, as well as a simple way to diagnose the precise state of a SuperSockets-driven socket, is scheduled for updated versions of SuperSockets.

Resource Limitations    SuperSockets allocate resources for the communication via Dolphin Express by means of the IRM. Therefore, the resource limitations listed for the IRM indirectly apply to SuperSockets as well. To resolve such limitations if they occur (e.g. when using a very large number of sockets per node), please refer to the relevant IRM section above. SuperSockets log messages to the syslog for two typical out-of-resource situations:

• No more VCs available. The maximum number of virtual channels needs to be increased (see Section 2, “IRM”).

• No more segment memory available. The amount of pre-allocated memory needs to be increased (see Section 2, “IRM”).