Cluster Development Group
AS OF 12/18/2003
December 2003
Dell’s High Availability Cluster Product Strategy
This article outlines generic High Availability (HA) Cluster requirements and supported configurations. The paper also explains the logic behind a number of the rules and requirements for Dell supported HA solutions.
Overview
Ensuring access to information requires that applications and data meet stringent uptime requirements. As users demand that more services be available to them, the need to maximize application uptime has become commonplace. Developing High Availability (HA) solutions that support high levels of uptime while maintaining the simplicity of the Dell business model is challenging. Dell’s primary focus has been on composing solutions with Windows-based systems, but over the past year the need for high availability cluster solutions has emerged in the Linux market. Likewise, application clustering now enables a distributed approach to computing in which multiple servers are logically grouped into a cluster and viewed as a single entity, as with Oracle Real Application Clusters (RAC).
To facilitate the adoption of credible solutions for Linux, Dell is implementing a set of cluster rules that define a common set of configurations to be supported and sold around Microsoft Cluster Server (MSCS), Linux High Availability (Linux HA), and Oracle RAC cluster solutions. Dell’s High Availability Cluster Configuration rules are designed to ensure no single point of failure (SPOF) in the end-to-end cluster solution. This includes the standalone server, storage system, fabrics, paths, and applications. The Dell HA Cluster solutions are tested end-to-end to ensure maximum availability and reliability.
The matrices below present a generic outline of the components supported in the Small Computer System Interface (SCSI) and Fibre Channel solutions for Windows, Linux HA, and Oracle RAC based HA cluster configurations. As our HA programs mature, each new solution should use similar approaches and components. The various permutations and configurations go through stringent testing, which includes fault injection, to ensure that the entire solution is exercised in a highly stressful environment.
The entire scope of the testing is conducted by the Dell HA cluster development groups. When issues are found, the teams work with the appropriate engineering teams and/or vendors to determine the root cause and develop a solution.
Certifications – Completed when required. However, not all solutions require a certification by the ISV.
No Heterogeneous Storage Components – Because HA clustering is very dependent on the I/O subsystem, intermixing I/O components adds unacceptable risk to the configuration. Data integrity in cluster configurations must never be jeopardized.
Current and previous (N & N – 1) Server configurations – Provide customer investment protection and migration paths for the latest OS and I/O subsystems.
High Availability SCSI Cluster Solution
SCSI-based HA cluster solutions are based on a cluster (server) failover configuration, versus the path and cluster failover supported in the Fibre Channel configurations. There is also a private (heartbeat) network, a dedicated connection for communicating cluster status between the cluster nodes. In Dell’s SCSI-based HA solution there is at least a single RAID controller in each cluster node (server). When a cluster node fails, its resources fail over to another cluster node. The matrix shown later in the article outlines the standard components of each Dell supported cluster. For example, under servers, N is the currently shipping server, such as the PE1750, and N-1 would be the PE1650. Under OSs, N is Windows Server 2003 Enterprise Edition and N-1 is Windows 2000 Advanced Server.
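The N / N-1 rule described above can be read as a simple window over product generations. The following is a minimal sketch of that idea, not a Dell tool: the `GENERATIONS` table, the function names, and the `older-model` placeholder are hypothetical, and the table holds only the examples named in the text rather than any actual support list.

```python
# Illustrative sketch only. The generation lists contain just the examples
# named in the text, ordered oldest to newest; they are not a support matrix.
GENERATIONS = {
    "server": ["PE1650", "PE1750"],
    "os": [
        "Windows 2000 Advanced Server",
        "Windows Server 2003 Enterprise Edition",
    ],
}


def generation_label(kind, name):
    """Return 'N', 'N-1', or None when the item is outside the N / N-1 window."""
    ordered = GENERATIONS[kind]
    if name not in ordered:
        return None
    age = len(ordered) - 1 - ordered.index(name)  # 0 = current, 1 = previous
    return {0: "N", 1: "N-1"}.get(age)


def is_supported(kind, name):
    """Apply the N / N-1 rule: only current and previous generations qualify."""
    return generation_label(kind, name) is not None


print(generation_label("server", "PE1750"))  # N
print(generation_label("server", "PE1650"))  # N-1
print(is_supported("server", "older-model"))  # False
```

Keeping the generations in an ordered list means that adding a newer product at the end shifts every existing entry down one step, which is how the N / N-1 window would move as new servers or operating systems ship.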
Diagram 1 (components shown: Network, Servers, External SCSI Storage)
High Availability Fibre Channel Cluster Solution
Fibre Channel-based HA cluster solutions are based on path failover and cluster failover. By requiring redundant HBAs or paths to the storage, this provides for a higher level of availability than a SCSI-based cluster solution. Redundant Host Bus Adapters (HBAs) are required in each cluster node (server). Redundant HBAs coupled with redundant switches provide the ability to support redundant paths and fabrics connected to the same external storage array. When a path fails, there will be a failover within the cluster node. If both paths fail, then the cluster can fail over to another node in the cluster. The HBAs in a cluster must be identical. Mixed versions of HBAs are not supported in a single cluster configuration, regardless if you are implementing 2 up to 8 nodes in Windows 2003, Enterprise Edition.
Diagram 2 (components shown: Servers (nodes), Fibre Channel Switches, TBU, External Fibre Channel Storage)
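The two-level failover described for Fibre Channel clusters (first to the surviving path within the node, then to another node if both paths fail) and the identical-HBA rule can be sketched as follows. This is a hedged illustration under stated assumptions: the `Node` class, the path names `path_a` and `path_b`, and both functions are hypothetical and do not represent any Dell, MSCS, or HBA driver interface.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a cluster node with two paths to shared storage."""
    name: str
    hba_model: str
    # Path name -> True while that path to the storage array is healthy.
    paths: dict = field(default_factory=lambda: {"path_a": True, "path_b": True})


def hbas_are_homogeneous(nodes):
    """The HBAs in a cluster must be identical: exactly one model in use."""
    return len({node.hba_model for node in nodes}) == 1


def next_action(node, active_path):
    """Pick the failover level for one node, given the health of its paths."""
    if node.paths.get(active_path, False):
        return "stay"
    healthy = [path for path, ok in node.paths.items() if ok]
    if healthy:
        return "path_failover:" + healthy[0]  # fail over within the node
    return "node_failover"  # no path left, move to another cluster node


node = Node("node1", "QLA2340")
node.paths["path_a"] = False
print(next_action(node, "path_a"))  # path_failover:path_b
node.paths["path_b"] = False
print(next_action(node, "path_a"))  # node_failover
print(hbas_are_homogeneous([node, Node("node2", "LP982")]))  # False
```

Path failover is tried first because it keeps the workload on the same node, so a node failover is only needed when every path from that node to the storage array is gone.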
Oracle Real Application Clusters (RAC)
Application clustering enables additional functionalities for specific purposes. While database virtualization technologies such as Real Application Clusters are not yet as widespread as generic HA clustering technologies, they can provide a unique value proposition for a given application or deployment scenario. Oracle RAC is Oracle’s database clustering technology, whereby multiple servers can be grouped in an active-active cluster with shared data. As of today, RAC is the only technology that allows databases to scale out in a shared data model. Based on the RAC technology, any front-end application (such as OLTP applications, Oracle E-Business Suite, SAP, etc.) can connect to the database cluster. RAC is therefore a platform for Oracle clustering at the database level.
Diagram 3
Matrix of Dell Supported Cluster Components
Column groups: Windows HA | Oracle RAC | Linux HA
Product/Feature columns: Win NT EE | W2K AS | W2K3 EE | RH 2.1 AS | RHEL 3 | RHEL 3
Configuration Rules
  Certifications    X X X
  N and/or N-1 Servers    X X X
    26x0, 4600, 64x0, 66x0, 8450
    1750, 26x0, 4600, 64x0, 66x0
  Multiple Clusters on a SAN    X X X X X
  Multiple Clusters Direct Attached    CX600 CX600
  Mixed Storage (on a SAN)    X X X
  Mixed Storage (on a cluster)
  Single Path Configs    X X
  Dual Path Configs    X X X X X
  Homogeneous HBA I/O (no mixing of HBA cards)    X X X X X
Controllers
  Emulex Single Channel
  LP9002L    X X X
  LP982    X X
  QLogic Single Channel
  QLA2200    X X X
  QLA2340    X X X X
  Emulex Dual Channel
  LP9802
  QLogic Dual Channel
  QLA2342    X X
  RAID Controllers
  PERC 3/DC    X X X X X
  PERC 4/DC    X X X
Driver Changes
  Requalification at a minimum. Certification done at next major release
External Storage
  PowerVault™
  PV22xS    X X
  Array Manager    X X X
  SATA SCSI
  PV650    X X
  PV660    X X X
  FC4500    X X X
  FC4700-2    X X X X X
  CX Series
  CX600    X X X X
  CX400    X X X X
  CX200    X X X X
  CX200LC SATA
FC Switches
  Brocade
  8 Port    X X X X X
  16 Port    X X X X X
  32 Port
  McData
  8 Port
  16 Port
  32 Port
  Flex Switch    X X X
Platforms
  Blades SC
  1P Tower
  2P Tower    X X X X X
  4P Tower    X X X X X
  1P Rack
  2P Rack    X X X X X
  4P Rack    X X X X X
  64 Bit 2P Rack
Interconnect
  On-board LOM    X X X X X
  All add-in Ethernet NICs supported by platform    X X X X X
  Heterogeneous Interconnect    X X
  Homogeneous Interconnect    X X X X X
  NIC Teaming    Public Network Only | Public Network Only | Public Network Only | X | X
Win NT EE - Windows NT, Enterprise Edition
W2K AS - Windows 2000, Advanced Server
W2K3 EE - Windows Server 2003, Enterprise Edition
RHEL 2.1 - Red Hat Enterprise Linux 2.1
RHEL 3 AS - Red Hat Enterprise Linux 3 Advanced Server
Application Availability
Because the application is critical, Dell focuses on understanding and proposing applications that are cluster aware, such as Microsoft Exchange and Microsoft SQL Server. With cluster-aware applications, the clustering software can perform an operation to see if an application is responding. When it is not, the cluster software assumes the application is hung and attempts to restart it on the same system. This is referred to as a local recovery. Local recoveries are quicker to perform than a failover to a backup server, so users are usually up and running sooner. If local recovery fails, the clustering software fails resources, including any applications, over to another cluster node. Node failover takes longer, but service is restored once the backup node and application are up and running.
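The recovery order described above (check whether the application responds, try a local restart, then fail over to another node) can be sketched as a small escalation routine. This is only an illustration: `recover`, its callback parameters, and `max_local_restarts` are hypothetical names, not part of MSCS or any other clustering product, and the application state is simulated with a dictionary.

```python
def recover(app_is_responding, restart_locally, fail_over, max_local_restarts=1):
    """Return which step restored service, following the order in the text."""
    if app_is_responding():
        return "healthy"
    for _ in range(max_local_restarts):
        restart_locally()  # local recovery: restart on the same system
        if app_is_responding():
            return "local_recovery"
    fail_over()  # local recovery failed: move resources to another node
    return "node_failover"


# Simulated application in which a local restart does not help.
state = {"up": False, "restarts": 0, "failed_over": False}


def responding():
    return state["up"]


def restart():
    state["restarts"] += 1  # the restart is attempted but the app stays down


def move_to_backup_node():
    state["failed_over"] = True
    state["up"] = True  # the backup node brings the application up


print(recover(responding, restart, move_to_backup_node))  # node_failover
```

Passing the health check and recovery actions in as callbacks keeps the escalation order separate from how a given cluster product actually probes or restarts an application.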
Planned downtime can also be managed more effectively. Hardware and software maintenance can be performed on one of the servers while the other servers continue to provide the needed functionality for users. This important task no longer has to impact users or be performed outside working hours.
As Dell’s high availability portfolio continues to expand, application monitoring and fault prevention are areas that continue to be a primary focus for improving application availability.
Conclusion
Dell is continuing to drive simplicity and standardization within the HA cluster market. Previously, High Availability clustering was considered difficult to plan, productize, implement, test, and sell. Throughout the past several years, Dell has standardized the Windows HA Clustering market, and is now planning to do so for the Linux HA clustering market and the Oracle RAC solutions.