Open Source Development Labs

Carrier Grade Linux

Clustering Requirements Definition

Version 3.2

Prepared by the Carrier Grade Linux

Working Group

Open Source Development Labs, Inc. 12725 SW Millikan Way, Suite 400 Beaverton, OR 97005 USA

Phone: +1-503-626-2455

Copyright (c) 2005 by The Open Source Development Labs, Inc. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is available at http://www.opencontent.org/opl.shtml/). Distribution of substantively modified versions of this document is prohibited without the explicit permission of the copyright holder.

Other company, product, or service names may be the trademarks of others.

Linux is a Registered Trademark of Linus Torvalds.

Contributors to the Clustering Requirements Definition are (in alphabetical order):

Anderson, Tim (MontaVista)
Badovinatz, Peter (IBM)
Chacron, Eric (Alcatel) **
Chen, Terence (Intel) *
Cherry, John (OSDL)
Cress, Andrew (Intel)
Dake, Steven (MontaVista)
Haddad, Ibrahim (Ericsson)
Ikebe, Takashi (NTT)
Johnson, Christopher P. (Sun)
Krauska, Joel (Cisco)
Kukkonen, Mika (Nokia)

* Specification editor
** Assistant specification editor

Comments on the contents of this document should be sent to [email protected] .

1 Introduction to CGL Clustering Requirements
  1.1 CGL Clustering Technology
  1.2 CGL Clustering Environment
  1.3 Rationale for the CGL Clustering Requirements

2 Document Organization

3 Requirements and Roadmap Definitions

4 Clustering Requirements
  4.1 Service Availability Forum (SA Forum) APIs
    CMS.1.0 Cluster Membership Service (SA Forum AIS)
    CES.1.0 Cluster Event Service (SA Forum AIS)
    CCS.1.0 Cluster Checkpoint Service (SA Forum AIS)
    CLS.1.0 Cluster Lock Service (SA Forum AIS)
    CAF.1.0 Cluster Availability Framework (SA Forum AIS)
  4.2 Functional Requirements
    CFH.1.0 Cluster Node Failure Detection
    CFH.2.0 Prevent Failed Node From Corrupting Shared Resources
    CFH.3.0 Application Fail-Over Enabling
    CSM.1.0 Storage Network Replication
    CSM.2.0 Cluster-aware Volume Management for Shared Storage
    CSM.3.0 Shared Storage Mirroring
    CSM.4.0 Redundant Cluster Storage Path
    CSM.6.0 Cluster File System
    CCM.2 Cluster Communication Service
    CCM.2.1 Cluster Communication Service – Logical Addressing
    CCM.2.2 Cluster Communication Service – Fault Handling
    CCM.3.0 Redundant Cluster Communication Path
    CAF.2 Highly Available Network
    CAF.2.1 Ethernet MAC Address Takeover
    CAF.2.2 IP Takeover
    CCS.2.0 SAF-AIS Data Checkpointing Performance
    CMS.2.0 Dynamic Cluster Membership
  4.3 Cluster Management
    CMON.1 Cluster Node Monitoring
    CMON.1.1 Cluster Node HW Status Monitoring
    CMON.1.2 Cluster Events for Node Status Changes
    CMON.1.3 Cluster-Wide Resource Monitor
    CDIAG.2 Cluster-Wide Diagnostic Info
    CDIAG.2.1 Cluster-Wide Identified Application Core Dump
    CDIAG.2.2 Cluster-Wide Kernel Crash Dump
    CDIAG.2.3 Cluster-Wide Log Collection
    CDIAG.2.4 Synchronized/Atomic Time Across Cluster

5 Clustering Roadmap

  5.1 Service Availability Forum (SA Forum) APIs
    CCM.1.0 Cluster Message Service (SA Forum AIS)
  5.2 Functional Requirements
    CSM.5.0 Cluster Synchronized Device Hotswap
    CSM.7.0 Shared Storage Consistent Access
    CCM.2 Cluster Communication Service
    CCM.2.3 Cluster Communication Service – Quality of Service
    CCM.2.4 Cluster Communication Service – Performance
    CCM.2.5 Cluster Communication Service – Event Notification
    CCM.4 Group Messaging Protocol
    CCM.4.1 Group Messaging Agreed Ordering
    CCM.4.2 Group Messaging Safe Ordering
    CCM.4.3 Group Messaging Membership Delivery Guarantee
    CMS.3.0 Single Node Multiple Clusters
    CAF.2 Highly Available Network
    CAF.2.3 Deliberate TCP Session Takeover
    CAF.2.4 TCP Session Takeover on Node Failure
  5.3 Cluster Management
    CMON.1 Cluster Node Monitoring
    CMON.1.4 Cluster-Wide Application Monitor
    CCON.1 Management Control
    CCON.1.1 Run Diagnostics
    CCON.1.2 Boot/Reboot nodes
    CCON.1.3 SW Upgrades
    CCON.1.4 SW Rolling Upgrades
    CDIAG.1 Online Diagnostics
    CDIAG.1.1 Online Diagnostics for Fans and Power Supplies
    CDIAG.1.2 Online Diagnostics for System Components

Appendices
  A.1. Clustering References
  A.2. Definition of Terms

1 Introduction to CGL Clustering Requirements

The clustering requirements in this document are aimed at supporting clustered applications in a carrier-grade environment as an effective way to achieve highly available services inside a network element. We specifically have not addressed the use of clustering to maximize performance, such as in the Beowulf and Mosix solutions.

1.1 CGL Clustering Technology

Network service providers must continually update their network infrastructure to keep up with customer demands for high quality services that take advantage of ever-increasing Internet bandwidth. Providers prefer a solution that is scalable and cost effective, where scalability means adding more computation and networking bandwidth to existing devices.

Clustering multiple computing resources addresses system scalability by allowing new nodes to be added to an existing cluster to meet performance and scalability demands. Network equipment must be capable of providing five to six nines of availability (99.999% to 99.9999%) and should be able to withstand hardware, software, and firmware failures on individual compute nodes. Hardware redundancy of components, such as fans, power supplies, network connections, and RAID storage, is used to mitigate a subset of failures that would otherwise impact the availability of a service. Clustering addresses redundancy at a higher level and contributes to service availability: the goal is to eliminate any single point of failure in the cluster.
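As a rough illustration of what five to six nines means in practice, the Python sketch below converts an availability figure into a yearly downtime budget. The function name and the 365-day year are assumptions made for this example; they are not part of the specification.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def allowed_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime budget for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (5, 6):
        print(f"{n} nines: {allowed_downtime_seconds(n):.1f} seconds per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year and six nines leaves about half a minute, which is why failure detection and failover times appear later in this document as sub-second guidelines.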

Clustering also implies distribution of computing resources. Clustering requirements for CGL are defined in terms of availability, maintenance, and performance. The main objectives of clustering architectures are as follows:

• Improved product performance. Clustering provides a framework that allows multiple computing resources to provide services.

• Increased product and service availability. Clustering increases service availability by preventing any element in the system from being a single point of failure.

• Enhanced product maintenance. Clustering provides a framework for maintenance tasks, such as node replacement and version upgrades (both hardware and software), enabling a cluster of individual systems to be managed as if it were a single system.

• More cost efficient solutions. Clustering enables cost efficiency due to the sharing and pooling of resources within a cluster and by focusing on using commercial off-the-shelf hardware to achieve cluster performance requirements.

Clustering technologies can be divided into several types, which are all related to a certain degree, yet serve different purposes. These types include High Performance Computing (HPC) cluster, scalability cluster, High Availability Cluster (HAC), and Server Consolidation Cluster.

In the process of creating this CGL 3.x clustering specification, we surveyed the member companies to determine how they use cluster applications in their production environment and what is important to them in terms of using carrier grade Linux in their cluster applications. We then produced a set of white papers describing cluster usage models. We studied these cluster usage models and found:

• The need for clustering in carrier environments is common for application availability.

• Different carrier applications use different types of cluster technologies and deployment models. A “one size fits all” clustering model does not exist.

• Common standard interfaces are needed for carrier applications to communicate with the cluster engine and among cluster nodes.

• The priorities for addressing cluster models for a carrier grade system are:

1. High Availability Cluster (HAC)

2. Scalability Cluster

3. Server Consolidation Cluster

4. High Performance Computing (HPC) Cluster

As a result, CGL 3.x clustering requirements will start by focusing on High Availability Clustering (HAC) and then extend to other types of clustering technologies. This requirements specification will also focus on identifying the clustering infrastructure and building blocks that are common to all categories of clustering technologies. Well-established interface and API standards that are designed to address the availability of the clustering system will be leveraged for the requirements in this document.

Appendix A.2 contains definitions of high availability and clustering terms in the context of carrier grade Linux.

1.2 CGL Clustering Environment

We learned from our usage model study that no single clustering model meets the needs of all carrier applications, and we are not going to create such a model. Instead, a more generalized CGL clustering model is presented in this document that serves to identify the functional need of each component of a High Availability Cluster environment. This general model is illustrated in the diagram below, which shows the need for redundancy, stateful failover, and shared storage in a cluster application. This diagram is not a topology of any specific cluster deployment. It is up to application developers and system administrators to determine the usage and configuration of their cluster systems.

Figure 1. HAC Cluster Functional View

The functions shown in Figure 1 are described below:

• 1+1 Hot Standby Cluster is composed of one active primary node and one hot standby node and possibly a set of shared storage. It includes redundant paths between cluster nodes and to the storage.

• Shared Storage provides a set of mirrored disks (for redundant data) and can be achieved with software or hardware.

• Redundant Paths include the multiple communication paths between cluster nodes (CCPs) and the multiple paths from a node to access the storage (CSPs).

• N+M Cluster is the extension of a 1+1 hot standby cluster. In this model, the cluster can be configured with additional hot or cold standby nodes as needed by the application. Functional needs of the data check pointing capability and the access to the shared storage remain the same.

• Data Check Pointing is part of the cluster services. It constantly synchronizes the in-memory states and data of an application allowing the cluster to provide stateful failover of the application from one node to another node.

• Access Shared Storage – A cluster application stores and retrieves application data to and from the redundant shared storage. These data are persistent on the mirrored disks.

• Service Entry Point Director routes and directs which cluster node shall provide the service to the service requester.

• Cluster Management Console is a node in the system that manages all cluster nodes, but is not part of the cluster membership. It provides a view of the cluster to an operator. It monitors the hardware status of the cluster nodes and monitors cluster events such as cluster node failure. The operator can use it to perform some cluster node failure recovery functions, such as rebooting a cluster node so that the node can rejoin the cluster membership.

• Users are the service requesters. A user can be a human being, an external device, or another computer system (see Definition of Terms in Appendix A.2).

End users of carrier grade equipment have prioritized the need for HAC cluster configurations as:

1. 2-node (active/hot standby) cluster that supports:
   o Checkpointing of in-memory application states for rapid application failover
   o Shared storage access from a single node at a time
   o Redundant access to shared storage from a single node
   o Redundant inter-node communication paths

2. 2-node (active/active) cluster that supports:
   o Concurrent access to shared storage

3. N node (active/active) cluster that supports:
   o Storage “scalability”
   o Improved service performance in accessing shared storage

4. N+M node (active/hot or cold standby) cluster that supports:
   o Extension of active/standby pair

1.3 Rationale for the CGL Clustering Requirements

The requirements described in this document are intended to be independent of specific projects, products, or implementations.

The cluster requirements are framed around industry standard application programming interfaces. For these clustering requirements, the SA Forum Application Interface Specification will be used. The SA Forum AIS services that apply to this specification are:

• SA Cluster Membership Service API (Chapter 6)

• SA Checkpoint Service API (Chapter 7)

• SA Event Service API (Chapter 8)

• SA Message Service API (Chapter 9)

• SA Lock Service API (Chapter 10)

The Availability Management Framework API (Chapter 5) provides the following services to SA-aware applications:

• Registration and unregistration

• Health monitoring

• Availability management

• Protection group management

• Error reporting

This document also describes requirements that are not related to cluster application APIs but are needed in a cluster. These include items such as shared storage support, synchronized time, and cluster management functions such as monitoring, control, and diagnostics. A clustered file system and a clustered volume manager are also included because they are essential building blocks for HA clustering, although they have no established APIs.

2 Document Organization

This document is a section of the OSDL Carrier Grade Linux Requirements Definition Version 3.2, which is organized into the separately published sections listed below:

Overview of Requirements Version 3.2

Availability Requirements Definition Version 3.2

Clustering Requirements Definition Version 3.2

Hardware Requirements Definition Version 3.2

Performance Requirements Definition Version 3.2

Security Requirements Definition Version 3.2

Serviceability Requirements Definition Version 3.2

Standards Requirements Definition Version 3.2

Released versions of these sections can be found at http://www.osdl.org/lab_activities/carrier_grade_linux/documents.html/document_view.

3 Requirements and Roadmap Definitions

The contents of each section are organized into:

• Requirements – Describes requirements necessary for a CGL system

• Roadmap – Highlights possible future requirements

Each requirement or roadmap item is described as follows:

ID: A unique identification number including:

• An acronym identifying a category for the requirement (first field)

• An ID number for the requirement (second field)

• An ID number for a sub-requirement (third field). A “0” in this field indicates the requirement is a stand-alone requirement. An empty field indicates the requirement is a summary requirement with sub-requirements. Any other number in this field indicates the requirement is a sequentially numbered sub-requirement.

A summary requirement is also indicated by bolding the header of the requirement.
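The ID scheme described above can be sketched as a small parser. This is a minimal illustration in Python, not part of the specification; the names `RequirementId` and `parse_requirement_id` are hypothetical, and the regular expression assumes that category acronyms consist of uppercase letters only, as in the IDs used in this document.

```python
import re
from typing import NamedTuple, Optional

# Category acronym, requirement number, optional sub-requirement number.
_ID_PATTERN = re.compile(r"^([A-Z]+)\.(\d+)(?:\.(\d+))?$")


class RequirementId(NamedTuple):
    category: str
    number: int
    sub: Optional[int]  # None when the third field is empty

    @property
    def kind(self) -> str:
        """Classify the ID by its third field."""
        if self.sub is None:
            return "summary"
        if self.sub == 0:
            return "stand-alone"
        return "sub-requirement"


def parse_requirement_id(text: str) -> RequirementId:
    """Split an ID such as 'CCM.2.1' into its fields."""
    match = _ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a requirement ID: {text!r}")
    category, number, sub = match.groups()
    return RequirementId(category, int(number), None if sub is None else int(sub))
```

For example, `parse_requirement_id("CMS.1.0").kind` yields "stand-alone", `parse_requirement_id("CCM.2").kind` yields "summary", and `parse_requirement_id("CCM.2.1").kind` yields "sub-requirement".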

Name: Short description of the requirement

Category: The category to which the requirement is assigned. The categories for Clustering are:

CMS.x.x Cluster – Membership Service
CES.x.x Cluster – Event Service
CCS.x.x Cluster – Checkpoint Service
CCM.x.x Cluster – Communication and Messaging
CLS.x.x Cluster – Lock Service
CAF.x.x Cluster – Availability Framework
CMON.x.x Cluster – Cluster Management (Monitoring)
CCON.x.x Cluster – Cluster Management (Control)
CDIAG.x.x Cluster – Cluster Management (Diagnostics)
CSM.x.x Cluster – Shared Storage Management
CFH.x.x Cluster – Fault Handling

Description: Detailed description of the requirement.

4 Clustering Requirements

This section contains requirements specific to supporting clustering. Most are new in this version of the OSDL Carrier Grade Linux Requirements Specification.

4.1 Service Availability Forum (SA Forum) APIs

The Service Availability Forum Application Interface Specification is the de facto standard used for application interfaces to cluster services. Each requirement described below addresses one of the services defined in the SA Forum AIS specification. The Carrier Grade Linux Standards Requirements Definition also references the following SA Forum specifications:

• STD.8.0 – SA Forum Specifications http://www.saforum.org

• STD.8.1 – SA Forum AIS http://www.saforum.org

ID: CMS.1.0
Name: Cluster Membership Service (SA Forum AIS)
Category: Clustering – Membership Service

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster membership service that conforms to the SA Forum Application Interface Specification (AIS).

The SAF-AIS membership service provides access to the cluster membership information, which may change dynamically as nodes join or leave the cluster. It provides processes with a means of retrieving information about the cluster nodes and the cluster membership and enables the process to receive membership change notifications as they occur.

ID: CES.1.0
Name: Cluster Event Service (SA Forum AIS)
Category: Clustering – Event Service

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster event service that conforms to the SA Forum Application Interface Specification (AIS).

The SAF-AIS event service is a publish/subscribe communication service based on the concept of event channels that enables asynchronous communication between publishers and subscribers. An event published by a process on an event channel is delivered to every process in the cluster that has subscribed to that channel, providing a means of inter-node, inter-process communication.
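The event channel concept can be illustrated with a toy, single-process sketch in Python. It does not use or reproduce the SA Forum AIS API: the class `EventBus` and its methods are hypothetical names chosen for this example, and delivery here is synchronous and local, whereas the real service is asynchronous and spans cluster nodes.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A subscriber receives the channel name and the event payload.
Subscriber = Callable[[str, bytes], None]


class EventBus:
    """Toy in-process stand-in for cluster-wide event channels."""

    def __init__(self) -> None:
        self._channels: DefaultDict[str, List[Subscriber]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Subscriber) -> None:
        """Register a callback for every event published on a channel."""
        self._channels[channel].append(callback)

    def publish(self, channel: str, payload: bytes) -> int:
        """Deliver payload to every subscriber; return the delivery count."""
        subscribers = list(self._channels.get(channel, ()))
        for callback in subscribers:
            callback(channel, payload)
        return len(subscribers)
```

The publisher names only the channel, never the receivers, so subscribers on any node can come and go without the publisher being changed.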

ID: CCS.1.0
Name: Cluster Checkpoint Service (SA Forum AIS)
Category: Clustering – Checkpoint Service

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster checkpoint service that conforms to the SA Forum Application Interface Specification (AIS).

The SAF-AIS checkpoint service provides an efficient facility for software processes to record checkpoint data incrementally. During recovery from a failure, the checkpoint service is used to retrieve the previous checkpoint data and resume execution from the state recorded before the failure. The checkpoint service stores checkpoint information in main memory and replicates the checkpoint data on multiple cluster nodes.
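The behaviour described above (incremental writes, in-memory storage, replication on several nodes, recovery from a surviving copy) can be sketched as follows. This is a simplified Python model, not the SA Forum AIS checkpoint API; `CheckpointStore`, its method names, and the use of one dictionary per replica are assumptions made for illustration.

```python
import copy
from typing import Any, Dict, List, Optional


class CheckpointStore:
    """Toy checkpoint: named sections replicated to in-memory replicas."""

    def __init__(self, replica_count: int = 2) -> None:
        if replica_count < 1:
            raise ValueError("at least one replica is required")
        # Each dict stands in for the main memory of one cluster node.
        self._replicas: List[Optional[Dict[str, Any]]] = [
            {} for _ in range(replica_count)
        ]

    def write_section(self, name: str, value: Any) -> None:
        """Incremental update: only the named section is rewritten."""
        for replica in self._replicas:
            if replica is not None:
                replica[name] = copy.deepcopy(value)

    def fail_replica(self, index: int) -> None:
        """Simulate losing the node that holds one replica."""
        self._replicas[index] = None

    def recover(self) -> Dict[str, Any]:
        """Return the last recorded state from any surviving replica."""
        for replica in self._replicas:
            if replica is not None:
                return copy.deepcopy(replica)
        raise RuntimeError("all checkpoint replicas lost")
```

A standby application would call `recover()` after a failover and resume from the returned state; because every write reaches all live replicas, losing one node does not lose the checkpoint.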

ID: CLS.1.0
Name: Cluster Lock Service (SA Forum AIS)
Category: Clustering – Lock Service

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster lock service that conforms to the SA Forum Application Interface Specification (AIS).

The SAF-AIS lock service is a distributed lock service, suitable for use in a cluster, where processes in different nodes might compete with each other for access to a shared resource. Applications and services that can benefit from using a distributed lock manager are transaction-oriented, such as a database, a file system, or a resource controller/manager.

ID: CAF.1.0
Name: Cluster Availability Framework (SA Forum AIS)
Category: Clustering – Availability Framework

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster availability framework that conforms to the SA Forum Application Interface Specification (AIS).

The Availability Management Framework API provides the following services to SA-aware applications:

• Registration and un-registration

• Health monitoring

• Availability management

• Protection group management

• Error reporting

4.2 Functional Requirements

ID: CFH.1.0
Name: Cluster Node Failure Detection
Category: Clustering – Fault Handling

Description: OSDL CGL specifies that carrier grade Linux shall provide a fast, communication-based cluster node failure detection mechanism that is reflected in a cluster membership service. At a minimum, the cluster node failure detection mechanism maintains a list of the nodes that are currently active in the cluster. Changes in cluster membership must result in a membership event that can be monitored by cluster services, applications, and middleware that register to be notified of membership events.

Fast node failure detection must not depend on a failing node reporting that the node is failing. However, self-diagnosis may be leveraged to speed up failure detection in the cluster.

This requirement does not address the issue of how to prevent failing nodes from accessing shared resources (see CFH.2.0 Prevent Failed Node From Corrupting Shared Resources).

Fast node failure detection shall include the following capabilities:

• Ability to provide cluster membership health monitoring through cluster communication mechanisms.

• Support for multiple, redundant communication paths to check the health of cluster nodes.

• Support for fast failure detection. The guideline is a maximum of 250 ms for failure detection. Since there is a tradeoff between fast failure detection and potentially false failure reports, the health-monitoring interval must be tunable.

• Ability to provide a cluster-membership change event to middleware and applications.

Cluster node failure detection must use only a small percentage of the total cluster communication bandwidth for membership health monitoring. The guideline is that the bandwidth used by the health monitoring mechanism shall be linear with respect to the number of bytes per second per node.
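A communication-based detector of this kind can be sketched as a heartbeat monitor with a tunable timeout. The Python below is a minimal illustration under stated assumptions: `HeartbeatMonitor` and its methods are hypothetical names, the 0.25 second default mirrors the 250 ms guideline, and timestamps are passed in explicitly so the example is deterministic (a real implementation would read a monotonic clock and listen on redundant paths).

```python
from typing import Callable, Dict, List, Optional

# Callback signature: (event kind, node name), e.g. ("failed", "node-b").
ChangeCallback = Callable[[str, str], None]


class HeartbeatMonitor:
    """Declares a node failed when no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s: float = 0.25,
                 on_change: Optional[ChangeCallback] = None) -> None:
        self.timeout_s = timeout_s  # tunable: shorter is faster but riskier
        self._on_change = on_change
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat; a previously unknown node joins the membership."""
        is_new = node not in self._last_seen
        self._last_seen[node] = now
        if is_new and self._on_change:
            self._on_change("joined", node)

    def check(self, now: float) -> List[str]:
        """Remove silent nodes and emit a membership event for each one."""
        failed = [n for n, seen in self._last_seen.items()
                  if now - seen > self.timeout_s]
        for node in failed:
            del self._last_seen[node]
            if self._on_change:
                self._on_change("failed", node)
        return failed

    def members(self) -> List[str]:
        """Return the nodes currently considered active."""
        return sorted(self._last_seen)
```

The detector never relies on the failing node to report its own failure: silence is the signal. Lowering `timeout_s` speeds up detection but raises the chance that a briefly delayed heartbeat is treated as a failure, which is the tradeoff the tunable interval is meant to manage.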

ID: CFH.2.0
Name: Prevent Failed Node From Corrupting Shared Resources
Category: Clustering – Fault Handling

Description: OSDL CGL specifies that carrier grade Linux shall provide a way to fence a failed or errant node from shared resources, such as SAN storage, to prevent the failed node from causing damage to shared resources. Since the surviving nodes in the cluster will want to fail over resources, applications, and/or middleware to other surviving nodes in the cluster, the cluster must make sure it is safe to do the failover.

Killing the failed node is the easiest and safest way to protect shared resources from a failing node. If a failing node can detect that it is failing, the failing node could kill itself (suicide) or disable its ability to access shared resources to augment the node isolation process. However, the cluster cannot depend on the failing node to alert the cluster when it is failing, so the cluster must be proactive in protecting shared resources.

External Specification Dependencies:

This requirement is dependent on hardware to provide a mechanism to reset or isolate a failed or failing node.
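The ordering implied by this requirement (confirm isolation first, take over resources second) can be shown with a short Python sketch. The function `fail_over` and its two callbacks are hypothetical; in particular, `fence` stands in for whatever hardware reset or isolation mechanism a platform provides and is assumed to return True only when isolation is confirmed.

```python
from typing import Callable


def fail_over(failed_node: str,
              fence: Callable[[str], bool],
              start_standby: Callable[[str], None]) -> bool:
    """Start the standby only after the failed node is confirmed fenced."""
    if not fence(failed_node):
        # Fencing not confirmed: taking over now could let two nodes
        # write to the same shared storage, so the failover is refused.
        return False
    start_standby(failed_node)
    return True
```

Treating an unconfirmed fence as a reason to stop, rather than proceeding optimistically, is a design choice made for this sketch; it favours protecting shared data over restoring service quickly.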

ID: CFH.3.0
Name: Application Fail-Over Enabling
Category: Clustering – Fault Handling

Description: OSDL CGL specifies that carrier grade Linux shall provide mechanisms for failing over applications in a cluster from one node to another. Applications and nodes are monitored and a failover mechanism is invoked when a failure is detected. Once a failure is detected, the application failover mechanism must determine which policies apply to this failover scenario and then begin the process to start a standby application or initiate the re-spawn of an application within 1 second.

Note: The full application failover time is dependent upon application and node failure detection, the time to apply the failover policies, and the time it takes to start or restart the application. The aggregate failover time for an application must allow the cluster to maintain carrier grade application availability.
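The note above separates the 1 second limit, which covers the time from failure detection to the start of the standby or re-spawn, from the aggregate failover time. A small Python sketch makes the arithmetic explicit; the class name, field names, and the sample figures in the usage note are illustrative assumptions, not values taken from this specification (apart from the 1 second limit itself).

```python
from dataclasses import dataclass

# From CFH.3.0: failover must begin within 1 second of failure detection.
INITIATION_LIMIT_S = 1.0


@dataclass(frozen=True)
class FailoverTimings:
    detection_s: float  # time to detect the node or application failure
    policy_s: float     # time to select and apply the failover policy
    start_s: float      # time to start or restart the application

    @property
    def total_s(self) -> float:
        """Aggregate failover time seen by users of the service."""
        return self.detection_s + self.policy_s + self.start_s

    @property
    def meets_initiation_limit(self) -> bool:
        """True when failover begins within the limit after detection."""
        return self.policy_s <= INITIATION_LIMIT_S
```

With hypothetical figures of 0.25 s for detection, 0.5 s for policy evaluation, and 2.0 s for application start, `FailoverTimings(0.25, 0.5, 2.0)` satisfies the initiation limit while the aggregate outage is 2.75 s, which is the figure that counts against the availability budget.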

ID: CSM.1.0
Name: Storage Network Replication
Category: Clustering – Shared Storage Management

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism for storage network replication. The storage network replication shall provide the following:

• A network replication layer that enables RAID-1-like disk mirroring, using a cluster-local network for multi-node replication of data.

• Multi-node resynchronization of replicated data after node failure and recovery such that replicated data remains available during resynchronization.

ID: CSM.2.0
Name: Cluster-aware Volume Management for Shared Storage
Category: Clustering – Shared Storage Management

Description: OSDL CGL specifies that carrier grade Linux shall provide management of logical volumes on shared storage from different cluster nodes. Volumes in such an environment are usually on physical disks accessible to multiple nodes. Volume management shall include the following:

• Enabling remote nodes to be informed of volume definition changes.

• Providing consistent and persistent cluster-wide volume names.

• Managing volumes from different cluster nodes consistently.

• Providing support for the striping and concatenation of storage. Clustered mirroring of shared storage is not included in this requirement (see CSM.3.0 Shared Storage Mirroring).

ID Name Category

CSM.3.0 Shared Storage Mirroring Clustering – Shared Storage Management

Description: OSDL CGL specifies that carrier grade Linux shall provide cluster-wide data mirroring with shared storage. Each node in the cluster with access to this shared storage must have the same view of mirrored storage. Shared storage must be able to be managed by any node in the cluster. Mirrored shared storage can be provided by either a hardware RAID-1 architecture storage subsystem or by a software RAID-1 implementation.

ID Name Category

CSM.4.0 Redundant Cluster Storage Path Clustering – Shared Storage Management

Description: CGL specifies that Linux shall provide each cluster node the ability to have redundant access paths to the shared storage.

CGL Availability Requirement: AVL.7.1 Multi-Path Access To Storage

ID Name Category

CSM.6.0 Cluster File System Clustering – Shared Storage Management

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster-wide file system. A clustered file system must allow simultaneous access to shared files by multiple computers. Node failure must be transparent to file system users on all surviving nodes. A clustered file system must provide the same user API and semantics as a file system associated with private, single-node storage.

ID Name Category

CCM.2 Cluster Communication Service Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a reliable and high throughput inter-process and inter-processor communication service. This service will provide the communication infrastructure for the cluster and provide support for designing platform-independent, scalable, distributed, highly available, and high performance communications applications.

Some examples of features of this service include node location transparency, support for connection-oriented and connectionless modes, detecting a connection failure and acting on it, providing a socket-based interface, providing fast communication, providing QoS features, and support for a Subscribe/Publish model for IPC events.

ID Name Category

CCM.2.1 Cluster Communication Service – Logical Addressing Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster communication service with a socket-based interface that provides logical addressing for point-to-point and multipoint communication. The communication service must hide the physical topology of the cluster from application programs with this logical addressing scheme. Mapping between logical and physical addresses must be performed transparently. In addition, there must be no user-level distinction between inter- and intra-node communications or between user-space and kernel-space messages. Connection-oriented and connectionless modes must be supported.
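
As an illustration of logical addressing, the sketch below maps logical service names to physical addresses so that a sender never handles node locations directly. The `NameTable` class, its method names, and the `(node, port)` address form are assumptions made for this example; the specification does not prescribe a data structure.

```python
from typing import Dict, List, Tuple

PhysicalAddress = Tuple[str, int]  # (node identifier, port); an assumed representation


class NameTable:
    """Hypothetical table mapping logical service names to physical addresses."""

    def __init__(self) -> None:
        self._bindings: Dict[str, List[PhysicalAddress]] = {}

    def bind(self, name: str, address: PhysicalAddress) -> None:
        self._bindings.setdefault(name, []).append(address)

    def unbind(self, name: str, address: PhysicalAddress) -> None:
        addresses = self._bindings.get(name, [])
        if address in addresses:
            addresses.remove(address)

    def resolve_one(self, name: str) -> PhysicalAddress:
        # Point-to-point: pick the first instance still bound to the name.
        addresses = self._bindings.get(name)
        if not addresses:
            raise LookupError(f"no instance bound to {name!r}")
        return addresses[0]

    def resolve_all(self, name: str) -> List[PhysicalAddress]:
        # Multipoint: every bound instance is a destination.
        return list(self._bindings.get(name, []))


table = NameTable()
table.bind("billing", ("node-a", 7000))
table.bind("billing", ("node-b", 7000))
table.unbind("billing", ("node-a", 7000))  # e.g. node-a left the cluster
print(table.resolve_one("billing"))
```

Because applications address only the name `"billing"`, the move from `node-a` to `node-b` is invisible to them, which is the transparency the requirement asks for.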

ID Name Category

CCM.2.2 Cluster Communication Service – Fault Handling Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a reliable communication service that detects a connection failure, aborts the connection, and reports the connection failure. An established connection must react to and report a problem to the application within 250 ms upon any kind of service failure, such as a process or node crash.

The connection failure detection requirement must offer controls that allow it to be tailored to specific conditions in different clusters. An example is to allow the specification of the duration of timeouts or the number of lost packets before declaring a connection failed.
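
The tunable controls described above can be pictured as a heartbeat monitor with a configurable interval and a configurable number of tolerated losses. This is a minimal sketch under assumed names (`ConnectionMonitor`, `heartbeat_interval_ms`, `max_missed`); it is not an interface defined by this specification, and a real implementation would also abort the connection and notify the application.

```python
from dataclasses import dataclass


@dataclass
class ConnectionMonitor:
    heartbeat_interval_ms: int = 50  # tunable: expected gap between heartbeats
    max_missed: int = 3              # tunable: lost heartbeats tolerated before failure
    last_seen_ms: int = 0
    failed: bool = False

    def on_heartbeat(self, now_ms: int) -> None:
        # Record the arrival time of the latest heartbeat from the peer.
        if not self.failed:
            self.last_seen_ms = now_ms

    def check(self, now_ms: int) -> bool:
        # Declare failure once enough whole intervals passed without a heartbeat.
        missed = (now_ms - self.last_seen_ms) // self.heartbeat_interval_ms
        if missed >= self.max_missed:
            self.failed = True
        return self.failed

    def detection_delay_ms(self) -> int:
        # Delay between the last heartbeat and the failure declaration,
        # assuming check() is polled frequently.
        return self.heartbeat_interval_ms * self.max_missed


monitor = ConnectionMonitor()
monitor.on_heartbeat(100)
print(monitor.check(200), monitor.check(250), monitor.detection_delay_ms())
```

With the assumed defaults the detection delay is 150 ms, which leaves headroom under the 250 ms reporting bound; other clusters would tune both values to their own conditions.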

ID Name Category

CCM.3.0 Redundant Cluster Communication Path Clustering – Communication and Messaging

Description: CGL specifies that Linux shall provide each cluster node the ability to have redundant communication paths to other cluster nodes and for these paths to appear as a single interface to an application.

CGL Availability Requirement: AVL.7.3 Redundant Communication Paths

ID Name Category

CAF.2 Highly Available Network Clustering – Availability Framework

Description: CGL specifies that Linux shall provide stateful failover for network and session connections.

ID Name Category

CAF.2.1 Ethernet MAC Address Takeover Clustering – Availability Framework

Description: OSDL CGL specifies a mechanism to program and announce MAC addresses on Ethernet interfaces so that when a SW Failure event occurs, redundant nodes may begin receiving traffic for failed nodes.

ID Name Category

CAF.2.2 IP Takeover Clustering – Availability Framework

Description: OSDL CGL specifies a mechanism to program and announce IP addresses (using gratuitous ARP) so that when a SW Failure event occurs, redundant nodes may begin receiving traffic for failed nodes.
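
To make the announcement step concrete, the sketch below builds the bytes of a gratuitous ARP request for an IPv4 address over Ethernet: a broadcast frame in which the sender and target protocol addresses are both the address being taken over. The function name and the sample MAC and IP values are assumptions for illustration. Actually transmitting the frame needs a raw socket and elevated privileges, which are deliberately left out.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_IPV4 = 0x0800
ARP_HTYPE_ETHERNET = 1
ARP_OPER_REQUEST = 1


def build_gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Return an Ethernet frame announcing that `ip` is now reachable at `mac`."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(ip)          # 4-byte IPv4 address
    broadcast = b"\xff" * 6
    ethernet_header = broadcast + mac + struct.pack("!H", ETHERTYPE_ARP)
    arp_fixed = struct.pack(
        "!HHBBH",
        ARP_HTYPE_ETHERNET,  # hardware type
        ETHERTYPE_IPV4,      # protocol type
        6,                   # hardware address length
        4,                   # protocol address length
        ARP_OPER_REQUEST,    # operation
    )
    # Sender = the node taking over; target IP equals sender IP (gratuitous).
    arp_addresses = mac + ip_bytes + b"\x00" * 6 + ip_bytes
    return ethernet_header + arp_fixed + arp_addresses


# Sample values: a locally administered MAC and a documentation-range IP.
frame = build_gratuitous_arp(bytes.fromhex("020000000001"), "192.0.2.10")
print(len(frame))
```

Neighbors that receive the frame update their ARP caches, so traffic for the taken-over IP address is redirected to the redundant node.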

ID Name Category

CCS.2.0 SAF-AIS Data Checkpointing Performance Clustering – Checkpoint Service

Description: OSDL CGL specifies that CGL shall provide the SA Forum AIS checkpointing service that meets specified performance requirements.

The “checkpoint read” API must be capable of being executed at least 500 times per second with a total data size of at least 2048 (2K) bytes per API execution.

The “checkpoint write” API must have the same throughput capabilities (at least 500 API executions of at least 2K bytes). All checkpoint replicas must be updated before the “checkpoint write” API returns.
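
The two figures above imply a minimum sustained data rate of 500 × 2048 = 1,024,000 bytes per second for reads and again for writes. The helper below is a hypothetical sketch for checking a measured run against those figures; the function names and the idea of passing in measured counts are assumptions, and no SA Forum AIS API is called.

```python
MIN_CALLS_PER_SECOND = 500   # from the requirement text
MIN_BYTES_PER_CALL = 2048    # from the requirement text


def required_bytes_per_second() -> int:
    # Minimum sustained data rate implied by the two figures.
    return MIN_CALLS_PER_SECOND * MIN_BYTES_PER_CALL


def meets_requirement(calls: int, bytes_per_call: int, elapsed_s: float) -> bool:
    """Check one measured benchmark run against the stated minimums."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (
        bytes_per_call >= MIN_BYTES_PER_CALL
        and calls / elapsed_s >= MIN_CALLS_PER_SECOND
    )


print(required_bytes_per_second())
print(meets_requirement(calls=5000, bytes_per_call=2048, elapsed_s=10.0))
```

For writes, the elapsed time would have to include replica updates, since the call may not return before all replicas are updated.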

ID Name Category

CMS.2.0 Dynamic Cluster Membership Clustering – Membership Service

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability to add a new node (with new IP address) to the cluster dynamically without pre-configuration as part of membership.

4.3 Cluster Management

This section describes requirements that provide the ability for the management console to manage all of the nodes in a cluster with respect to cluster node monitoring, cluster node control, and cluster node diagnostics.

ID Name Category

CMON.1 Cluster Node Monitoring Clustering – Management (monitoring)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to monitor the availability of all nodes in the cluster.

ID Name Category

CMON.1.1 Cluster Node HW Status Monitoring Clustering – Management (monitoring)

Description: OSDL CGL specifies that carrier grade Linux shall conform with the HPI standard to provide the ability for the management console to isolate the hardware failure of a cluster node.

Links to Other Specifications

CGL Standards Requirements Definition:

• STD.8.8 SA Forum HPI

ID Name Category

CMON.1.2 Cluster Events for Node Status Changes Clustering – Management (monitoring)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to detect when the membership changes in a cluster.

ID Name Category

CMON.1.3 Cluster-Wide Resource Monitor Clustering – Management (monitoring)

Description: OSDL CGL specifies that carrier grade Linux shall provide a means to access cluster resources from a centralized location to facilitate easy analysis of the performance of the whole cluster and collection of statistics.

ID Name Category

CDIAG.2 Cluster-Wide Diagnostic Info Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide cluster-aware mechanisms to generate and retrieve software diagnostic information such as core dump and crash dump information.

ID Name Category

CDIAG.2.1 Cluster-Wide Identified Application Core Dump Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster-aware application core dump that uniquely identifies which node produced the core dump. For instance, if a diskless node dumps core files to network storage, the core dump will be uniquely identified as originating from that node.

ID Name Category

CDIAG.2.2 Cluster-Wide Kernel Crash Dump Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster-aware kernel crash dump that uniquely identifies which node produced the crash dump. For instance, if a diskless node dumps crash data to network storage, the data will be uniquely identified as originating from that node.

ID Name Category

CDIAG.2.3 Cluster-Wide Log Collection Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster-wide logging mechanism. A cluster-wide log shall contain node identification, message type, and cluster time identification. This cluster-wide log may be implemented as a central log or as the collection of specific node logs.
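
The "collection of specific node logs" option can be sketched as a merge of per-node logs into one stream ordered by cluster time. The record layout below carries the three required fields plus a free-text message; the `LogRecord` and `merge_node_logs` names and the sample entries are assumptions made for this example.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class LogRecord(NamedTuple):
    cluster_time: float  # cluster-synchronized timestamp, in seconds
    node_id: str         # node identification
    message_type: str    # message type
    text: str


def merge_node_logs(*node_logs: Iterable[LogRecord]) -> Iterator[LogRecord]:
    # Each per-node log is assumed to be sorted by cluster_time already;
    # heapq.merge then yields a single time-ordered view lazily.
    return heapq.merge(*node_logs, key=lambda record: record.cluster_time)


node_a = [LogRecord(1.0, "node-a", "INFO", "service started"),
          LogRecord(4.0, "node-a", "ERROR", "link down")]
node_b = [LogRecord(2.5, "node-b", "WARN", "peer slow"),
          LogRecord(4.2, "node-b", "INFO", "took over service")]

for record in merge_node_logs(node_a, node_b):
    print(record.cluster_time, record.node_id, record.message_type, record.text)
```

The merged order is only as trustworthy as the time synchronization between nodes, which is why the log entries carry cluster time rather than unsynchronized local time.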

ID Name Category

CDIAG.2.4 Synchronized/Atomic Time Across Cluster Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide cluster-wide time synchronization within 500 ms, and must synchronize within 10 seconds once the time synchronization service is initiated.

In a cluster, each node must be synchronized to the same wall-clock time to provide consistency in access times to shared resources (e.g., clustered file system modification and access times) as well as in time stamps in cluster-wide logs.
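
One way to read the 500 ms figure is as a bound on the largest difference between any two node clocks. The sketch below checks that condition; this interpretation, the function names, and the assumption that all readings are sampled at the same instant are illustrative and not stated by the requirement.

```python
from typing import Mapping

MAX_SKEW_MS = 500  # from the requirement text


def max_skew_ms(clock_readings_ms: Mapping[str, int]) -> int:
    """Largest pairwise difference between node clocks, in milliseconds."""
    values = list(clock_readings_ms.values())
    if not values:
        raise ValueError("at least one clock reading is required")
    return max(values) - min(values)


def is_synchronized(clock_readings_ms: Mapping[str, int],
                    limit_ms: int = MAX_SKEW_MS) -> bool:
    return max_skew_ms(clock_readings_ms) <= limit_ms


# Hypothetical readings, in milliseconds since a common epoch.
readings = {"node-a": 1_000_000, "node-b": 1_000_320, "node-c": 999_900}
print(max_skew_ms(readings), is_synchronized(readings))
```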

5 Clustering Roadmap

5.1 Service Availability Forum (SA Forum) APIs

ID Name Category

CCM.1.0 Cluster Message Service (SA Forum AIS) Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a cluster messaging service that conforms to the SA Forum Application Interface Specification (AIS).

The SAF-AIS message service is a buffered message-passing system based on the concept of message queues and queue groups. Message delivery in a cluster is based on priority. The message service guarantees that messages are not duplicated and are persistent in queues until the message is received. A process that is sending messages can request the message service to notify the process whether or not the message was sent successfully. The message service depends on the membership service and uses the membership service to determine active cluster nodes (nodes that can send and receive messages).

5.2 Functional Requirements

ID Name Category

CSM.5.0 Cluster Synchronized Device Hotswap Clustering – Shared Storage Management

Description: OSDL CGL specifies that carrier grade Linux must have the ability to perform online insertion and/or online deletion (hotswap) of shared devices in a cluster. A hotswap indicator, or device light, should not be activated when a shared device is prepared for removal until all operating systems that have access to the shared device have removed references to the device. Each node in the cluster must maintain a consistent cluster-wide view of shared devices.

Example architectures for cluster-synchronized device hotswap are Advanced Telecom Computing Architecture (ATCA) and Compact Peripheral Component Interface (cPCI).

ID Name Category

CSM.7.0 Shared Storage Consistent Access Clustering – Shared Storage Management

Description: OSDL CGL specifies that carrier grade Linux shall provide a consistent method to access shared storage from different nodes, ensuring that partition information is not changed on one node while use of a partition on another node would prevent that change.

ID Name Category

CCM.2 Cluster Communication Service Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a reliable and high throughput inter-process and inter-processor communication service. This service will provide the communication infrastructure for the cluster and provide support for designing platform-independent, scalable, distributed, highly available, and high performance communications applications.

Some examples of features of this service include transparency towards the cluster physical topology, support for connection-oriented and connectionless modes, detecting a connection failure and acting on it, providing a socket-based interface, providing fast communication, providing QoS features, and support for a Subscribe/Publish model for IPC events.

ID Name Category

CCM.2.3 Cluster Communication Service – Quality of Service Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a reliable communication service that guarantees in-sequence, non-replicating, uncorrupted, and loss-free message delivery in connection-oriented mode. In-sequence message delivery is configurable. If the destination is unavailable, an error indicator will be returned to the sender along with information describing which messages could not be delivered. An additional configurable option is for such messages to be returned to the sender.

ID Name Category

CCM.2.4 Cluster Communication Service – Performance Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a fast communication service when transferring data between nodes in a cluster. It shall provide throughput and latency performance improvements when compared to the performance of TCP mechanisms. The transport protocol should take advantage of the cluster-specific physical model and must provide stable and bounded transmission delays.

ID Name Category

CCM.2.5 Cluster Communication Service – Event Notification Clustering – Communication and Messaging

Description: OSDL CGL specifies that carrier grade Linux shall provide a reliable communication service that can relay Inter-Process Communication (IPC) events, like failure events, to interested applications. IPC events are published by the communication service and applications that subscribe to these events are notified when an IPC event occurs.
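
The publish and subscribe relationship described above can be sketched in a few lines. This is an in-process illustration only: the `EventBus` class, the event type string, and the handler signature are assumptions, and a real service would deliver events across nodes.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, dict], None]


class EventBus:
    """Hypothetical in-process stand-in for the event notification service."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, details: dict) -> int:
        # Notify every handler subscribed to this event type and
        # return how many were notified.
        handlers = list(self._subscribers.get(event_type, []))
        for handler in handlers:
            handler(event_type, details)
        return len(handlers)


received = []
bus = EventBus()
bus.subscribe("connection_failed", lambda kind, details: received.append((kind, details)))
notified = bus.publish("connection_failed", {"peer": "node-b"})
print(notified, received)
```

The publisher needs no knowledge of which applications are listening; it only names the event type.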

ID Name Category

CCM.4 Group Messaging Protocol Clustering – Communication and Messaging

Description: CGL specifies that Linux shall provide a group messaging protocol that allows users to send messages to a group, receive messages in a group, receive group membership changes, and provide certain guarantees for delivery. When available, hardware multicast shall be used when sending messages to avoid the overhead of duplicating transmissions when there is more than one listener. When a network partitions, all requirements shall continue to be met with the same reliability.

It is very likely that AIS services would use this technology. Other clustering applications could also use this form of messaging.

ID Name Category

CCM.4.1 Group Messaging Agreed Ordering Clustering – Communication and Messaging

Description: CGL specifies that Linux shall provide a group messaging protocol that guarantees agreed ordering such that all group members receive messages in the same order when published from other members.
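
Agreed ordering can be illustrated with a fixed sequencer: one component stamps every message with a global sequence number, and each member holds back messages until all earlier numbers have been delivered. This is one well-known technique chosen for brevity, not the protocol mandated by this requirement, and the `Sequencer` and `Member` names are assumptions. It also omits failure handling.

```python
from typing import Dict, List, Tuple

Stamped = Tuple[int, str, str]  # (sequence number, sender, payload)


class Sequencer:
    """Assigns a single global order to all group messages."""

    def __init__(self) -> None:
        self._next = 0

    def stamp(self, sender: str, payload: str) -> Stamped:
        seq = self._next
        self._next += 1
        return (seq, sender, payload)


class Member:
    """Delivers messages strictly in sequence-number order."""

    def __init__(self) -> None:
        self._expected = 0
        self._holdback: Dict[int, Tuple[str, str]] = {}
        self.delivered: List[Tuple[str, str]] = []

    def receive(self, message: Stamped) -> None:
        seq, sender, payload = message
        self._holdback[seq] = (sender, payload)
        # Deliver every message that is now contiguous with what was delivered.
        while self._expected in self._holdback:
            self.delivered.append(self._holdback.pop(self._expected))
            self._expected += 1


sequencer = Sequencer()
m0 = sequencer.stamp("node-a", "lock x")
m1 = sequencer.stamp("node-b", "lock y")
m2 = sequencer.stamp("node-a", "unlock x")

first, second = Member(), Member()
for message in (m0, m1, m2):   # arrives in order at the first member
    first.receive(message)
for message in (m2, m0, m1):   # arrives reordered at the second member
    second.receive(message)
print(first.delivered == second.delivered)
```

Both members deliver the same sequence even though the network reordered the messages for one of them.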

ID Name Category

CCM.4.2 Group Messaging Safe Ordering Clustering – Communication and Messaging

Description: CGL specifies that Linux shall provide a group messaging protocol that guarantees safe ordering such that all group members either receive a message, or no group members receive a message. The ordering guarantees provided by agreed ordering must be provided by safe ordering.

ID Name Category

CCM.4.3 Group Messaging Membership Delivery Guarantee Clustering – Communication and Messaging

Description: CGL specifies that Linux shall provide a group messaging protocol that delivers all messages and membership changes in the same order to all cluster members. CGL further specifies that all processors agree on the cluster membership for each membership change.

ID Name Category

CMS.3.0 Single Node Multiple Clusters Clustering – Membership Service

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for a node to be part of multiple instances of clusters that have their own clustering membership.

ID Name Category

CAF.2 Highly Available Network Clustering – Availability Framework

Description: See description in Clustering Requirements section above.

ID Name Category

CAF.2.3 Deliberate TCP Session Takeover Clustering – Availability Framework

Description: OSDL CGL specifies a mechanism to synchronize TCP sockets, buffer structures, and sequence numbers so that redundant nodes may take over TCP sessions originated on other nodes. A deliberate TCP session takeover assumes that TCP session(s) are transferred deliberately and not as the result of unexpected node failure(s).

ID Name Category

CAF.2.4 TCP Session Takeover on Node Failure Clustering – Availability Framework

Description: OSDL CGL specifies a mechanism to synchronize TCP sockets, buffer structures, and sequence numbers so that when a critical resource fails, such as a CPU, memory, or kernel, a redundant node may take over TCP sessions originated on the failed node. Note that when the TCP session(s) are assumed by a redundant node, the sessions will resume from the last checkpoint. TCP traffic should continue even if there is a conflict between the last TCP state of the failed node and the checkpointed TCP state on the redundant node.

5.3 Cluster Management

ID Name Category

CMON.1 Cluster Node Monitoring Clustering – Management (monitoring)

Description: See description in Clustering Requirements section above.

ID Name Category

CMON.1.4 Cluster-Wide Application Monitor Clustering – Management (monitoring)

Description: OSDL CGL specifies that carrier grade Linux shall provide a means to monitor an application in a cluster. This is to facilitate on-demand health analysis of applications running in the cluster and to respond to application failures with administrator-specified actions.

ID Name Category

CCON.1 Management Control Clustering – Management (control)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely control activities such as running diagnostics, updating software, and rebooting the nodes in a cluster.

ID Name Category

CCON.1.1 Run Diagnostics Clustering – Management (control)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely perform a diagnostic on a cluster node.

ID Name Category

CCON.1.2 Boot/Reboot nodes Clustering – Management (control)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely boot or reboot any node in the cluster.

The ability to boot/reboot a cluster node must conform to the HPI standard.

Links to Other Specifications

CGL Standards Requirements Definition:

• STD.8.8 SA Forum HPI

ID Name Category

CCON.1.3 SW Upgrades Clustering – Management (control)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely perform the upgrade of the software on a node. Cluster rolling upgrades will depend on protocol and API compatibility in the software stack.

Links to Other Specifications

CGL Serviceability Requirement: SVC.2.1 – Remote Package Update and Installation

ID Name Category

CCON.1.4 SW Rolling Upgrades Clustering – Management (control)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely perform the upgrade of the software stack on a node. In addition, the cluster must continue to function with the upgraded software stack with compatible protocols and formats until all the nodes in the cluster have been upgraded.

ID Name Category

CDIAG.1 Online Diagnostics Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely perform online diagnostic functions on a node in the cluster.

Sample usage scenario: The application server fails and then fails over to a hot standby node and continues to provide service. The management console receives a notification of node failure. The operator reviews the cluster logs and decides to reboot the node, but to keep the node out of the cluster. The management console operator initiates an online diagnostic on the node that failed and determines that a network card failed an online test. The node is kept out of the cluster while a field technician is dispatched to repair the node.

ID Name Category

CDIAG.1.1 Online Diagnostics For Fans and Power Supplies Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely perform online diagnostic functions on a node in the cluster to diagnose fans and redundant power supplies.

ID Name Category

CDIAG.1.2 Online Diagnostics For System Components Clustering – Management (diagnostics)

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability for the management console to remotely perform online diagnostic functions on a node in the cluster to diagnose system components such as CPUs, memory, interface cards, disks, and disk subsystems.

Appendices

A.1. Clustering References

• Birman, Kenneth P. 1997. Building Secure and Reliable Network Applications. Manning Publishing Company and Prentice Hall.

• Birman, Ken, et al (circa 2000). “The Horus and Ensemble Projects: Accomplishments and Limitations.”

• Chandra, Tushar, Vassos Hadzilacos, Sam Toueg. June 1996. “The Weakest Failure Detector for Solving Consensus”.

• Davis, Roy G. 1993. VAX Cluster Principles. Digital Press.

• Dolev, Danny, and Dalia Malki. 1996. “The Transis Approach to High Availability Cluster Communication.” Comm. of the ACM 39 (April): 64-70.

• Pfister, Greg. 1998. “In Search of Clusters”, Second Edition, Prentice Hall PTR.

• Simmons, Chuck, and Patty Greenwald. 1994. “Oracle Lock Manager Requirements,” Oracle Corporation.

• Thomas, Kristin. 2001. “Programming Locking Applications,” IBM Corporation.

• van Renesse, Robbert, Kenneth P. Birman, and Silvano Maffeis. 1996. “HORUS: A flexible Group Communication System.” Comm. of the ACM 39 (April): 76-83.

• Service Availability Forum: http://www.saforum.org/

• Open Cluster Framework: http://www.opencf.org

The following references discuss virtual synchrony:

• Extended Virtual Synchrony: http://www.cs.jhu.edu/~yairamir/dcs-94.ps

• The Totem Single-Ring Ordering and Membership Protocol: http://citeseer.ist.psu.edu/rd/95751215%2C113197%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/686/http:zSzzSzwww.cs.jhu.eduzSz%7EyairamirzSztocs.pdf/amir95totem.pdf

• Birman, Kenneth. 1987. “Exploiting virtual synchrony in distributed systems.”

The following cluster-related whitepapers can be found at http://developer.osdl.org/cherry/cluster-whitepapers/.

• OSDL Cluster Architecture (OSDL-cluster.html)

• Carrier Grade Linux Clustering Model (cluster_alcatel.doc)

• Ericsson Clustering Model Proposal (cluster_ericsson.pdf)

• The Telecom System View (cluster_intel.pdf)

• Foundational Components of Service Availability (cluster_mv.pdf)

• NTT Clustering Model (cluster_ntt.pdf)

A.2. Definition of Terms

[ ] indicates a term that is defined elsewhere in the definitions of terms.

* Application

A set of [processes], running on a computer [system], that provides a service to the [users] of this [system]. An application is usually referred to as the non operating system portion of the software in a [system].

* Availability

Availability is the amount of time that a [system] [service] is provided in relation to the amount of time the [system] [service] is not provided. [System] [service] downtime could be the result of [system] [failures] (unscheduled downtime) or for things like upgrades, system relocation, or backups (scheduled downtime). A [system] [service] is provided if the [service] is functioning at an acceptable level of [performance] or [scalability]. Availability is commonly expressed as a percentage (see [five-nines] or [six-nines]).

Percent Availability = (time service is provided / total time) X 100
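
As a worked example of the formula above, the sketch below computes percent availability and converts an availability percentage into allowed downtime per year. The function names are assumptions, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def percent_availability(time_provided: float, total_time: float) -> float:
    # Percent Availability = (time service is provided / total time) X 100
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return time_provided / total_time * 100


def downtime_minutes_per_year(percent: float) -> float:
    # Share of the year in which the service is not provided.
    return (100 - percent) / 100 * MINUTES_PER_YEAR


print(round(downtime_minutes_per_year(99.999), 3))   # five-nines
print(round(downtime_minutes_per_year(99.9999), 4))  # six-nines
```

Five-nines availability therefore allows about 5.256 minutes of total downtime per year, and six-nines allows about 0.5256 minutes, or roughly 32 seconds.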

* Cluster

Two or more computer [nodes] in a [system] used as a single computing entity to provide a [service] or run an [application] for the purpose of [high availability], [scalability], and distribution of tasks.

* Communication

The exchange of information between [processes]. These [processes] can be running on the same [node] (intra-node) or on different [nodes] (inter-node). The information includes [events] and [messages].

* Data

Numerical or other information represented in a form suitable for processing by a [process].

* Data Checkpointing

The mechanism by which [application] state is transmitted from an active [service unit] to one or more standby [service units].

* Event

A [communication] with or without data which notifies a set of zero or more [processes] that something took place. This communication can be either within a [node] and/or between [nodes].

* Event service

A publish/subscribe event service that manages [events]. [Events] may be grouped into named channels and may carry attributes such as priority, ordering, retention times, and persistence. A [subscriber] informs the event mechanism that it wishes to receive a certain event. A [publisher] posts an event to the event mechanism to be delivered to all [subscribers] of that event. In this way the [publisher] and [subscriber] are decoupled: they do not have to know about each other directly, only about the event. Events may be asynchronous or synchronous. A [publisher] posting a synchronous event will block or be informed when all [subscribers] have received the event. The [publisher] of an asynchronous event will not block waiting for delivery or be informed when the event is delivered to any [process].

* Failback

The process to migrate back to a [node] after it has been [repaired]. It can be controlled or automatic.

* Failover

The ability to automatically switch a [service] or capability to a [redundant] [node], [system], or [network] upon the [failure] or abnormal termination of the currently-active [node], [system], or [network].

* Failure

The inability of a [system] or [system] component to perform a required function within specified limits. A failure may be produced when a [fault] is encountered. Examples of failures include invalid data being provided, slow response time, and the inability for a [service] to take a request. Causes of failure can be hardware, firmware, software, network, or anything else that interrupts the [service].

* Failure Detection

A failure is ultimately caused by an unmasked [fault] in the [system]. Failure detection is the process, usually from external view, to detect a [failure] of the [service] the [system] is providing.

* Fault

An error in a computer [system] or the [service] it provides. A fault may be masked and not impact the [application] or the [service] it provides. A fault can also be classified as transient or permanent. A fault is often associated with a [system] defect in the software or hardware. A fault can be caused by external stimulus to the [system].

* Fault Confinement

Equivalent to [fault isolation].

* Fault Detection

Ability to detect an abnormal condition (device failure, temperature error, etc.) in the [system].

* Fault Diagnosis

The localization of a [fault] to its repair unit.

* Fault Isolation

Ability to protect the rest of the [system] from the effects of a [fault].

* Fault Prediction

Detecting or forecasting [faults].

* Fault Tolerance

Ability for a [system] to mask a set of [failures] from impacting the [service] it provides.

* Five-nines

Five-nines is measured as 99.999% [service] [availability]. It is equivalent to approximately 5 minutes a year of total planned and unplanned downtime of the [service] provided by the [system].

* Group Multicast

The sending of a single [message] to a set of destination [processes].

* Hand-over

Equivalent to [switch-over]

* Lock Service

The lock [service] is a distributed lock [service], suitable for use in a [cluster], where [processes] in different [nodes] might compete with each other for access to shared resources. A lock [service] may provide the following capabilities: exclusive and shared access, synchronous and asynchronous calls, lock timeout, trylock, deadlock detection, orphan locks, and notification of waiters.

* Message

A [communication] with [data] in a form suitable for transmission. A message may contain attributes of the [communication] such as source, destination, time stamps, and authorization information. It may also contain [application]-specific information.

* MTTF

Mean Time To [Failure]. The average interval of time during which the [system] can provide [service] without [failure].

* MTTR

Mean Time To [Repair]. The average time it takes to resume [service] after a [failure] has been experienced.
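MTTF and MTTR are often combined into a steady-state availability figure using the commonly cited relation availability = MTTF / (MTTF + MTTR). This document does not state that relation itself, so the sketch below is an illustration under that assumption. The function name and the sample figures are hypothetical.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Fraction of time in service, assuming availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: a failure every 99,999 hours on average,
# with one hour on average to resume service.
example_availability = steady_state_availability(mttf_hours=99_999, mttr_hours=1)
```

With these sample figures the result is roughly 0.99999. The relation also shows that availability can be raised either by lengthening the time between failures or by shortening the time to resume service.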

* Network

A connection of [nodes] which facilitates [communication] among them. Usually, the connected nodes in a network use a well-defined [network protocol] to communicate with each other.

* Network Protocols

Rules for determining the format and transmission of data. Examples of network protocols include TCP/IP and UDP.

* High Availability

The state of a [system] having a very high ratio of [service] uptime compared to [service] downtime. Highly available systems are typically rated in terms of number of nines such as [five-nines] or [six-nines].
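The "number of nines" rating translates directly into a yearly downtime budget. The sketch below works through that arithmetic, assuming a 365-day year and counting all planned and unplanned downtime together. The function and constant names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def downtime_seconds_per_year(nines):
    """Yearly downtime allowed by an availability of `nines` nines.

    For example, nines=5 corresponds to 99.999% availability.
    """
    unavailability = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailability


five_nines_minutes = downtime_seconds_per_year(5) / 60
six_nines_seconds = downtime_seconds_per_year(6)
```

Under these assumptions, five nines allows a little over 5 minutes of downtime a year and six nines a little over 30 seconds.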

* Node

A single computer unit, in a [network], that runs with one instance of a real or virtual operating system.

* Node membership

The mechanism by which computer [nodes] join and leave a cluster as well as the mechanism to detect [node] [failure]. A [node] is deemed to be a member if it has joined the [cluster] successfully. A [node] is deemed to be a non-member if it has not joined the cluster or if it has left the cluster. A detected [failure] may result in the [node] leaving the cluster or being isolated from the cluster, depending on node membership policy.
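The join, leave, and failure-detection behavior described above can be sketched as a small membership table. This is only an illustration under stated assumptions: [failure] is detected by missed heartbeats, and the membership policy removes a failed [node] from the cluster rather than isolating it. The class `NodeMembership`, its methods, and the timeout value are hypothetical.

```python
class NodeMembership:
    """Hypothetical membership table driven by heartbeat timestamps."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self._last_seen = {}  # node name -> time of last heartbeat

    def join(self, node, now):
        """A node becomes a member once it has joined successfully."""
        self._last_seen[node] = now

    def leave(self, node):
        """A node that leaves becomes a non-member."""
        self._last_seen.pop(node, None)

    def heartbeat(self, node, now):
        """Record a heartbeat. Heartbeats from non-members are ignored."""
        if node in self._last_seen:
            self._last_seen[node] = now

    def expire_failed(self, now):
        """Detect silent nodes and remove them (the assumed policy)."""
        failed = sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.heartbeat_timeout
        )
        for node in failed:
            del self._last_seen[node]
        return failed

    def members(self):
        return sorted(self._last_seen)


membership = NodeMembership(heartbeat_timeout=3.0)
membership.join("node-a", now=0.0)
membership.join("node-b", now=0.0)
membership.heartbeat("node-a", now=4.0)        # node-b stays silent
failed_nodes = membership.expire_failed(now=5.0)
current_members = membership.members()
```

Time is passed in explicitly so that the example is deterministic. A real implementation would also need the surviving [nodes] to agree on the membership view, which this sketch does not address.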

* Performance

The efficiency of a [system] while performing tasks. Performance characteristics include total throughput of an operation and its impact on a [system]. The combination of these characteristics determines the total number of activities that can be accomplished over a given amount of time.

* Process

A single instance of a software program running on a single [node].

* Process group

A collection of processes registered within [cluster] software.

* Process group membership

The mechanism by which [process] registration, un-registration, and [failure detection] are managed. A [process] is deemed to be a member if it has registered with the [process group] successfully. A [process] is deemed to be a non-member if it has not registered with the process group. A detected [failure] may cause the [process] to become a non-member, depending on the process group membership policy. A [process] can gracefully un-register to depart from the process group. Process group membership also handles authorization to join the membership. Process group membership depends upon [node membership] if process group membership is available on multiple [nodes]. Process group membership is used to execute application [failover] policy.

* Publisher

A [process] that sends [events].

* RAS

[Reliability], [availability], and [serviceability]

* Recovery

To return a failing component, [node], or [system] to a working state. A failing component can be a hardware or a software component of a [node] or [network]. Recovery can also be initiated to work around a [fault] that has been detected, ultimately restoring the [service].

* Redundancy

Duplication of hardware, software, or network components in a [system] to avoid [Single Points of Failure].

* Reliability

The continuation of [service] in the absence of [failure]. Reliability is commonly measured as the [MTTF] of a [system].

* Repair

The process to remove a [fault].

* Replication

A component, [node], or [system] which is configured identically to a base component, [node] or [system] for the purpose of [fault tolerance], [performance], or ease of [service].

* Scalability

How well a solution to some problem will work when the size of the problem increases. In the CGL context, scalability is defined as the ability of a [system] to provide the same level of [high availability] performance when the work load of the [service] increases. The solution to increase the [system] or [service] scalability can be software or hardware oriented.

* Service

A set of functions provided by a computer [system]. Examples of communications services include media gateway, signal, or soft switch types of applications. Some general examples of services include web based or database transaction types of applications.

* Service Unit

A collection of one or more software [processes] that provide [service] to a [user].

* Serviceability

The capability of a [system] to be maintained and updated. Often, serviceability is measured by how easily a maintenance task can be performed or how quickly a [system] [fault] can be tracked down and repaired so that the [system] can resume the [service].

* Single Point of Failure

Any component or [communication] path within a computer [system] that would result in an interruption of the [service] if it failed.

* Six-nines

Six-nines is measured as 99.9999% [service] [availability]. It is equivalent to about 30 seconds a year of total planned and unplanned downtime of the [service] provided by the [system].

* Subscriber

A [process] that receives [events]. A [subscriber] may subscribe to one or many [events]. A subscriber may join and leave an event subscription at any time without involving the publishers.
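The decoupling between publishers and subscribers described above can be sketched with a minimal in-process event channel. It assumes that callbacks stand in for subscriber [processes] and that events are identified by a plain string name. The class `EventChannel`, its methods, and the event name `node_failed` are hypothetical.

```python
from collections import defaultdict


class EventChannel:
    """Hypothetical in-process event channel keyed by event name."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # event name -> callbacks

    def subscribe(self, event_name, callback):
        self._subscribers[event_name].append(callback)

    def unsubscribe(self, event_name, callback):
        if callback in self._subscribers[event_name]:
            self._subscribers[event_name].remove(callback)

    def publish(self, event_name, payload):
        # Iterate over a copy so a callback may unsubscribe while running.
        for callback in list(self._subscribers[event_name]):
            callback(payload)


channel = EventChannel()
received = []
handler = received.append

channel.subscribe("node_failed", handler)
channel.publish("node_failed", "node-b")     # delivered to the subscriber
channel.unsubscribe("node_failed", handler)
channel.publish("node_failed", "node-c")     # no subscriber left to receive it
```

The publisher calls `publish` in the same way whether or not anyone is subscribed, which is the sense in which a subscriber can join and leave without involving the publishers.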

* Switch-over

Ability to switch to a [redundant] [node], [system], or [network] upon a normal termination of the currently-active [node], [system], or [network]. Switch-over can happen with or without human intervention.

* System

A computer system that consists of one computer [node] or many nodes connected via a computer network mechanism.

* User

An external entity that acquires [service] from a computer [system]. It can be a human being, an external device, or another computer [system].