
Equitrac Office and Express DCE High Availability White Paper

Version 1.1
November 01, 2016

© 2016 Nuance Communications, Inc. All rights reserved.


Contents

Summary
Introduction
  What is a High Availability Cluster
  Equitrac High Availability Core Components
    Role of the DCE Server
    Role of Couchbase
    Role of the Network Load Balancer
Setting up a Couchbase Cluster
  Installing Couchbase
  Creating Indexes
Setting up a Load Balancer
  Configuring a Virtual Service
  Creating a DNS Record
  Adding a Loopback Adapter
  Adding the IP Address Variable
Distributing DCE Load and Scaling
  Distributing the Load
  Node Scaling
  DCE Cache Limitations
Failure Scenarios
  DCE Node Failover
    Enabling and Disabling a Node
    Deleting a DCE Node
    Adding a DCE Node
    Running a DCE Health Check
  Couchbase Node Failover
    Removing a Failed Over Node
    Recovering a Failed Over Node
    Couchbase Node Fails Without Auto-Failover
  Automatic Failover
    Resetting the Automatic Failover Counter
    Automatic Failover Considerations
    Couchbase Protections
Recommendations
  Couchbase Cluster
  DCE Cluster


Summary

This white paper applies to setting up a DCE High Availability environment for Equitrac Office and Express 5.7.

High Availability (HA) is not a specific technology, but rather it refers to a system that is continuously operational and available without failure for a long time. HA environments are designed to provide full-time system availability and have redundant components that make the system fully available in the event of a failure. To avoid having single points-of-failure, any component that can fail has a redundant component of the same type with the capacity to carry the load put on it by the failed over component.

A network computer system consists of many parts in which all parts usually need to be present in order for the whole system to be operational. HA systems are built for availability, reliability and scalability. To ensure a component is available and operational, system monitoring is used to quickly detect if a component is available or has failed. A system is more reliable if it has the ability to continue functioning when part of the system fails, that is, it allows for automatic switching to another cluster component in the event of failure. HA systems rely on scalability to incrementally add components to an existing cluster to meet increasing performance demands.

The Equitrac HA print system is designed to allow the user to release print jobs at the multi-function devices (MFP) without interruption, even in the event of a failure between the MFP and Device Control Engine (DCE). The core components of the Equitrac HA environment are: the DCE for communication and user authentication between the MFP and the Equitrac Core Accounting Server (CAS); the Couchbase Server to distribute the DCE cache; and the Network Load Balancer (NLB) that distributes service requests across the servers in the cluster.

The DCE provides communication with the MFP and CAS. DCE communicates with CAS to verify user credentials, and forwards the job information generated by these devices for tracking in the accounting database. To ensure the user successfully releases their print job even when a failure occurs, the job request is rerouted to the redundant or backup component, and the failover and rerouting process is transparent to the user. DCE forwards print release requests to the Document Routing Engine (DRE)/Document Routing Client (DRC) nodes in the cluster.

The Couchbase Server is a NoSQL document database with a distributed architecture designed for performance, scalability, and availability. The Couchbase data store shares out the running information needed to allow any DCE node in an HA cluster to handle the requests from the client. Couchbase server replicates data across multiple nodes, ensuring that additional copies of the data are available in case of failure. Couchbase is installed on multiple servers in the cluster to ensure high availability, with a minimum of three Couchbase cluster nodes required to support automatic failover. It is recommended that Couchbase and DCE are installed on separate nodes so that the Couchbase or DCE servers can be independently serviced without affecting the solution.

The NLB uses multiple, independent, identically configured servers to distribute service requests across each of the servers in the cluster. This process reduces the risk of any single server becoming overloaded. This also eliminates server downtime because the servers can be individually rotated and serviced without taking the service offline. NLB is an effective way of increasing the availability of critical applications. When server failures are detected and a server fails over, the traffic is automatically distributed to servers that are still running. If a cluster node fails over, it is important that the other nodes have the capacity to handle the extra load placed on them by the dropped node.

In an Equitrac HA print environment, the NLB appliance sits between the MFP and multiple DCE servers, and listens on specific ports for requests from the MFP, and then decides which DCE server to send the requests to. In this setup, the workflow is uninterrupted by having multiple DCEs across different servers connected to an NLB to distribute the workload.

Not only does load balancing lead to high availability, it also facilitates incremental scalability by gradually adding components to an existing cluster to meet increasing performance demands. NLB and Couchbase work together to distribute the load among multiple nodes in the DCE HA cluster. Couchbase clustering operates in a horizontal fashion, and there are no hierarchical structures involved. Couchbase scales linearly in terms of storage capacity and performance. To increase the size of a cluster, you can add another node to further distribute the load.


When adding a new node to the cluster or when a node is taken offline for maintenance, it is important to consider how it is brought back into the cluster as the node may become overloaded with connection requests. To avoid overloading the new node, consider configuring NLB features designed to delay or slowly ramp up the distribution of connections to a node that has become available. The amount of traffic is based on the ratio of how long the node has been available compared to the slow ramp time, in seconds.

The possibilities of failures must be considered when planning the size and capacity of the cluster. For example, if a DCE cluster node fails over, the NLB drops the node from the pool list of available DCEs, and the load increases on the remaining DCE nodes. It is important that the other DCEs have the capacity to handle the extra load placed on them by the dropped node.

When a failed node has been detected, it is important to consider how it is brought back online. When a DCE node is brought back online, the node may become overloaded with connection requests. When an NLB is configured to use the Least Connection method, it selects the DCE service with the least number of active connections to ensure that the load of the active requests is balanced across the services. In the case of a newly-enabled node, the system sends all new connections to that node, since it has the fewest active connections. To avoid overloading the DCE node when it is brought back online, slowly ramp up the number of connections until the node is fully operational.

A Couchbase automatic failover is the process in which a Couchbase node is removed immediately from the cluster as opposed to a manual removal and rebalancing. The Couchbase Cluster Manager continuously monitors node availability and can automatically failover a node if it determines that a node is unavailable. If the Couchbase cluster is not properly sized, the increased load could cause another node to slow down and appear to have failed, and then the Couchbase Cluster Manager automatically fails it over. This can lead to a cascading failure of multiple or all nodes in the cluster. To prevent cluster failure, the automatic failover is limited to a single node in the cluster. To avoid potential automatic failover issues, proper sizing and capacity planning of the cluster must include the possibilities of failures. The best practice is to use automatic failover, but the cluster must be sized correctly to handle unexpected node failures and allow replicas to take over and successfully handle the additional load.

Another situation that may lead to problems is if a network device fails and causes the network to split, or partition, into smaller clusters. Automatic failover is designed to fail over only one node at a given time, preventing a two-node cluster from having both nodes fail in the case of a network partition, and preventing data corruption from multiple nodes trying to access the same data.

In an Equitrac HA environment, there are numerous recommendations to ensure availability and scalability of a Couchbase cluster. A minimum of three Couchbase cluster nodes are recommended to service one or more DCE clusters to ensure high availability and automatic failover. Automatic failover is designed to failover only one node at a given time, thus preventing a cascading failure of multiple or all nodes in the cluster. It is best practice to install Couchbase and DCE on separate nodes for dedicated cache, and to keep all Couchbase nodes inside the same data center. The Equitrac HA solution is not designed for multi-site or multi-CAS failover. One Couchbase cluster of three nodes or more could support up to 10 DCE clusters, if they are on dedicated nodes.

There are several recommendations for optimal DCE clustering. It is recommended to run a DCE HealthCheck on the NLB to reset client connections to the failing DCE node when "service down" is detected. If a DCE cluster node fails over, the embedded clients can reconnect to an alternate DCE node in the cluster and continue the user session. Once the NLB detects a node failure it stops routing new client connection requests to the failing DCE node. Clients with existing connections to the failing node may have to wait for a connection timeout. An alternative is to configure the NLB to reset existing client connections as soon as the failure is detected, causing the client to request a new connection without waiting for a network timeout.

There is a recommended maximum of 600 supported devices in a single cluster with three DCE nodes due to current SQLite cache performance constraints on individual DCE nodes.


Introduction

What is a High Availability Cluster

A high availability (HA) cluster consists of a number of components with a common monitoring and failover infrastructure and no single point-of-failure. The mutually independent and automatic monitoring functions on the connected systems enable clear and immediate error detection.

HA environments are designed to provide full-time system availability. HA systems typically have redundant hardware and software that makes the system fully available in the event of a failure and helps distribute workload. To ensure that HA systems avoid having single points-of-failure, any hardware or software component that can fail has a redundant component of the same type.

Failures can be system faults, data errors and site outages. System faults can occur in the system itself (CPU, memory or power supply), in the software (operating system, database or application), or as peripheral device errors, power outages, network disconnections, and even operator error. These types of failures are considered unplanned downtime. Systems can be manually failed over as part of a planned downtime, such as software or hardware upgrades, and system maintenance.

When failures occur, the processes performed by the failed component are moved (or failed over) to the redundant or backup component, or absorbed by other similar nodes in the system. This process resets system-wide resources, recovers partial or failed transactions, and restores the system to normal as quickly and as seamlessly as possible. A highly available system is almost transparent to the users.

HA for Print Release allows users to release jobs at the MFP even in the event of a failure between the MFP and DCE. The user can still release jobs at an MFP when DCE or CAS fails after the user has already authenticated at the device.

Equitrac High Availability Core Components

The core components of an Equitrac HA print setup are the Device Control Engine (DCE), the Couchbase Server and the Network load balancer (NLB).

Role of the DCE Server

The Device Control Engine (DCE) provides communication with multi-function printers (MFP) and the Equitrac Core Accounting Server (CAS). DCE communicates with CAS to verify user credentials, and forwards the print job information generated by these devices for tracking in the accounting database. In a non-HA environment, if the DCE server goes down, and communication between DCE and CAS is lost, users may not be able to be validated and complete any transactions until the connection to the DCE server is restored. DCE forwards print release requests to the Document Routing Engine (DRE)/Document Routing Client (DRC) nodes in the cluster.

Role of Couchbase

Couchbase Server is a NoSQL document database with a distributed architecture for performance, scalability, and availability. Couchbase is used in an Equitrac HA environment to distribute the DCE cache, and should be installed on multiple servers to ensure high availability. The Couchbase data store shares out the running information needed to allow any DCE node to handle the requests from an embedded client.

It is recommended that Couchbase and DCE are installed on separate nodes so that the Couchbase or DCE servers can be independently serviced without affecting the solution.


Role of the Network Load Balancer

Network load balancing (NLB) uses multiple independent servers that have the same setup, but do not work together as in a failover cluster setup. In a load balancing solution, several servers are configured identically and the load balancer distributes service requests across each of the servers fairly evenly. This process reduces the risk of any single server becoming overloaded. This also eliminates server downtime because the servers can be individually rotated and serviced without taking the service offline.

NLB is an effective way of increasing the availability of critical applications. When server failures are detected, the failed servers are seamlessly replaced as traffic is automatically distributed to the servers that are still running. Not only does load balancing lead to high availability, it also facilitates incremental scalability by gradually adding components to an existing cluster to meet increasing performance demands. Network load balancing facilitates higher levels of fault tolerance within service applications.

In a highly available Equitrac print environment, the NLB appliance sits between the MFP and multiple DCE servers, and listens on specific ports for requests from the MFP, and then decides which DCE server to send the requests to. In this setup, the workflow is uninterrupted by having multiple DCEs distributed across different servers connected to an NLB to distribute the workload.

Example DCE High Availability workflow.


Setting up a Couchbase Cluster

The Couchbase server is a NoSQL document database used to distribute the DCE cache, and is required when installing DCE. When DCE is selected during Standard installation, there is the option to include DCE in a high availability (HA) setup or not.

When installing DCE as a part of the HA setup, you must download and install the Couchbase server on multiple remote servers in your deployment before installing DCE. When setting up Couchbase, the administrator needs to configure the parameters for the Couchbase fields that the EQ DCE installer prompts for during installation and that are required for the DCE remote cache connection. These parameters include the IP address/hostname for the Couchbase server nodes, data bucket names and Admin credentials. A data bucket refers to a container Couchbase uses to provide data persistence and data replication. Data stored in buckets is highly available and reconfigurable without server downtime.

When installing DCE, the Couchbase server connection information, such as the IP address/hostname of the DCE virtual server, the connection string to the Couchbase database, and the administrator’s credentials on the Couchbase server node are required. The Couchbase node and bucket are validated on each DCE service start-up.

Any changes to the Couchbase server connection or administrator credentials can be done after installation by updating the Equitrac (Office/Express) installed program. Changes can also be made by modifying the connection string environment variable and credential registry keys and restarting the DCE service. Changes to the Couchbase nodes or data buckets can be made through the Couchbase console website.

Couchbase server replicates data across multiple nodes, ensuring that additional copies of the data are available in case of failure. At least three Couchbase cluster nodes are required to support automatic failover.

Ports 8091, 8093, and 9100-9105 are used by Couchbase. If these ports are in use by another process then DCE and Couchbase will not function correctly, and the install may fail.
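As a quick pre-installation check, the following command can be run from a Windows command prompt on each prospective node to see whether another process is already listening on any of these ports (an empty result means the ports are free; findstr treats the space-separated patterns as alternatives):

netstat -ano | findstr ":8091 :8093 :9100 :9101 :9102 :9103 :9104 :9105"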

Installing Couchbase

In an HA setup, the Couchbase server must be installed and configured before installing DCE. The Couchbase installation file can be downloaded from http://www.couchbase.com/nosql-databases/downloads. Select the version that best suits your deployment and server platform.

To install and configure Couchbase, do the following:

1 Download and run the Couchbase Server executable following the Installation Wizard prompts.

2 Open the Couchbase console, and click the Setup button.


3 On the Configure Server screen, set the following:

a Determine the Database Path and Index Path.

b Enter the Server Hostname.

c Start a new cluster, or join an existing cluster.

• Select the Start a new cluster option when installing the first Couchbase cluster, and select the Services you want to enable on the node, and the amount of memory per service.

—Or—

• Select the Join a cluster now option when adding more Couchbase nodes to the cluster, enter the cluster IP Address, the Couchbase Administrator credentials and Services on the node.

d Click Next.

4 On the Sample Buckets screen, click Next to continue. The sample buckets are designed to demonstrate the Couchbase server features, and are not required for configuration.


5 On the Create Default Bucket screen, click Next to use the defaults.

6 On the Notifications screen, optionally enable software notifications, and community updates, and click Next.

7 On the Configure Server screens, create an administrator account for the server, and click Next.

The Couchbase Console opens and is ready to use. After installation, Couchbase is set up with only one server node and one data bucket.
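As an alternative to the wizard, the same initial setup can be scripted with the couchbase-cli tool found in the Couchbase Server bin folder. The following is only a sketch; the credentials, RAM quotas and bucket name (Nuance, matching the example later in this paper) are illustrative, and exact flags can vary between Couchbase versions:

couchbase-cli cluster-init -c 127.0.0.1:8091 --cluster-username=Administrator --cluster-password=<password> --cluster-ramsize=2048 --services=data,index,query

couchbase-cli bucket-create -c 127.0.0.1:8091 -u Administrator -p <password> --bucket=Nuance --bucket-type=couchbase --bucket-ramsize=512 --bucket-replica=1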


To add or manage Server Nodes, Data Buckets and Cluster Settings, do the following:

1 Select the Server Nodes tab and click the Add Server button since failover requires at least two servers.

a Enter the Server IP Address, the cluster Administrator credentials and Services on the node.

b Ensure all services are selected.

c Click Add Server.

2 Repeat this process to add the desired number of server nodes required for your environment.

3 Click Rebalance and wait for rebalancing to complete.

4 Select the Data Buckets tab, click the Create New Data Bucket button, and then click Save to use the defaults.


5 Select the Settings tab, click the Cluster button, and then enter a Cluster Name and click Save. Leave the other settings at their default values.

6 On the Settings tab, click the Auto-Failover tab, enable the auto-failover feature, and provide the Timeout for how long (in seconds) a node is down before it fails over.

At least three Couchbase cluster nodes are required for Auto-Failover. Automatic failover is designed to fail over only one node at a given time, preventing a chain reaction failure of multiple or all nodes in the cluster. More Couchbase nodes in a cluster decrease the potential impact on users when failover occurs: the percentage of MFPs available for users to log into while failover completes increases with the number of available Couchbase nodes.

By default, the Timeout is set to a 120-second delay before a node automatically fails over. The timeout can be increased, or set to as low as 30 seconds; however, it is not recommended to decrease it below 120 seconds. This minimum time is needed by the software to perform multiple checks of that node’s status to confirm that the node is down. The longer minimum timeout helps prevent false positives, such as failover of a functioning but slow node or failover due to temporary network connection issues.
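The node additions, rebalance and auto-failover settings above can also be applied with couchbase-cli. A rough sketch, assuming an existing cluster node at 10.1.2.3, a new node at 10.1.2.4 and Administrator credentials (all values illustrative; flags can vary slightly between Couchbase versions):

couchbase-cli server-add -c 10.1.2.3:8091 -u Administrator -p <password> --server-add=10.1.2.4:8091 --server-add-username=Administrator --server-add-password=<password> --services=data,index,query

couchbase-cli rebalance -c 10.1.2.3:8091 -u Administrator -p <password>

couchbase-cli setting-autofailover -c 10.1.2.3:8091 -u Administrator -p <password> --enable-auto-failover=1 --auto-failover-timeout=120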


Creating Indexes

An index is a data structure that provides a quick and efficient means to query and access data. Couchbase indexes are not automatically created for HA installations, and need to be manually created on the Couchbase cluster nodes. The administrator must create indexes on multiple nodes within the Couchbase cluster. Indexes must be created on at least one node for DCE to function, and on at least two nodes to enable Couchbase Auto-Failover to allow for continued operation after failover. The increased number of indexes aims to ensure that at least one index is up after a failure. Creating indexes on additional nodes increases redundancy, however, it also increases the memory requirements on these nodes.

After the Couchbase cluster and bucket are created, cbinit.exe must be run to create the indexes. The cbinit.exe file is in the C:\Program Files\Equitrac\Express or Office\Device Control Engine folder.

To create indexes on one Couchbase node, run the following command. Repeat this command for each node that will contain the indexes. All parameters are case-sensitive and must be typed as shown.

C:\Program Files\Equitrac\Express\DCE\cbinit.exe /h <hostname> /u <username> /p <password> /b <bucket> /n <node> /s <suffix>

The cbinit.exe parameters are:

• hostname – The hostname of one of the Couchbase nodes in the cluster.

• username – Name of the Couchbase administrator.

• password – Password of the Couchbase administrator.

• bucket – The name of the bucket to create the index on. This is the bucket name used in the Couchbase connection string.

• node – The name of the node to create the index on. The list of server names in the cluster can be seen from the Couchbase console under Server Nodes.

• suffix – Suffix to append to the index names. This is needed to create a unique index name across the entire Couchbase cluster to ensure that all indexes have different names. The suffix can be any value as long as the name does not match an existing index.

Example for creating indexes on a Couchbase node with IP address 10.1.2.3:

C:\Program Files\Equitrac\Express\Device Control Engine\cbinit.exe /h 10.1.2.3 /u Administrator /p ****** /b Nuance /n 10.1.2.3 /s node-10_1_2_3

The Couchbase connection string must be provided to each DCE node when installing Equitrac Office or Express. The string is in the form: Couchbase://server1,server2/bucketname. On startup, the DCE service connects to the Couchbase cluster using one of the servers listed in the connection string and then bootstraps itself to discover the remainder of the Couchbase nodes. It is not required to add all Couchbase nodes to the connection string; however, at least one of the listed nodes must be accessible on startup for DCE to establish a connection to the bucket. Couchbase discovers the remaining nodes.
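For example, a three-node Couchbase cluster holding a bucket named Nuance (node addresses shown here are illustrative) would use a connection string such as:

Couchbase://10.1.2.3,10.1.2.4,10.1.2.5/Nuance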


Setting up a Load Balancer

When setting up a high availability DCE NLB, there are certain configuration requirements that must be met:

• The NLB appliance sits between the MFP and the DCE server, and requires TCP ports 2939 and 7627 for HP OXPd and TCP port 2939 for Ricoh and Lexmark to listen for requests from the MFPs.

• The MFP IP addresses must be able to pass through the NLB appliance without changing the MFP source IP address. The NLB must preserve the IP addresses of the MFP so that the DCE service sees the request originating from the individual MFP IP Address and not the NLB.

• The responses sent to MFPs use the Virtual IP (VIP). When setting up an NLB, the recommended load balancing method used is Layer 4 in a direct server return (DSR)/N-Path/direct routing configuration. Layer 4 load balancing uses information defined at the networking transport layer as the basis for deciding how to distribute the requests across a group of servers.

• HA DCE nodes need to be able to initiate connections to the MFP using TCP/UDP.

Configuring a Virtual Service

A virtual service is required on your NLB appliance for DCE with a Virtual IP (VIP) assigned to it. When setting up an NLB VIP, it is very important that the source IP Address is preserved. That is, the EQ DCE service must see the request originating from the individual MFP IP Address and not the NLB appliance.

The following configuration example is for an F5 NLB; a tmsh command sketch follows the settings below. Consult your NLB vendor for configuration and setup support.

• Configure the Virtual Service IP Address (VIP).

• Set the Ports to 2939 (for Lexmark and Ricoh) or 2939 and 7627 (for HP OXPd).

• Set the Protocol to TCP.

• Set the Load balancing Forwarding Method to Direct Routing. (layer 4/direct routing/direct server return/N-Path)

• Ensure the Persistent checkbox is not selected.

• Set the Check Port for server/service online to 2939.
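The following tmsh sketch mirrors these settings, assuming a VIP of 10.1.2.100, three DCE nodes at 10.1.2.11 through 10.1.2.13, and illustrative object names. No persistence profile is applied, matching the unchecked Persistent setting, and translation is disabled for direct server return. An equivalent virtual server on port 7627 would be added for HP OXPd:

tmsh create ltm pool dce_pool_2939 members add { 10.1.2.11:2939 10.1.2.12:2939 10.1.2.13:2939 }

tmsh create ltm virtual dce_vs_2939 destination 10.1.2.100:2939 ip-protocol tcp profiles add { fastL4 } pool dce_pool_2939 translate-address disabled translate-port disabled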


Creating a DNS Record

On the DNS server, create a hostname and corresponding Host A (or Address) record for the virtual DCE that matches the Virtual IP (VIP) for the NLB. This hostname or A record is used to resolve the virtual DCE name to the NLB VIP allowing traffic to point to the NLB.

When installing DCE in HA mode, the installer prompts the user to supply a virtual server name. This virtual server name should match the DNS record.

When configuring the MFP devices to connect back to the highly available DCE, the DCE hostname/IP Address should be the VIP for the NLB and not the individual DCE hostname or IP Address. The DCE nodes and CAS need to be able to resolve the Virtual DCE Hostname to the VIP.
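On a Windows DNS server, the Host A record can also be created from the command line with dnscmd; in this sketch the DNS server name, zone, virtual DCE hostname and VIP are all illustrative:

dnscmd DNSSERVER01 /RecordAdd corp.example.com dce-virtual A 10.1.2.100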

Adding a Loopback Adapter

Typical Layer 4 NLB deployments require that all servers placed behind a load balanced service have primary and secondary network interface cards (NIC) configured. The primary NIC provides the server with a dedicated, full-time connection to a network. The secondary NIC does not physically connect to the network.

When clients request a service via the NLB appliance, they contact an IP Address/Hostname configured on the NLB appliance specifically to listen for requests for that service. This is the Virtual IP (VIP) of the NLB appliance. Since the NLB appliance forwards on these requests directly to the servers offering the service without altering the destination IP Address, the servers themselves must contain at least one NIC assigned with the same IP Address as the VIP. If they do not, then the request from the client is rejected as the servers assume that the request was not intended for them.

It is equally important that the secondary NIC added to each server does not actually connect to the production LAN. This ensures that when any client wishes to connect to the NLB appliance on its VIP, the servers with the secondary NIC also containing the VIP do not respond directly to the clients. This would initiate a direct connection between the client and the server and would avoid sending the traffic via the NLB appliance.

In order to avoid direct client to server connection, the majority of NLB appliance vendors advise to add the secondary NIC as a loopback adapter, as this is a virtual interface that does not physically connect to a network.
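As an illustration only, after adding a Microsoft Loopback Adapter through the Windows Add Hardware wizard, the VIP and the weak host model settings commonly used for direct server return could be applied as follows (the interface names and VIP are placeholders; follow your NLB vendor's DSR guide for the exact settings they require):

netsh interface ipv4 add address "Loopback" 10.1.2.100 255.255.255.255

netsh interface ipv4 set interface "Loopback" weakhostreceive=enabled weakhostsend=enabled

netsh interface ipv4 set interface "Ethernet" weakhostreceive=enabled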

Adding the IP Address Variable

If you are deploying an HA DCE infrastructure on servers that are also running DRE, additional configuration is required to correct certain undesired network behavior. Refer to the Equitrac Print Release High Availability Setup Guide for combined HA DRE/DCE configurations.


Distributing DCE Load and Scaling

Distributing the Load

The network load balancer (NLB) and DCE cache work together to distribute the load among multiple nodes in the DCE high availability cluster. In an NLB solution, several servers are configured identically and the load balancer distributes service requests across each of the servers in the cluster.

NLB is an effective way of increasing the availability of critical applications. When server failures are detected, the traffic is automatically distributed to server nodes that are still running. It is important to ensure that the remaining nodes in the cluster can handle the load. Not only does load balancing lead to high availability, it also facilitates incremental scalability by gradually adding components to an existing cluster to meet increasing performance demands. Network load balancing facilitates higher levels of fault tolerance within service applications.

In a highly available Equitrac environment, the NLB appliance sits between the MFP and multiple DCE servers, and listens on specific ports for requests from the MFP, and then decides which DCE server to send the requests to. In this setup, the workflow is uninterrupted by having multiple DCEs distributed across different servers connected to an NLB to distribute the workload. The NLB delivers traffic by combining this network information with a load balancing algorithm such as round-robin or least connections.

Couchbase is used in an Equitrac HA environment to distribute the DCE cache, and is installed on multiple servers to ensure high availability. The Couchbase cluster is a collection of multiple Couchbase nodes connected together to provide a single, distributed data store. The Couchbase data store shares out the running information needed for any DCE node to handle the requests from an embedded client. Couchbase does not need to be installed on the same node as DCE, but rather it should be installed on separate nodes so that DCE nodes can be brought online or taken offline without affecting the data distribution process. The DCE cluster can continue operating through a Couchbase node failover, as the data is distributed across the entire cluster via Couchbase.

Node Scaling

Couchbase clustering is designed with a distributed architecture for performance, scalability, and availability. The clusters operate in a horizontal fashion, and there are no hierarchical structures involved. To increase the size of a cluster, you can simply add another node to further distribute the load. This means that Couchbase scales linearly in terms of storage capacity and performance.

It is recommended that a minimum of three Couchbase nodes be provided to service one or more DCE clusters to ensure high availability and automatic failover. Automatic failover is designed to failover only one node at a given time, thus preventing a chain reaction failure of multiple or all nodes in the cluster.

One Couchbase cluster of three nodes or more can service more than one DCE cluster. However, this can only be done if the buckets are configured with unique names. One Couchbase cluster of three nodes or more could support up to 10 DCE clusters, if they are on dedicated nodes.


Example Print Release High Availability Server Deployment.

DCE Cache Limitations

With Equitrac 5.7, the cache of previously logged-in users is not part of the distributed DCE cache. If CAS goes offline, whether a user can log in depends on which DCE node the NLB redirects them to. That is, if the user has not previously been authenticated and their credentials have not been cached locally at the particular DCE node that services the request, they will not be able to log in to that DCE while CAS is offline.

Each DCE node stores certain data locally, and this local cache scales as the number of devices assigned to the DCE cluster increases. Currently, the performance of this local node cache limits the number of devices per DCE cluster to approximately 600 devices, while still allowing a reasonable cache rebuild time on average hardware.


Failure Scenarios

DCE Node Failover

If a DCE cluster node fails over, the network load balancer (NLB) drops the node from the pool list of available DCEs, and the load increases on the remaining DCE nodes. It is important that the other DCEs have the capacity to handle the extra load placed on them by the dropped node.

A failed over node can be temporarily removed and brought back into the cluster, or it can be completely removed from the pool of DCEs in the NLB, and replaced with a new node to redistribute the load across all DCEs. Whether a node is taken offline for maintenance or being replaced with a new node, it is important to consider how it is brought back online.

When a node is taken offline, and then brought back online, the node may become overloaded with connection requests. When an NLB is configured to use the Least Connection method, it selects the service with the least number of active connections to ensure that the load of the active requests is balanced across the services. In the case of a newly-enabled node, the system sends all new connections to that node, since it has the fewest active connections.

To avoid overloading the new node, consider configuring the NLB features designed to delay or slowly ramp-up the distribution of connections to a node that has become available. The amount of traffic is based on the ratio of how long the node has been available compared to the slow ramp time, in seconds.

If your NLB appliance does not support a slow ramp-up feature, an alternative configuration might be to use round-robin and a startup delay on the node. With round-robin, new connections would not all be directed to the new node as they would be if configured to use the least connections method. This option causes the system to send a less-than-normal amount of traffic to a newly-enabled node for the specified amount of time. Once the service has been online for a time greater than the slow ramp time, the node receives a full proportion of the incoming traffic.

The following configuration examples are for an F5 NLB. Consult your NLB vendor for configuration and setup support.
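For instance, on an F5 the ramp-up behavior described above corresponds to the pool's slow-ramp-time property; a two-minute ramp on an illustrative pool name would be set with:

tmsh modify ltm pool dce_pool_2939 slow-ramp-time 120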

Enabling and Disabling a Node

A node must be enabled in order to accept traffic. When a node is disabled, the system allows existing connections to time out or end normally. In this case, the node can accept new connections only if the connections belong to an existing persistence session. Disabling a node differs from a down node, in that a down node allows existing connections to time out, but accepts no new connections.

To enable or disable a DCE node, do the following:

1 On the Main tab, navigate to Local Traffic > Nodes. This displays a list of existing nodes.

2 Select the checkbox for the node you want to enable or disable.

3 Click the Enable or Disable button.
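The same actions can be scripted with tmsh; a sketch, assuming a node object named 10.1.2.11 (the first command disables the node while allowing existing connections to complete, the second re-enables it):

tmsh modify ltm node 10.1.2.11 session user-disabled

tmsh modify ltm node 10.1.2.11 session user-enabled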


Deleting a DCE Node

If a DCE node is no longer used in a pool, it can be deleted.

To delete a node, do the following:

1 On the Main tab, navigate to Local Traffic > Nodes. This displays a list of existing nodes.

2 Select the checkbox for the node you want to delete.

3 Click the Delete button.

Adding a DCE Node

To add a DCE node, do the following:

1 Install the DCE node specifying the same VIP & Couchbase connection string as an existing DCE cluster.

2 Add the DCE node to the pool of DCE nodes on the NLB.

Refer to the Equitrac Office and Express Installation Guide for detailed procedures for installing DCE in an HA environment.
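For step 2, on an F5 the new DCE node could be added to the existing pool along these lines (pool name and member address are illustrative):

tmsh modify ltm pool dce_pool_2939 members add { 10.1.2.14:2939 }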


Running a DCE Health Check

If a DCE cluster node fails over, the embedded clients can reconnect to an alternate DCE node in the cluster and continue the user session. In order for this to happen, the NLB must first detect that the DCE node is no longer in service. This can be accomplished by configuring an NLB health monitor for the DCE service.

The DCE service supports a “DCEHealthCheck” URL for active monitoring of the DCE service by the NLB. The following HTTPS check can be made to each DCE node on port 2939. This determines if DCE can respond to the request in a timely manner. The DCE HealthCheck monitor continually pings the DCE nodes and takes the node offline on failure.

Once the NLB detects a node failure it stops routing new client connection requests to the failing DCE node. Clients with existing connections to the failing node may have to wait for a connection timeout. An alternative is to configure the NLB to reset existing client connections as soon as the failure is detected, causing the client to request a new connection without waiting for a network timeout.

Request string: GET /DCEHealthCheck HTTP/1.1\r\nConnection: close\r\n

Expected response string: HTTP/1.1 200 OK

The following configuration example is for an F5 NLB. Consult your NLB vendor for configuration and setup support.
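A minimal tmsh sketch of such a monitor, using the request and response strings above and attached to an illustrative pool (the interval and timeout values are examples, not Equitrac requirements):

tmsh create ltm monitor https dce_health send "GET /DCEHealthCheck HTTP/1.1\r\nConnection: close\r\n" recv "HTTP/1.1 200 OK" interval 5 timeout 16

tmsh modify ltm pool dce_pool_2939 monitor dce_health

The same URL can be spot-checked manually from any workstation with a tool such as curl, for example: curl -k https://10.1.2.11:2939/DCEHealthCheck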

Couchbase can be configured to provide failover alerts to notify administrators when a Couchbase node has failed.


To set Email Alerts, do the following:

1 Open the Couchbase console, and click the Settings tab.

2 Click the Enable email alerts checkbox.

3 Set the Email Server settings, and the Sender and Recipients email addresses.

4 Select any combination of Available Alerts.

5 Click Save.


Couchbase Node Failover

Failover is the process in which a Couchbase node is removed immediately from the cluster as opposed to a manual removal and rebalancing. There are three types of failovers: graceful, hard and automatic.

• Graceful failover removes a node from the cluster in an orderly and controlled manner, and is initiated from a stable state of the cluster. Graceful failover moves the data from the failed node to the remaining nodes, and it is an online operation with zero downtime. This type of failover is primarily used for planned maintenance of the cluster, for example, during a software or OS upgrade.

• Hard failover removes a Couchbase node quickly from the cluster when it has become unavailable or unstable. Hard failover removes the node and rebuilds the data from the replica copies on the remaining nodes. Hard failover is primarily used when there is an unplanned outage of a node.

• Automatic failover is the built-in ability to have the Cluster Manager detect and determine when a node is unavailable and then initiate a hard failover. Automatic failover is designed to fail over only one Couchbase node at a time, and there must be at least three Couchbase nodes in the cluster for it to happen automatically.
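Graceful and hard failovers can also be initiated from the command line with couchbase-cli. A sketch, assuming a surviving cluster node at 10.1.2.3 and a failing node at 10.1.2.4; the first command performs a graceful failover, while adding --force (--hard on newer Couchbase releases) performs a hard failover:

couchbase-cli failover -c 10.1.2.3:8091 -u Administrator -p <password> --server-failover=10.1.2.4:8091

couchbase-cli failover -c 10.1.2.3:8091 -u Administrator -p <password> --server-failover=10.1.2.4:8091 --force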

Removing a Failed Over Node

A failed over node can be manually removed from the cluster. It is important to ensure that the remaining nodes in the cluster can handle the load. To manually remove a Couchbase node, do the following:

1 Open the Couchbase Console and click the Server Nodes tab.

2 Click the Remove button for the desired node.

3 Click Remove on the Confirm Server Removal warning popup.

4 Click Rebalance. Pending Removal is listed beside the selected node to be removed.

5 When rebalancing is complete, the node has been removed from the list of Active Servers.
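The removal and rebalance can also be done in one step with couchbase-cli; a sketch, assuming the failed over node is 10.1.2.4 and the command is issued against a surviving node at 10.1.2.3:

couchbase-cli rebalance -c 10.1.2.3:8091 -u Administrator -p <password> --server-remove=10.1.2.4:8091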


Recovering a Failed Over Node

A failed over node can be recovered and the cache rebuilt to continue operation. In the process of failing over a node, performing maintenance, adding the node back into the cluster, and rebalancing, the data is recovered via either Delta recovery mode or Full recovery mode.

Delta recovery allows a failed over node to be recovered by re-using the data on its disk and then resynchronizing the data based on the delta change. With delta recovery mode, Couchbase detects which data files are up-to-date and which are out-of-date, and then during rebalance, the existing data files on the failed over server node are retained and the out-of-date files are updated. The node can be re-added to the cluster and incrementally caught up with the data changes. Delta recovery is recommended as the cluster starts functioning with minimal downtime. This operation improves recovery time and network resource usage.

With full recovery mode, the data files are removed from the failed over server node and then, during rebalance, the node is initialized and populated with active and replica buckets. This operation ensures a clean disk, but there is increased downtime due to disk re-population.

To recover a failed over node, open the Couchbase Console, and select the Delta Recovery or Full Recovery option, and then click Rebalance.
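An equivalent couchbase-cli sketch, again assuming the failed over node is 10.1.2.4 (use --recovery-type=full for Full recovery):

couchbase-cli recovery -c 10.1.2.3:8091 -u Administrator -p <password> --server-recovery=10.1.2.4:8091 --recovery-type=delta

couchbase-cli rebalance -c 10.1.2.3:8091 -u Administrator -p <password>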


Couchbase Node Fails Without Auto-Failover

If a Couchbase node is down and unavailable without Auto-Failover, there are different options to bring it back up within the Couchbase Console.

• Select the node that is down, click Rebalance, and wait for the cluster to rebalance.

—Or—

• Wait until the node is reachable and add it back to the cluster, by selecting the Delta Recovery or Full Recovery option, and then click Rebalance.

Automatic Failover

The Cluster Manager continuously monitors node availability and can automatically perform a hard failover of a node if it determines that a node is unavailable.

With automatic failover, the Cluster Manager detects a failed node, verifies that it failed, and then initiates the hard failover process. Automatic failover does not identify the cause or fix the issue with the failed node, it simply detects that the node failed. Once the issue is fixed, the administrator initiates a rebalance to return the cluster to a healthy state.

Automatic failover is not set by default in Couchbase, and must be enabled on the desired server nodes in the cluster. There must be at least three cluster nodes in order to enable automatic failover. Automatic failover is designed to fail over a node only if that node is the only one down at a given time, and it fails over only one node before a fix is required by the cluster administrator. This prevents a chain reaction failure of multiple or all nodes in the cluster.


By default, the Timeout is set to a 120-second delay before a node automatically fails over. The timeout can be increased, or set to as low as 30 seconds; however, it is important to allow enough time for the software to perform multiple checks of that node’s status to confirm that the node is down. The longer minimum timeout helps prevent false positives, such as failover of a functioning but slow node or failover due to temporary network connection issues.

Resetting the Automatic Failover Counter

After a node automatically fails over, Couchbase increments an internal counter to notify the cluster that a node has failed over. The counter prevents the Cluster Manager from automatically failing over additional nodes until you identify and resolve the cause of the failover.

Reset the counter only after the node’s issue is resolved, and the cluster is rebalanced and restored to a fully functioning state. You can reset the counter with the Couchbase Console by clicking on the Reset Quota button.

Once the counter has been reset, another node could fail over; therefore, it is important that the counter is reset only after the cause of the earlier failover has been addressed and the cluster can safely tolerate another automatic failover.
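The counter can also be reset through the Couchbase REST API, for example (cluster address and credentials are illustrative):

curl -u Administrator:<password> -X POST http://10.1.2.3:8091/settings/autoFailover/resetCount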

Automatic Failover Considerations

If the cause of the failure cannot be identified, and the load that will be placed on the remaining systems is not known, an automatic failover might actually cause more problems than it solves. To avoid potential automatic failover issues, proper sizing and capacity planning of the cluster must include the possibilities of failures.

The best practice is to use automatic failover, but the cluster must be sized correctly to handle unexpected node failures and allow replicas to take over and successfully handle the additional load.

To prevent cluster failure, the automatic failover is limited to a single node in the cluster. The best practice is to use automatic failover, but ensure that the cluster has been sized correctly to handle unexpected node failures and allow replicas to take over. Continual capacity planning is critical to a healthy and correct performance of the Couchbase cluster.

Couchbase Protections

Automatic recovery from component failures in a cluster involves some risks. You must be able to identify the cause of failure, and understand the load that will be placed on the remaining system when a failover occurs. This is why proper sizing and capacity planning of the cluster must include the possibilities of failures.

1 One situation that might lead to problems is known as Cascading Failure, where a single failed over node could cause the other nodes to fail over due to improper sizing, resulting in a cascading chain reaction failure of multiple or all nodes in the cluster. For example, you have a five-node Couchbase cluster running at 80% capacity on each node when a single node fails and the software decides to automatically fail over that node. Unless the cluster is properly sized, it is unlikely that the four remaining nodes will be able to successfully handle the additional load, and additional nodes automatically fail over until the entire cluster is down.


2 Another situation that might lead to problems is known as Split Brain, where a network device fails and the network partitions into smaller clusters. To limit the impact of a network partition, automatic failover is restricted to failing over only one node at a given time. This prevents a network partition from causing two or more halves of a cluster to fail each other over, and protects data integrity and consistency.

A minimum of three Couchbase nodes per cluster are required to prevent a two-node cluster from having both nodes fail in the case of a network partition. Partitioning a two-node cluster would create two single-node Couchbase clusters with independently updated caches, compromising data integrity and consistency.

In the event of a network partition, automatic failover occurs only if a single node is partitioned out of the cluster; that node is failed over. If more than one node is partitioned off, automatic failover does not occur. After a failover due to partitioning, the automatic failover counter must be reset. If another node fails before the counter is reset, no automatic failover occurs.

Recommendations

Couchbase Cluster

• Install Couchbase and DCE on separate nodes for dedicated cache.

• Minimum three Couchbase nodes to ensure that high availability is possible, and to distribute the load to reduce the impact of node failure.

• Place indexes on every node to distribute the load and provide failover protection.

• Configure one Couchbase cluster for up to 10 DCE clusters.

• Keep all Couchbase nodes inside the same data center. This solution is not designed for multi-site or multi-CAS failover.

• Each Couchbase node should have a quad-core CPU running at 2 GHz and 4 GB of RAM.

• Couchbase RAM quota on bucket is 0.5 MB per embedded device, or 100 MB per 250 devices.

DCE Cluster

• The number of DCE nodes = number of devices/300 + 1. This is sufficient to protect against cascade failure. For example, 1,200 devices would call for 1200/300 + 1 = 5 DCE nodes.

• Limit a single cluster with three DCE nodes to 600 devices, due to current SQLite cache performance constraints on individual nodes; this keeps the cache rebuild time within an acceptable downtime window.

• Set up an NLB health check and configure the NLB to reset client connections to the failing DCE node when "service down" is detected. This allows the NLB to minimize subsequent login failures and timeouts, and lets in-session embedded clients recover quickly without waiting for connection timeouts.

• Configure the NLB for a delay on node 'UP' and/or a slow ramp: a two-minute delay per 600 devices, and a ramp-up time of two minutes times the number of nodes in the DCE cluster.