
The HAS Architecture: A Highly Available and

Scalable Cluster Architecture for Web Servers

Ibrahim Haddad

A Thesis in the Department of

Computer Science and Software Engineering

Presented in Partial Fulfillment of the Requirements

For the Degree of Doctor of Philosophy at

Concordia University

Montréal, Québec, Canada

March 2006

©Ibrahim Haddad, 2006


CONCORDIA UNIVERSITY SCHOOL OF GRADUATE STUDIES

This is to certify that the thesis prepared
By: _______________________________________________________________________
Entitled: ____________________________________________________________________

_____________________________________________________________________

_____________________________________________________________________
and submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY (Computer Science and Software Engineering)

complies with the regulations of the University and meets the accepted standards with respect to originality and quality. Signed by the final examining committee:

__________________________________________ Chair

__________________________________________ External Examiner

__________________________________________ External to Program
__________________________________________ Examiner

__________________________________________ Examiner

__________________________________________ Thesis Supervisor

Approved by

_________________________________________ Chair of Department or Graduate Program Director

____________2006 _______________________
Dr. Nabil Esmail, Dean
Faculty of Engineering and Computer Science


Abstract

The HAS Architecture: A Highly Available and Scalable Cluster Architecture for Web Servers

Ibrahim Haddad, Ph.D.

Concordia University, 2006

This dissertation proposes a novel architecture, called the HAS architecture, for scalable and highly available web server clusters. The prototype of the Highly Available and Scalable Web Server Architecture was validated for scalability and high availability. It provides non-stop service and maintains a baseline performance of approximately 1,000 requests per second per processor for up to 16 traffic processors in the cluster, achieving close to linear scalability. The architecture supports dynamic traffic distribution using a lightweight distribution scheme, and supports connection synchronization to ensure that web connections survive software or hardware failures. Furthermore, the architecture supports different redundancy models and high availability capabilities, such as Ethernet and NFS redundancy, that contribute to increasing the availability of the service and to eliminating single points of failure.

This dissertation presents current methods for scaling web servers, discusses their limitations, and investigates how clustering technologies can help overcome some of these challenges and enable the design of scalable web servers based on a cluster of workstations. It examines various ongoing research projects in academia and industry that are investigating scalable and highly available architectures for web servers. It discusses their scope and architecture, provides a critical analysis of their work, and presents their advantages, drawbacks, and contributions to this dissertation.

The proposed Highly Available and Scalable Web Server Architecture builds on current knowledge and provides contributions in areas such as scalability, availability, performance, traffic distribution, and cluster representation.


Acknowledgments

The work that has gone into this thesis has been thoroughly enjoyable, largely because of the interaction I have had with my supervisors and colleagues. I would like to express my gratitude to my supervisor, Professor Greg Butler, whose expertise, understanding, and patience added considerably to my graduate experience. I appreciate his vast knowledge and skills in many areas, and his encouragement, which provided me with much support, guidance, and constructive criticism.

I would like to thank the other members of my committee, Professor J. William Atwood, Dr. Ferhat Khendek, and Professor Thiruvengadam Radhakrishnan, for the assistance they provided at all levels of the project. The feedback I received from the members of my committee, as early as my doctoral proposal, was very important and influenced the direction of the work.

I would also like to acknowledge the support I received from Ericsson Research, which granted me unlimited access to its remarkable research lab in Montréal, Canada.

I would also like to thank and express my gratitude to my wife, parents, brother, and sister for their love, encouragement, and support.

Ibrahim Haddad

March 2006


Table of Contents

Abstract .... iii
Acknowledgments .... iv
Table of Contents .... v
List of Figures .... viii
List of Tables .... xi
Chapter 1 Introduction and Motivation .... 1
1.1 Internet and Web Servers .... 1
1.2 The Need for Scalability .... 2
1.3 Web Servers Overview .... 3
1.4 Properties of Internet and Web Applications .... 8
1.5 Study Objectives .... 10
1.6 Scope of the Study .... 11
1.7 Thesis Contributions .... 13
1.8 Dissertation Roadmap .... 14
Chapter 2 Background and Related Work .... 16
2.1 Cluster Computing .... 16
2.2 SMP versus Clusters .... 22
2.3 Cluster Software Components .... 23
2.4 Cluster Hardware Components .... 23
2.5 Benefits of Clustering Technologies .... 23
2.6 The OSI Layer Clustering Techniques .... 26
2.7 Clustering Web Servers .... 32
2.8 Scalability in Internet and Web Servers .... 36
2.9 Overview of Related Work .... 43
2.10 Related Work: In-depth Examination .... 46
Chapter 3 Preparatory Work .... 65
3.1 Early Work .... 65
3.2 Description of the Prototyped Web Cluster .... 65
3.3 Benchmarking Environment .... 67
3.4 Web Server Performance .... 69
3.5 LVS Traffic Distribution Methods .... 70
3.6 Benchmarking Scenarios .... 74
3.7 Apache Performance Test Results .... 74
3.8 Tomcat Performance Test Results .... 79
3.9 Scalability Results .... 81
3.10 Discussion .... 83
3.11 Contributions of the Preparatory Work .... 84
Chapter 4 The Architecture of the Highly Available and Scalable Web Server Cluster .... 85
4.1 Architectural Requirements .... 85
4.2 Overview of the Challenges .... 87
4.3 The HAS Architecture .... 88
4.4 HAS Architecture Components .... 91
4.5 HAS Architecture Tiers .... 94
4.6 Characteristics of the HAS Cluster Architecture .... 96
4.7 Availability and Single Points of Failures .... 99
4.8 Overview of Redundancy Models .... 102
4.9 HA Tier Redundancy Models .... 103
4.10 SSA Tier Redundancy Models .... 107
4.11 Storage Tier Redundancy Models .... 109
4.12 Redundancy Model Choices .... 109
4.13 The States of a HAS Cluster Node .... 111
4.14 Example Deployment of a HAS Cluster .... 113
4.15 The Physical View of the HAS Architecture .... 116
4.16 The Physical Storage Model of the HAS Architecture .... 118
4.17 Types and Characteristics of the HAS Cluster Nodes .... 123
4.18 Local Network Access .... 126
4.19 Master Nodes Heartbeat .... 127
4.20 Traffic Nodes Heartbeat using the LDirectord Module .... 128
4.21 CVIP: A Cluster Virtual IP Interface for the HAS Architecture .... 130
4.22 Connection Synchronization .... 136
4.23 Traffic Management .... 140
4.24 Access to External Networks and the Internet .... 149
4.25 Ethernet Redundancy .... 150
4.26 Dependencies and Interactions between Software Components .... 151
4.27 Scenario View of the Architecture .... 155
4.28 Network Configuration with IPv6 .... 172
Chapter 5 Architecture Validation .... 176
5.1 Introduction .... 176
5.2 Validation of Performance and Scalability .... 176
5.3 The Benchmarked HAS Architecture Configurations .... 178
5.4 Test-0: Experiments with One Standalone Traffic Node .... 180
5.5 Test-1: Experiments with a 4-nodes HAS Cluster .... 183
5.6 Test-2: Experiments with a 6-nodes HAS Cluster .... 186
5.7 Test-3: Experiments with a 10-nodes HAS Cluster .... 188
5.8 Test-4: Experiments with an 18-nodes HAS Cluster .... 191
5.9 Scalability Charts .... 192
5.10 Validation of High Availability .... 194
5.11 HA-OSCAR Architecture: Modeling and Availability Prediction .... 199
5.12 Impact of the HAS Architecture on Open Source .... 204
5.13 HA-OSCAR versus Beowulf Architecture .... 205
5.14 The HA-OSCAR Architecture versus the HAS Architecture .... 207
5.15 HAS Architecture Impact on Industry .... 210
Chapter 6 Contributions, Future Work, and Conclusion .... 212
6.1 Contributions .... 212
6.2 Future Work .... 220
6.3 Conclusion .... 226
Bibliography .... 228
Glossary .... 241


List of Figures

Figure 1: Web server components .... 4
Figure 2: Request handling inside a web server .... 5
Figure 3: Analysis of a web request .... 5
Figure 4: The SMP architecture .... 17
Figure 5: The MPP architecture .... 18
Figure 6: Generic cluster architecture .... 19
Figure 7: Cluster architectures with and without shared disks .... 19
Figure 8: A cluster node stack .... 20
Figure 9: The L4/2 clustering model .... 26
Figure 10: Traffic flow in an L4/2 based cluster .... 27
Figure 11: The L4/3 clustering model .... 28
Figure 12: The traffic flow in an L4/3 based cluster .... 29
Figure 13: The process of content-based dispatching – L7 clustering model .... 30
Figure 14: A web server cluster .... 32
Figure 15: Using a router to hide the web cluster .... 33
Figure 16: Hierarchical redirection-based web server architecture .... 47
Figure 17: Redirection mechanism for HTTP requests .... 48
Figure 18: The web farm architecture with the dispatcher as the central component .... 51
Figure 19: The SWEB architecture .... 53
Figure 20: The functional modules of a SWEB scheduler in a single processor .... 54
Figure 21: The LSMAC implementation .... 56
Figure 22: The LSNAT implementation .... 56
Figure 23: The architecture of the IP sprayer .... 58
Figure 24: The architecture with the HACC smart router .... 58
Figure 25: The two-tier server architecture .... 61
Figure 26: The flow of the web server router .... 62
Figure 27: The architecture of the prototyped web cluster .... 66
Figure 28: The architecture of the WebBench benchmarking tool .... 68
Figure 29: The architecture of the LVS NAT method .... 71
Figure 30: The architecture of the LVS DR method .... 72
Figure 31: Benchmarking results of NAT versus DR .... 73
Figure 32: Benchmarking results of the Apache web server running on a single processor .... 75
Figure 33: Apache reaching a peak of 5,903 KB/s before the Ethernet driver crashes .... 75
Figure 34: Benchmarking results of Apache on one processor – post Ethernet driver update .... 76
Figure 35: Results of a two-processor cluster (requests per second) .... 77
Figure 36: Results of a four-processor cluster (requests per second) .... 77
Figure 37: Results of an eight-processor cluster (requests per second) .... 78
Figure 38: Results of Tomcat running on two processors (requests per second) .... 79
Figure 39: Results of a four-processor cluster running Tomcat (requests per second) .... 80
Figure 40: Results of an eight-processor cluster running Tomcat (requests per second) .... 80
Figure 41: Scalability chart for clusters consisting of up to 12 nodes running Apache .... 82
Figure 42: Scalability chart for clusters consisting of up to 12 nodes running Tomcat .... 82
Figure 43: The HAS architecture .... 90
Figure 44: Built-in redundancy at different layers of the HAS architecture .... 101
Figure 45: The process of the network adapter swap .... 102
Figure 47: The 1+1 active/standby redundancy model .... 104
Figure 48: Illustration of the failure of the active node .... 104
Figure 49: The 1+1 active/active redundancy model .... 106
Figure 50: The N+M and N-way redundancy models .... 107
Figure 51: The N+M redundancy model with support for state replication .... 108
Figure 52: The N+M redundancy model, after the failure of an active node .... 108
Figure 53: The redundancy models at the physical level of the HAS architecture .... 109
Figure 54: The state diagram of the state of a HAS cluster node .... 112
Figure 55: The state diagram including the standby state .... 113
Figure 56: A HAS cluster using the HA NFS implementation .... 114
Figure 57: The HA-OSCAR prototype with dual active/standby head nodes .... 114
Figure 58: The physical view of the HAS architecture .... 117
Figure 59: The no-shared storage model .... 119
Figure 60: The HAS storage model using a distributed file system .... 120
Figure 61: The NFS server redundancy mechanism .... 120
Figure 62: DRDB disk replication for two nodes in the 1+1 active/standby redundancy model .... 122
Figure 63: A HAS cluster with two specialized storage nodes .... 123
Figure 64: The master node stack .... 124
Figure 65: The traffic node stack .... 125
Figure 66: The redundant LAN connections within the HAS architecture .... 126
Figure 67: The topology of the heartbeat Ethernet broadcast .... 128
Figure 68: The CVIP generic configuration .... 131
Figure 69: Level of distribution .... 132
Figure 70: Network termination concept .... 133
Figure 71: The CVIP framework .... 134
Figure 72: Step 1 - Connection Synchronization .... 138
Figure 73: Step 2 - Connection Synchronization .... 138
Figure 74: Step 3 - Connection Synchronization .... 139
Figure 75: Step 4 - Connection Synchronization .... 139
Figure 76: Peer-to-peer approach .... 140
Figure 77: The CPU information available in /proc/cpuinfo .... 143
Figure 78: The memory information available in /proc/meminfo .... 144
Figure 79: Example list of traffic nodes and their load index .... 146
Figure 80: Illustration of the interaction between the traffic client and the traffic manager .... 147
Figure 81: The direct routing approach – traffic nodes reply directly to web clients .... 149
Figure 82: The restricted access approach – traffic nodes reply to master nodes, who in turn reply to the web clients .... 150
Figure 83: The dependencies and interconnections of the HAS architecture system software .... 152
Figure 84: The sequence diagram of a successful request with one active master node .... 157
Figure 85: The sequence diagram of a successful request with two active master nodes .... 158
Figure 86: A traffic node reporting its load index to the traffic manager .... 159
Figure 87: A traffic node joining the HAS cluster .... 160
Figure 88: The boot process of a diskless node .... 161
Figure 89: The boot process of a traffic node with disk – no software upgrades are performed .... 162
Figure 90: The process of rebuilding a node with disk .... 163
Figure 91: The process of upgrading the kernel and application server on a traffic node .... 164
Figure 92: The sequence diagram of upgrading the hardware on a master node .... 165
Figure 93: The sequence diagram of a master node becoming unavailable .... 166
Figure 94: The NFS synchronization occurs when a master node becomes unavailable .... 166
Figure 95: The sequence diagram of a traffic node becoming unavailable .... 167
Figure 96: The scenario assumes that node C has lost network connectivity .... 168
Figure 97: The scenario of an Ethernet port becoming unavailable .... 169
Figure 98: The sequence diagram of a traffic node leaving the HAS cluster .... 169
Figure 99: The LDirectord restarting an application process .... 171
Figure 100: The network becomes unavailable .... 172
Figure 101: The sequence diagram of the IPv6 autoconfiguration process .... 173
Figure 102: A functional HAS cluster supporting IPv4 and IPv6 .... 175
Figure 103: A screen capture of the WebBench software showing 379 connected clients .... 177
Figure 104: The network setup inside the benchmarking lab .... 178
Figure 105: The benchmarked HAS cluster configurations showing Test-[1..4] .... 179
Figure 106: The results of benchmarking a standalone processor – transactions per second .... 181
Figure 107: The throughput benchmarking results of a standalone processor .... 182
Figure 108: The number of failed requests per second on a standalone processor .... 182
Figure 109: The number of successful requests per second on a HAS cluster with four nodes .... 184
Figure 110: The throughput results (KB/s) on a HAS cluster with four nodes .... 185
Figure 111: The number of failed requests per second on a HAS cluster with four nodes .... 185
Figure 112: The number of successful requests per second on a HAS cluster with six nodes .... 187
Figure 113: The throughput results (KB/s) on a HAS cluster with six nodes .... 187
Figure 114: The number of failed requests per second on a HAS cluster with six nodes .... 188
Figure 115: The number of successful requests per second on a HAS cluster with 10 nodes .... 190
Figure 116: The throughput results (KB/s) on a HAS cluster with 10 nodes .... 190
Figure 117: The number of successful requests per second on a HAS cluster with 18 nodes .... 191
Figure 118: The throughput results (KB/s) on a HAS cluster with 18 nodes .... 192
Figure 119: The results of benchmarking the HAS architecture prototype .... 193
Figure 120: The scalability chart of the HAS architecture prototype .... 194
Figure 121: The possible connectivity failure points .... 195
Figure 122: The tested setup for data redundancy .... 198
Figure 123: The modeled HA-OSCAR architecture, showing the three sub-models .... 200
Figure 124: A screen shot of the SPNP modeling tool .... 201
Figure 125: System instantaneous availabilities .... 203
Figure 126: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture .... 204
Figure 127: The architecture of a Beowulf cluster .... 205
Figure 128: The architecture of HA-OSCAR .... 207
Figure 129: The CGL cluster architecture based on the HAS architecture .... 210
Figure 130: The contributions of the HAS architecture .... 212
Figure 131: The untested configurations of the HAS architecture .... 221
Figure 133: The architecture logical view with specialized nodes .... 224


List of Tables

Table 1: Classification of clusters by usage and functionality .... 21
Table 2: Characteristics of SMP and cluster systems .... 22
Table 3: Expected service availability per industry type .... 24
Table 4: Advantages and drawbacks of clustering techniques operating at the OSI layer .... 31
Table 5: Web performance metrics .... 69
Table 6: The results of benchmarking with Apache .... 78
Table 7: The results of benchmarking with Tomcat .... 81
Table 8: The possible redundancy models per each tier of the HAS architecture .... 110
Table 9: The supported redundancy models per each tier in the HAS architecture prototype .... 111
Table 10: The performance results of one standalone processor running the Apache web server .... 180
Table 11: The results of benchmarking a four-nodes HAS cluster .... 183
Table 12: The results of benchmarking a HAS cluster with six nodes .... 186
Table 13: The results of benchmarking a HAS cluster with 10 nodes .... 189
Table 14: The summary of the benchmarking results of the HAS architecture prototype .... 192
Table 15: Input parameters for the HA-OSCAR model .... 201
Table 16: System availability for different configurations .... 202
Table 17: The changes made to the Linux kernel to support NFS redundancy .... 217


Chapter 1 Introduction and Motivation

1.1 Internet and Web Servers

An Internet server is a server that provides services to users over the Internet using the client/server model. We differentiate between the various flavors of Internet servers by the type of services they offer, the characteristics of their workload, their degree of high availability, security levels, performance, throughput, and response times. A web server is one specialization of Internet servers. A web server is a client/server program that uses the Hypertext Transfer Protocol (HTTP) to serve static and dynamic content to web users. We use web servers as a case study throughout this dissertation.

In recent years, the interest in and the deployment of scalable and highly available Internet servers have increased rapidly because of the wide potential such systems offer. This progress has been made feasible by advances in network, software, and computer technologies. However, there are still many challenges to resolve. Scalability is one of the biggest challenges facing Internet servers that provide interactive services to a large user base, and it is a crucial factor in the success or failure of an online service.

The scalability of an Internet server refers to its ability to retain performance levels when adding additional resources, without requiring architectural or technology changes and without imposing additional overhead. For instance, a web server scales linearly if it continues to be available and functional at consistent speeds as the number of users and requests grows to high numbers. As long as the web server continues to provide consistent performance in the face of rising demand, it is scalable. For instance, a web server can serve up to 1,000 requests per second with a single processor (Section 3.7). If we double the number of processors, then each processor in the cluster should theoretically be able to maintain 1,000 requests per second, and the cluster should serve 2,000 requests per second, achieving linear scalability.

The common strategy for measuring server scalability is to measure throughput as the number of users or the volume of traffic increases and to identify important trends. For instance, we measure the throughput of the server with 100 concurrent transactions, then with 1,000, and then with 10,000 transactions. We then examine how throughput changes and observe how it compares with linear scalability. This comparison gives us a measure of the scalability of the architecture.
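To make the comparison concrete, the short Python sketch below (not part of the thesis) computes a scaling-efficiency ratio from measured throughput, assuming the roughly 1,000 requests per second single-processor baseline reported in Section 3.7; a value of 1.0 corresponds to perfectly linear scalability.

    BASELINE_RPS = 1000  # assumed single-processor rate, per Section 3.7

    def scaling_efficiency(processors, measured_rps, baseline_rps=BASELINE_RPS):
        """Ratio of measured cluster throughput to the ideal linear throughput."""
        ideal_rps = processors * baseline_rps
        return measured_rps / ideal_rps

    # Example: a 2-processor cluster measured at 1,950 requests per second
    # achieves 97.5% of ideal linear scalability.
    print(scaling_efficiency(2, 1950))  # 0.975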


To measure scalability, we need to benchmark and measure the performance of the web server. Benchmarks operate on a real system or a working prototype. A benchmark is a publicly defined procedure designed to evaluate the performance of a web server system using a well-defined and standardized workload model. A web server benchmark generates a controlled stream of web requests that aims to reproduce the characteristics of real traffic patterns as accurately as possible, and reports the results using standard metrics.

The main goals of benchmarking are: to measure the performance and scalability of request dispatching algorithms and request routing mechanisms, to assess the impact of changes in the information system such as the system architecture and the distribution of content and services, and to help tune the system configuration. Furthermore, benchmarking helps evaluate the system capacity and response time with respect to an existing, expected, or standardized workload.

The common metrics for web server performance include throughput, connection rate, request rate, reply rate, error rate, connect time, latency time, and transfer time. Section 3.4 discusses the metrics for web server performance.
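As an illustration only, the following Python sketch derives several of these metrics from a list of timed request outcomes; the record layout (connect time, bytes transferred, success flag) is an assumption made for the example, not a format used by any particular benchmarking tool.

    def summarize(samples, duration_s):
        """Derive common web server metrics from timed request outcomes.

        Each sample is assumed to be a tuple (connect_time_s, bytes_transferred, succeeded).
        """
        replies = [s for s in samples if s[2]]
        return {
            "request rate (req/s)": len(samples) / duration_s,
            "reply rate (replies/s)": len(replies) / duration_s,
            "error rate": (len(samples) - len(replies)) / len(samples),
            "throughput (KB/s)": sum(s[1] for s in replies) / 1024 / duration_s,
            "mean connect time (s)": sum(s[0] for s in replies) / max(len(replies), 1),
        }

    # Three requests observed over one second: two replies of 15 KB each, one failure.
    print(summarize([(0.02, 15360, True), (0.03, 15360, True), (0.50, 0, False)], 1.0))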

1.2 The Need for Scalability

The explosive growth of the Internet in the last few years has given rise to a vast range of new online services. Current Internet services span a diverse range of categories that require significant computational and I/O resources to process each request. Furthermore, exponential growth of the Internet population is placing unprecedented demands upon the scalability and robustness of these services [1][2]. Yahoo!, for instance, receives over 1.2 billion page views a day [3], while AOL's web caches service over 10 billion hits daily [4].

Internet services have become critical both for driving large businesses and for personal productivity. Global enterprises are increasingly dependent upon Internet-based applications for e-commerce, supply chain management, human resources, and financial accounting, while many individuals consider e-mail and web access to be indispensable. This growing dependence upon Internet services underscores the importance of their availability, scalability, and ability to handle large loads. Popular web sites such as eBay [5], Excite [6], and E*Trade [7] have, at times, experienced costly and high-profile outages during periods of high load. As more people rely upon the Internet for managing financial accounts, paying bills, and potentially even voting in elections, it is increasingly important that these services are available at all times, perform well under high demand, and are robust enough to accommodate rapid changes in load. Furthermore, the variations in load experienced by web servers intensify the challenges of building scalable and highly available web servers. It is not uncommon to experience more than 100-fold increases in demand when a web site becomes popular [8].

When the terrorist attacks on New York City and Washington DC occurred on September 11, 2001, Internet news services reached unprecedented levels of demand. CNN.com, for instance, experienced a two-and-a-half hour outage with load exceeding 20 times the expected peak [8]. Although the site team managed to grow the server farm by a factor of five by borrowing machines from other sites, this arrangement was not sufficient to deliver adequate service during the load spike. CNN.com came back online only after replacing the front page with a text-only summary in order to reduce the load [9]. Web sites are also subject to sophisticated denial-of-service attacks, often launched simultaneously from thousands of servers, which can knock a service out of commission. Denial-of-service attacks have had a major impact on the performance of sites such as Yahoo! and whitehouse.gov [10]. The number of concurrent sessions and hits per day to Internet sites translates into a large number of I/O and network requests, placing enormous demands on underlying resources.

1.3 Web Servers Overview

Web servers are a specialization of servers providing services over the Internet. The following subsections describe the functions of a web server, present the HTTP [11] request and reply cycle, and discuss protocol performance issues.

1.3.1 Web Server Definition

A web server, sometimes called an HTTP server, is a program that responds to incoming TCP connections and provides a service to the requester. Its primary purpose is to provide data using HTTP [11]. There are many variants of web servers for different operating system platforms.

Figure 1 illustrates the three core components of a web server: the web server software, a computer system with a connection to the Internet, and the information or documents that are available for serving. In this dissertation, the term web server refers to the whole entity: the computer platform, the server software, and the documents.


Figure 1: Web server components

1.3.2 Functions of a Web Server

The primary function of a web server is to service requests made over HTTP. The web server receives a request asking for a specific resource, a document for instance, and then returns the resource as a response to the web client.

A web server consists of several internal components. Each component is responsible for imparting certain functionality to the whole system. These components work in tandem, as well as with other external components. The web server software is composed of a stream manager, an HTTP server, and a path resolver. These internal components of the web server software interact with external components, such as the common gateway interface (CGI) [12] and the file system. The file system is the main source of information of a web server; it is where the data resides. The CGI provides an interface through which scripts enable the server to compute request-specific information; it allows web clients to view documents created at run time.
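As a concrete illustration of the CGI interface mentioned above, the hypothetical Python script below emits a document created at run time; the web server would launch it in a separate process and relay its output to the client. It is a sketch for this discussion, not code from the thesis.

    #!/usr/bin/env python3
    # Hypothetical CGI script: prints the required header, a blank line,
    # and a document generated at the moment of the request.
    import datetime

    print("Content-Type: text/html")
    print()
    print(f"<html><body>Page generated at {datetime.datetime.now()}</body></html>")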

The data flow in the server, from the time a request enters the server until the server sends the output back to the client, follows a cycle called the data flow cycle. Figure 2 illustrates how a web server handles an incoming web client request. The stream manager is the first contact point between a web client and the web server. When a client sends a request to a server (1), it arrives at the stream manager. A client might reference a file in its request, in which case the web server returns that file to the client. A client can also request a program, a CGI script for instance, in which case the web server launches that program and returns the resulting output to the client. The stream manager decodes the request (2) and pushes it to the HTTP server module. The HTTP server is the layer between the server and the underlying operating system. It provides the necessary functionality to gain access to the file system, either directly through system calls or through the CGI. The HTTP server decodes the path of the request from the uniform resource locator (URL) of the request (3) using the path resolver. Next, the HTTP server authenticates the user using the access list, which stores all the authorized users. On successful authentication (4), the HTTP server accesses the file system (5). Depending on the request type, the web server either retrieves a resource from the file system or executes a new program in a separate process. In both cases, the result of the operation is written to an output stream and passed on to the client via the stream manager (6).
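The following Python sketch mirrors the six numbered steps of this data flow cycle in simplified form; the component names follow Figure 2, while the document root, the access list, and the function signatures are assumptions made for the example.

    import os

    DOC_ROOT = "/var/www/html"     # assumed document root
    ACCESS_LIST = {"public"}       # assumed set of authorized users

    def decode(raw_request):
        # (2) the stream manager decodes the request line, e.g. "GET /index.html HTTP/1.0"
        method, url, _ = raw_request.split(" ", 2)
        return {"method": method, "url": url}

    def resolve_path(url):
        # (3) the path resolver maps the URL onto the native file system
        return os.path.join(DOC_ROOT, url.lstrip("/"))

    def handle_request(raw_request, user="public"):
        request = decode(raw_request)            # (1)-(2) request received and decoded
        path = resolve_path(request["url"])      # (3) path resolution
        if user not in ACCESS_LIST:              # (4) authentication against the access list
            return b"403 Forbidden"
        with open(path, "rb") as f:              # (5) file system access (or CGI execution)
            body = f.read()
        return body                              # (6) result returned via the stream manager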

Figure 2: Request handling inside a web server

Figure 3: Analysis of a web request

Figure 3 illustrates the two main phases that a web request goes through from outside the web server: the lookup phase includes steps (1), (2), and (3), and the request phase includes steps (4) and (5).


When the user requests a web site from the browser in the form of a URL, the request arrives (1) at the local domain name system (DNS) server, which consults (2) the authoritative DNS server responsible for the requested web site. The local DNS server then sends back (3) to the client the IP address of the web server hosting the requested web site. The client requests (4) the document from the web server using the web server's IP address, and the web server responds (5) to the client with the requested web document.
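A minimal sketch of the two phases, using only the Python standard library, is shown below; www.website.com is the placeholder host name from Figure 3, and error handling is omitted.

    import socket
    from urllib.request import urlopen

    host = "www.website.com"                     # placeholder host name from Figure 3
    ip_address = socket.gethostbyname(host)      # lookup phase: steps (1)-(3)
    reply = urlopen(f"http://{ip_address}/")     # request phase: steps (4)-(5)
    print(ip_address, len(reply.read()))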

Web servers should cope with numerous incoming requests using minimal system resources. They have to be multitasking to deal with more than one request at a time. They provide mechanisms to control access authorization and to ensure that the incoming requests are not a threat to the host system on which the web server software runs. In addition, web servers respond to error messages they receive, negotiate a style and a language of response with the client, and in some cases run as proxy servers. Web servers generate logs of all connections for statistics and security reasons.

1.3.3 The HTTP Request and Reply Cycle

Web servers follow the client/server model, where a client sends a request to the server and the server replies to the client. This cycle is called the HTTP request and reply cycle. This section illustrates the steps that take place in an HTTP request and response cycle. The web server, running as a background process, listens on a specific port for requests addressed to it; the default port on which web servers listen for incoming connections is port 80. An HTTP request arrives from a client at the web server. The request includes the information needed by the web server to decide what to do with it. The web server reads the request and parses it to determine how to handle it. The web server then tries to fulfill the request. If the client is requesting a document, the web server locates the document on the disk. Each web server has a repository of data that each request accesses. If there is only one copy of the data, then all concurrently processed requests funnel through the one database of documents. The web server sends a response to the client, which in this case is the requested document. Finally, the web server cleans up by closing open files and terminating the network connection.

This cycle illustrates how a web server handles one request per connection. With the newer version of the protocol, HTTP version 1.1, a persistent connection is set up between the web client and the web server; it stays open and allows the web client to send multiple requests. The operating system completes several parts of the HTTP request and reply cycle, such as creating a new process to handle an incoming request, reading from the network, looking up the requested document, reading the document from disk, and writing the document onto the network.
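The sketch below condenses this cycle into a few lines of Python using a plain TCP socket: it listens for connections (on port 8080 here, to avoid requiring privileges for port 80), parses the request line, locates the document under an assumed document root, sends the reply, and closes the connection. HTTP/1.1 persistent connections, concurrency, and robust error handling are deliberately left out.

    import os
    import socket

    DOC_ROOT = "/var/www/html"   # assumed document repository

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", 8080))      # port 8080 instead of the default port 80
    server.listen(5)

    while True:
        conn, _ = server.accept()                          # wait for an incoming connection
        request = conn.recv(4096).decode(errors="ignore")  # read the request
        parts = request.split(" ")
        path = parts[1] if len(parts) > 1 else "/"         # parse the request line
        target = os.path.join(DOC_ROOT, path.lstrip("/") or "index.html")
        try:
            with open(target, "rb") as f:                  # locate the document on disk
                body = f.read()
            conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n" + body)
        except OSError:
            conn.sendall(b"HTTP/1.0 404 Not Found\r\n\r\n")
        conn.close()                                       # clean up, terminate the connection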

1.3.4 Web Servers Requirements

This section discusses the requirements of a highly available and scalable web server that are relevant to this dissertation, such as minimal response time, fast processing of requests, high availability, and reliability.

- Minimal response time: A crucial factor in the success of a web service is the response time experienced by the web user. The server has to minimize response times to meet the expectations of the users. Research on a wide variety of web systems has demonstrated that users require response times of less than one second when moving from one page to another while browsing the web [13]. Traditional human factors research into response times confirmed the need for response times faster than one second. The general guidelines regarding response times have been about the same for almost thirty years [14][15]:

> 0.1 second is the approximate limit for having the user feel that the system is reacting instantaneously, meaning no special feedback is necessary to display the result.

> 1.0 second is the approximate limit for the user's flow of thought to remain uninterrupted, even if the user notices the delay. Normally, no special feedback is necessary during delays of more than 0.1 second and less than 1.0 second, but the user does lose the feeling of operating directly on the data.

> 10 seconds is the approximate limit for keeping the attention of the user focused on the dialogue. For longer delays, users want to perform other tasks while waiting for the computer to finish, so they expect some feedback indicating when the computer expects to finish. Feedback during the delay is especially important if the response time is likely to be highly variable, since users do not know what to expect.

- Fast processing capability: In a distributed environment, several factors introduce delays that reduce the speed at which a web request is processed and the speed at which traffic is distributed among the servers within the web cluster. Therefore, a fast and lightweight traffic distribution mechanism can help speed up the processing of web requests.

- Reliability and availability: Web servers need to be reliable and highly available. As the number of users and the volume of data handled increase, it becomes difficult to guarantee web server reliability and availability. Therefore, web servers should deploy hardware and software fault-tolerance and redundancy mechanisms to ensure reliability, to prevent single points of failure, and to maintain availability in case of a hardware or software failure.

- Ability to sustain a guaranteed number of connections: This requirement entails that the web server maintain a minimum number of connections per second and process these connections simultaneously. The ability to sustain a guaranteed number of connections, also described as maintaining the base performance, has a direct effect on the total number of requests the web server can process at any point in time.

- High storage capacity: Web servers provide I/O and storage capacity to hold the data and the variety of information they host. In addition, with the increased demand for multimedia data, fast data retrieval is an essential requirement.

- Cost effectiveness: An important requirement governing the future of web servers is their cost effectiveness. Without it, the cost per transaction increases as the number of transactions increases, and, as a result, cost becomes an important factor governing which server architecture and software to use in any given deployment case.

Designing a high performance and scalable web and Internet server is a challenging task. This

dissertation aims to understand what causes scalability problems in the web server cluster and

explores how we can scale a web server cluster. The dissertation focuses on the design of next

generation cluster architecture to meet the requirements discussed above. The architecture needs to be able to scale linearly for up to 16 processors, and to support service availability and reliability. The

architecture will inherently meet other requirements such as better cluster resource utilization and the

ability to handle different types of traffic.

1.4 Properties of Internet and Web Applications

To scale the performance of a distributed web server, system designers need to understand its

structure and purpose, as well as the characteristics of the provided services and data, for these all

affect the architecture and the scaling methods. The three fundamental properties of web applications,

also applicable to web servers, are massive concurrency demands, an increasing trend towards

complex content, and a need to be extremely robust to load. The following subsections discuss each

of these properties.

1.4.1 High Concurrency

The growth in popularity and functionality of Internet and web services has been astounding. While

the world wide web itself is growing in size, with recent estimates anywhere between 1 billion and

2.5 billion unique documents, the number of users on the web is also growing at a staggering rate

[16][17]. In April 2002, Nielsen NetRatings estimated that there are over 422 million Internet users

worldwide [18]. Consequently, Internet and web applications need to support unprecedented

concurrency demands, and these demands are increasing over time.

1.4.2 Dynamic Content

In recent years, the use of dynamic content generation has become more widespread. Processing

requests for dynamic content requires significant amounts of computation and I/O. A

dynamic Internet service involves a range of systems, including web servers, middle-tier application

servers, and back-end databases, to process each request. The processing for each request might

include encryption and decryption, server side scripting, database access, or access to legacy systems

such as mainframes.

Such dynamic services require more resources. The content generated by a dynamic web service is

often not amenable to caching, so each request demands a large amount of server resources. In

addition, the resource demands for a given service are difficult to predict. Apart from placing new

demands on networks, this growth in content requires that Internet and web services be responsible

for dedicating resources to store and serve vast amounts of data.

1.4.3 Robustness to Load

The demand for Internet services can spike rapidly and unexpectedly, causing the peak load to

reach a much higher magnitude than the average load [5][6][7][8]. Given the vast user population on the

Internet, a site can suddenly become flooded with a surge of requests that far exceed its ability to

deliver service. A common approach to dealing with heavy load is to over-provision resources.

However, over provisioning is not feasible when the ratio of peak to average load is very high; it may

not be practical to add 10 or 100 times the number of machines needed to support the average load

case. This approach also neglects cost and system administration issues.

Since we cannot expect Internet services to scale to peak demand, it is critical that we design services

that can accommodate high load. When the demand on a service exceeds its capacity, a service should

not overcommit its resources and degrade in a way that makes all clients suffer. Rather, the service needs to

be aware of overload conditions and attempt to adapt to them, by degrading the quality of service

delivered to clients, or by predictably shedding load, such as by giving users some indication that the

service is saturated. It is far better for an overloaded service to inform users of the overload than to

silently drop requests.

1.5 Study Objectives

The goal of this study is to design an architecture for a highly available and scalable web server

cluster. The main motivations to design such an architecture are to increase system capacity, availability,

and scalability, and to increase the rate at which the server accepts and processes incoming

connections. The architecture has to be flexible and modular, capable of linearly scaling as we add

more processors, while maintaining the baseline performance and without affecting the availability of

the service. As a result, as we double the number of processors in the web cluster, we expect to

double the total number of requests per second served by the cluster, while maintaining the level of throughput per processor. The emphasis of the study is therefore on scalability through clustering,

effective resource utilization, and efficient distribution of user requests among the cluster nodes

through traffic differentiation. The highly available web server cluster should also adapt itself

dynamically to different numbers of users and amounts of data, with minimal effect on performance.

The goal of the architecture is to achieve maximum scalability and performance levels for up to 16

processors, with each processor in the cluster handling N requests per second. N refers to the baseline

performance that reaches a threshold of over 1,000 requests per second (Section 3.7).
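
To make the scaling goal concrete, the following short Python sketch (illustrative only; the 1,000 requests per second baseline is the value cited above, while the measured figure in the example is hypothetical) computes the aggregate throughput that perfect linear scaling would imply and a simple scaling-efficiency ratio.

```python
# Illustrative arithmetic for the linear-scaling goal described above.
BASELINE_RPS_PER_PROCESSOR = 1000  # the baseline N cited in the text

def expected_cluster_rps(processors: int) -> int:
    """Aggregate requests/second if scalability were perfectly linear."""
    return processors * BASELINE_RPS_PER_PROCESSOR

def scaling_efficiency(measured_rps: float, processors: int) -> float:
    """Ratio of measured throughput to the ideal linear-scaling throughput."""
    return measured_rps / expected_cluster_rps(processors)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} processors -> ideal {expected_cluster_rps(n):6d} requests/second")
    # A hypothetical 16-processor cluster measured at 15,200 requests/second
    # would have a scaling efficiency of 0.95 (95% of ideal linear scaling).
    print(f"example efficiency: {scaling_efficiency(15200, 16):.2f}")
```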

The limitation is set to 16 processors for two main reasons. First, based on our initial experiments

with web clusters (Chapter 3), we demonstrated that web clusters suffer from scalability problems

starting with small clusters that consist of four processors. Our experimental work confirmed

performance degradation and loss of scalability as we increased the number of nodes in a clustered

web server from four to eight, to ten, and then to twelve nodes. The second reason for choosing 16 processors as the upper limit is based on findings from our literature review (Section 2.10).

Sections 2.9 and 2.10 present and discuss ongoing projects that focus on scalability and performance

of web servers from an architectural point of view, and that experiment with web clusters of eight processors or fewer. In addition, with an upper limit of 16 processors, we can demonstrate

and validate the scalability of the architecture in the lab without resorting to building a theoretical

model and simulating it.

1.6 Scope of the Study

This section defines the scope of the study with respect to architecture, type of servers covered by the

study, types of web site content, architectural model, scalability, and high availability. The scope covers the architecture and its parameters and excludes external factors such as the performance of file

systems, networks, and operating systems.

1.6.1 Goal

The goal of this study is to propose an architecture for scalable and highly available web server

clusters. The architecture provides the following properties: fast access, linear scalability for up to 16

processors, architecture transparency, high availability, and robustness of offered services.

1.6.2 Internet Servers

This dissertation investigates web servers as a specific case of an Internet server. A web server is a program that uses the client/server model and HTTP to serve files and documents to web users.

1.6.3 Web Site Content

For experimental purposes, we limit the web server scalability testing to static web site content, which does not include dynamic transactions or web searches. Despite this self-imposed limitation, we supported dynamic test suites in the benchmarking experiments for the proposed architecture. WebBench [19], the benchmarking tool used in these experiments, includes facilities for dynamic testing.

1.6.4 Architecture and Servers

We aim to reduce architecture complexities by providing a generic architecture that can handle

massive concurrency demands and deal gracefully with large variations in load. The scope of the

proposed architecture is limited to systems that follow the client/server model and run applications

characterized by short transactions, short response times, a thin control path, and static data delivery.

This dissertation does not address multimedia servers, streams, sessions, states, or application servers. It does not address or try to fix problems with networking protocols. Furthermore, high performance computing (HPC) is not in the scope of the study. HPC is a branch of computer science that concentrates on developing software to run on supercomputers. HPC research focuses on developing parallel processing algorithms that divide a large computational task into small pieces that separate processors can execute simultaneously. Architectures in this category focus on

unrelated to the dissertation.

The architecture targets servers providing services over the Internet with the characteristics

previously mentioned. The architecture applies to systems with short response times such as, but not

exclusively, web servers, Authentication Authorization and Accounting (AAA) servers, Policy

servers, Home Location Register (HLR) servers, and Service Control Point (SCP) servers, without requiring specialized extensions at the architectural level.

1.6.5 Scalability

Existing server scaling methods rely on adding more hardware, upgrading processors and memory, or distributing the incoming load and traffic by spreading users or data across several servers. These

schemes, discussed in Chapter 2, suffer from different shortcomings, which can cause uneven load

distribution, create bottlenecks, and obstruct the scalability of a system. These scaling methods are

out of our scope and we do not aim to improve them.

Our goals with the architecture are to achieve scalability through clustering, where we dynamically

direct incoming web requests to the appropriate cluster node, and scale the number of serving nodes

with as little overhead or drop in baseline performance as possible. Therefore, our focus is scalability

while maintaining a high throughput. The cluster needs to be able to distribute the application load

across N cluster nodes with linear or close-to-linear scalability.

1.6.6 High Availability

This work targets the architecture for highly available (HA) web servers. From this perspective, the

scope of this dissertation covers techniques that allow us to embed high availability capabilities at the

architecture level to ensure high levels of availability with the least performance degradation. The

goal of high availability is to maximize service availability for the end users. There are two sub-

categories: HA stateless, with no saved state information, and HA stateful, with state information that allows the web application to maintain sessions across a failover. Our scope focuses on HA stateless web applications, although the same principles can be applied to HA stateful web applications.

1.7 Thesis Contributions

This dissertation proposes a highly available and scalable web server architecture. It combines clustering technologies, high availability mechanisms, and an efficient, dynamic traffic distribution mechanism to handle additional requests and maintain scalability across cluster nodes as the web

traffic increases. We evaluate and validate the architecture through performance, scalability, and HA testing to demonstrate its robustness, its scalability for up to 16 processors, and its availability.

This dissertation offers contributions in the areas of cluster architecture, scalability, traffic

management, high availability, single cluster interface, in addition to contributions to the open source

and the industry.

Architecture for highly available and scalable Web clusters: The architecture consists of multiple

connected server elements that allow the use of server building blocks to achieve high availability,

high performance, and high scalability. Chapter 4 presents the HAS architecture.

Scalability and capacity: We are able to scale the architecture and serve more traffic by adding

processors to the web cluster without affecting the serviceability or the uptime of the provided

services. Chapter 5 presents and discusses the results of benchmarking the prototyped cluster.

Traffic management: We designed and prototyped a dynamic traffic distribution mechanism that

consists of a lightweight load monitoring mechanism that reports the load on the cluster nodes to the

scheduler, and a scheduler that distributes incoming traffic based on the distribution policy.

Furthermore, the traffic management scheme incorporates a keep-alive mechanism, through which the

scheduler is aware when a traffic node becomes unavailable and then stops forwarding traffic to it.

Section 4.23 discusses the traffic management scheme.
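
A minimal, hypothetical sketch of this kind of policy is shown below; it is not the HAS implementation. Each traffic node reports its load, the report doubles as a keep-alive, and the scheduler picks the least-loaded node whose report is recent. The timeout value and node identifiers are assumptions.

```python
import time
from typing import Dict, Optional

KEEPALIVE_TIMEOUT = 3.0  # seconds without a report before a node is skipped (assumed)

class Scheduler:
    def __init__(self) -> None:
        # node_id -> {"load": last reported load, "last_seen": time of last report}
        self.nodes: Dict[str, Dict[str, float]] = {}

    def report(self, node_id: str, load: float) -> None:
        """Record a load report (which also serves as a keep-alive) from a traffic node."""
        self.nodes[node_id] = {"load": load, "last_seen": time.time()}

    def pick_node(self) -> Optional[str]:
        """Return the least-loaded node that is still considered alive."""
        now = time.time()
        alive = {n: s for n, s in self.nodes.items()
                 if now - s["last_seen"] <= KEEPALIVE_TIMEOUT}
        if not alive:
            return None  # no healthy traffic node is available
        return min(alive, key=lambda n: alive[n]["load"])
```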

High Availability: We achieve HA by supporting different redundancy models on all tiers of the

architecture. Sections 4.7, 4.8, and 4.19 discuss our contributions in the area of HA.

A transparent, single cluster IP interface for the cluster: The cluster IP interface is an IP stack

distributed among a number of processors that allows access to a cluster of processors via a single IP

address. It provides a single entry to the cluster by hiding the complexity of the cluster, and provides

address location transparency, to address a resource in the cluster without knowing or specifying the

processor location. Section 4.21 presents this contribution.

Application availability: The architecture provides the capabilities to monitor the health of the

application server running on the traffic nodes, and dynamically exclude the node from the cluster in

the event the application process fails. Section 4.20 discusses these capabilities. Furthermore, with

connection synchronization between the two master nodes, in the event of failure of a master node,

the standby node is able to continue serving the established connections. Section 4.22 discusses

connection synchronization.

Contribution to Open Source: This work has resulted in several contributions to the HA-OSCAR

project [21], whose architecture is based on the HAS architecture. Section 5.12 discusses these

contributions.

Contributions to the industry: The Carrier Grade industry initiative [26] at the Open Source

Development Labs [27] has adopted the HAS architecture as the base standard architecture for carrier

grade clusters running telecommunication applications. Section 5.15 discusses this contribution.

Other contributions include benchmarking current solutions, providing enhancements to their

capabilities and adding functionalities to existing system software, and providing best practices for

building benchmarking environments for large-scale systems.

1.8 Dissertation Roadmap

The dissertation consists of six chapters.

Chapter 1 provides the necessary background on Internet and web servers, discusses scalability

challenges, and presents the objectives and scope of the study.

Chapter 2 focuses on three main topics: clustering technologies, scalability challenges and related

work. The clustering section sets the grounds for all cluster related definitions and approaches to

build clustered web servers. It presents clustering technologies and the benefits resulting from using

these technologies to design and build Internet and web servers. The chapter introduces software and

hardware clustering technologies, their advantages and drawbacks, and discusses our experience

prototyping a highly available, and scalable, clustered web server platform. The chapter describes the

need for scalable servers, discusses how users experience scalability issues, and examines the resulting

performance problems. The last part of this chapter presents our literature review in the area of

scaling Internet and web servers. It presents a survey of academic and industry research projects,

discusses their focus areas, results, and contributions. It also presents the contributions of these

projects to this dissertation to help us achieve our goal of a scalable and highly available web server

platform.

Chapter 3 summarizes the technical preparatory work we completed in the laboratory prior to

designing the Highly Available and Scalable (HAS) architecture. This chapter describes the

prototyped web cluster that uses existing components and mechanisms. It also describes the

benchmarking environment we built specifically to test the performance and scalability of web

clusters and presents the results of the benchmarking tests we conducted on the prototyped cluster.

Chapter 4 focuses on describing and discussing the HAS architecture. It presents the architecture, its

components, and their characteristics. The chapter then discusses the conceptual, physical, and

scenario architecture views, the supported redundancy models, the traffic distribution scheme, and the

dependencies between the various components. It also covers the architecture characteristics as

related to eliminating single points of failure.

Chapter 5 presents the validation of the architecture and illustrates how it scales for up to 16

processors without performance degradation. The validation covers two aspects: scalability and

availability. The chapter presents the results of the benchmarking tests we conducted on the HAS

architecture prototype. It also presents the results of experiments we conducted to test the availability

features in a HAS cluster.

Chapter 6 presents the contributions and future work in the areas of scalability and performance of

Internet and web servers.

Chapter 2 Background and Related Work

2.1 Cluster Computing

Cluster computing has become increasingly popular for a number of reasons, which include the low cost and high performance of commodity-off-the-shelf (COTS) hardware and the availability of rapidly maturing proprietary and open source software components to support high performance and high availability applications. Furthermore, high-speed networking and improved microprocessor performance have positioned networks of workstations as a more appealing vehicle for distributed computing

compared to Symmetric Multiprocessor (SMP) and Massively Parallel Processor (MPP) systems. As a result, clusters built using COTS hardware and software are playing a major role in redefining the concept of supercomputing and are becoming a popular alternative to SMP and MPP systems. However, several key challenges remain in the areas of performance, availability, manageability, and scalability.

A web server is an example of an application that requires more computing power than what a single

computer can provide, and is a candidate to run on a computer cluster. A viable and cost-effective

web cluster solution consists of connecting multiple computers together and coordinating their efforts to serve web traffic. The resulting system is a distributed web server that responds to incoming requests and processes them on multiple processors.

The following sub-sections introduce the three general classes of distributed systems: SMP, MPP, and

clusters, compare their advantages and drawbacks, and identify the model that is more suitable for

hosting web servers.

2.1.1 Symmetric Multiprocessors (SMP)

An SMP machine consists of a tightly coupled series of identical processors operating on a single shared bank of memory; there are no separate memories, input/output (I/O) systems, or operating systems. SMP systems are shared-everything systems, where each processor has access to the shared memory and to all attached devices and peripherals of the system, and can perform I/O operations and interrupt other processors [28][29]. Figure 4 illustrates the architecture of an SMP system. SMP

systems have a master processor at boot time, and then the operating system starts up the second (or

more) processor(s) and manages access to the shared resources among all the processors. A single

copy of the operating system is in charge of all the processors. SMP systems available on the market

(at the time of writing) do not exceed 16 processors, with configurations available in two, four, eight,

and 16 processors.

Figure 4: The SMP architecture

SMP systems are not scalable because all processors have to access the same resources. In addition, SMP systems have a limit on the number of processors they can hold. They require considerable investment in upgrades, and an entire replacement of the system to accommodate a larger capacity.

Furthermore, an SMP system runs a single copy of the operating system, where all processors share

the same copy of the operating system data. If one processor becomes unavailable because of either

hardware or software error, it leaves locks unlocked, data structures in partially updated states, and

potentially, I/O devices in partially initialized states. As a result, the entire system becomes

unavailable on the account of a single processor. In addition, SMP architectures are not highly

available. SMP systems have several single points of failure (cache, memory, processor, bus); if one

subsystem becomes unavailable, it brings the system down and makes the service unavailable to the

end users.

2.1.2 Massively Parallel Processors (MPP)

As we add more processors to SMP systems and more hardware such as crossbar switches, to make

memory access orthogonal, these parallel processors become Massively Parallel Processors. Figure 5

illustrates an MPP system that consists of several processing elements interconnected through a high-

speed interconnection network. Each node has its own processor(s), memory, and runs a separate

copy of the operating system [28]. The key distinction between MPP and SMP systems lies in the

use of fully distributed memory. In an MPP system, each processor is self-contained with its own

cache and memory chips.

Figure 5: The MPP architecture

Another distinct characteristic of MPP systems is the job scheduling subsystem, where job scheduling is achieved through a single run queue. MPP systems tackle one very large computational problem at a time and are used to solve HPC problems. In addition, MPP systems suffer from the same issues as SMP systems in the areas of scalability, single points of failure and their impact on high availability, and the need to shut down the system to perform either software or hardware upgrades.

2.1.3 Computer Clusters

In his book “In Search of Clusters”, Greg Pfister defines a cluster as a parallel or distributed system

consisting of a collection of interconnected whole computers that appear as a single, unified

computing resource [28]. In this dissertation, we consider a cluster as a group of separate computers

that are interconnected, and are used as a single computing entity to provide a service or run an

application for the purposes of scalability, high availability, or high performance computing. Our definition of a cluster is complementary to Pfister’s definition. We both agree that a cluster consists of a number of independent computers that appear as a single compute entity to the end users. End users are not aware that they are interacting with a cluster, nor are they aware of where the application is running.

Figure 6 illustrates the generic cluster architecture, which consists of multiple standalone nodes that

are connected through redundant links and provide a single entry point to the cluster.

Figure 6: Generic cluster architecture

Cluster nodes interconnect in different ways. Figure 7 illustrates two common variations. In the first

variation, Figure 7-A, cluster nodes share a common disk repository; in the second variation, Figure

7-B, cluster nodes do not share common resources and use their own local disk for storage.

Figure 7: Cluster architectures with and without shared disks

The phrase single, unified computing resource in Greg Pfister’s definition of a cluster invokes a wide

variety of possible applications and uses, and is deliberately vague in describing the services provided

by the cluster. At one end of the spectrum, a cluster is nothing more than the collection of whole

computers available for use by a sophisticated distributed application. At the other end, the cluster

creates an environment where existing non-distributed programs can benefit from increased

[Figure 7 panel titles: Figure 7-A: Shared Storage; Figure 7-B: Individual Node Disks]

availability because of cluster-wide fault masking, and increased performance because of the

increased computing capacity.

A cluster is a group of independent COTS servers interconnected through a network. The servers,

called cluster nodes, appear as a single system, and they share access to cluster resources such as shared disks, network file systems, and the network. A network interconnects all the nodes in a

cluster and it is separate from the cluster’s external environment such as the local Intranet or the

Internet. The interconnection network employs local area network or systems area network

technology.

Clusters can be highly available because of the built-in redundancy that prevents the presence of a single point of failure (SPOF). As a result, failures are contained within a single node. Monitoring software continually runs checks, by sending signals also called heartbeats, to ensure that the cluster node and the application running on it are up and available. If these signals stop, then the system

software initiates the failover to recover from the failure. The presumably dead or unavailable system

or application is then isolated from I/O access, disks, and other resources such as access to the

network; furthermore, incoming traffic is redirected to other available nodes within the cluster. As for

performance, clusters make it possible to add nodes and scale up the performance, capacity, and throughput of the cluster as the number of users or the amount of traffic increases.
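
The heartbeat idea can be sketched in a few lines of Python; the sketch below is illustrative only and is not tied to any particular cluster product, and the heartbeat interval and miss threshold are assumptions. The failover logic that isolates a failed node and redirects its traffic is not shown.

```python
import time
from typing import Dict, List

HEARTBEAT_INTERVAL = 1.0     # seconds between expected heartbeats (assumed)
MISSED_BEFORE_FAILOVER = 3   # tolerated consecutive missed heartbeats (assumed)

class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_beat: Dict[str, float] = {}

    def beat(self, node: str) -> None:
        """Record a heartbeat received from a cluster node or its application."""
        self.last_beat[node] = time.time()

    def failed_nodes(self) -> List[str]:
        """Nodes whose heartbeats have stopped long enough to warrant failover."""
        deadline = HEARTBEAT_INTERVAL * MISSED_BEFORE_FAILOVER
        now = time.time()
        return [n for n, t in self.last_beat.items() if now - t > deadline]
```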

Figure 8: A cluster node stack

Figure 8 illustrates at a high level a cluster node stack. The stack consists of the processor, the

interconnect technology and protocol, the operating system, the middleware or system software, and

the application servers. At the lowest level is the node hardware. One level up from the Node level is

the Interconnect Technology (such as Ethernet), followed by the Interconnect Protocol. Up one level

from the Interconnect Protocol is the Operating System, followed by the System Software, which

provides all of the support functionalities. Finally at the top level is the Applications Layer. We

utilize clusters in many modes including but not limited to high performance computing, high

capacity or throughput, scalability, and high availability. Table 1 presents the different types of

clusters depending on their functionalities and the types of applications they host.

Purpose of clustering: High Performance Computing (HPC)
- Goal: Maximize floating point computation performance.
- Description: Many nodes working together on a single compute-based problem. Performance is measured as the number of floating point operations (FLOP) per second.
- Examples: Beowulf clusters such as MOSIX [30][31], Rocks [32], OSCAR [33][34][35], and Ganglia [36].

Purpose of clustering: Scalability
- Goal: Maximize throughput and performance.
- Description: Many nodes working on similar tasks, distributed in a defined fashion based on system load characteristics. Performance is measured as throughput in terms of KB/s. Adding more nodes to the cluster increases its capacity. Can be network oriented (network throughput) or data oriented (data transactions).
- Examples: The Linux Virtual Server [23] and TurboLinux [37], in addition to commercial database products.

Purpose of clustering: High Availability (HA)
- Goal: Maximize service availability.
- Description: Redundancy and failover for fault tolerance of services. Supports stateless and stateful applications. Availability is measured as a percentage of the time the system is up and providing service; Section 4.7 presents the formula for calculating the availability.
- Examples: The HA-OSCAR project [21], in addition to commercial clustering products.

Purpose of clustering: Server Consolidation
- Goal: Maximize ease of management of multiple computing resources.
- Description: Also called Single System Image (SSI). Provides central management of cluster resources and treats the cluster as a single management unit.
- Examples: The OpenSSI project [38], the OpenGFS project [39], and the Oracle Cluster File System [40].

Table 1: Classification of clusters by usage and functionality

Clustering for scalability (the Scalability entry in Table 1) focuses on distributing web traffic among cluster

nodes using distribution algorithms such as round robin DNS.
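
Round robin distribution, in its simplest software form, can be sketched as follows; the host names are placeholders, and a round robin DNS achieves the same effect by rotating the order of the addresses it returns.

```python
from itertools import cycle

# Successive requests are handed to successive servers in a fixed rotation.
servers = ["node1.example.org", "node2.example.org", "node3.example.org"]
rotation = cycle(servers)

def next_server() -> str:
    """Return the cluster node that should receive the next incoming request."""
    return next(rotation)

# Six consecutive requests cycle through the three nodes twice.
print([next_server() for _ in range(6)])
```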

Clustering for high availability (the High Availability entry in Table 1) relies on redundant servers to ensure that critical

applications remain available if a cluster node fails. There are two methods for failover solutions:

software-based failover solutions discussed in Section 2.7.1.2, and hardware-based failover devices

discussed in Section 2.7.1.1. Software-based failover detects when a server has failed and

automatically redirects new incoming HTTP requests to the cluster members that are available.

Hardware-based failover devices have limited built-in intelligence and require an administrator's

intervention when they detect a failure.

Many of the clustering products available fit into more than one of the above categories. For instance,

some products include both failover and load-balancing components. In addition, SSI products that fit

into the server consolidation category (the Server Consolidation entry in Table 1) provide certain HA failover capabilities.

Our goal with this dissertation is a cluster architecture that targets both scalability and high

availability.

2.2 SMP versus Clusters

Table 2 summarizes the comparisons between SMP and cluster architectures.

SMP
- Scaling: Limited scaling capabilities; requires a complete upgrade of the system.
- High availability: Not highly available; single points of failure in hardware and operating system.
- System management: Single system; single image of the operating system.

Clusters
- Scaling: Virtually unlimited scaling by adding nodes to the cluster.
- High availability: Can be configured to have no SPOF through redundancy of key cluster components; several challenges remain to achieve continuous availability of service.
- System management: Multi-node system; each node runs its own copy of the operating system and application; allows flexibility in configuration.

Table 2: Characteristics of SMP and cluster systems

SMP systems have limited scalability, while clusters have virtually unlimited scaling capabilities

since we can always continue to add more nodes to the cluster. As for high availability, an SMP

system has several single points of failure; a single error can lead to system downtime, in contrast

to a cluster, where functionalities are redundant and spread across multiple cluster nodes. As for

management, an SMP system is a single system, while a cluster is composed of several nodes, some

of which can be SMP machines.

2.3 Cluster Software Components

We classify the software components that comprise the environment of a commodity cluster into four

major categories: the operating system that runs on each of the cluster nodes, application execution

environment such as libraries and debuggers, cluster installation infrastructure, and cluster services

components such as cluster membership, storage, management, traffic distribution and application

services.

The critical software components include cluster membership, storage, fault management, and traffic

distributions services. The cluster membership service includes functions to recognize and manage

the nodes membership in the cluster. The cluster storage service includes the replication and retrieval

of cluster configuration and application data. The fault management service includes functions to

recognize hardware and software faults and recovery mechanisms. The traffic distribution service

includes functions to distribute the incoming traffic across the nodes in the cluster.

2.4 Cluster Hardware Components

The key components that comprise a commodity cluster are the nodes performing the computing

and the dedicated interconnection network providing the data communication among the nodes. A

cluster node is different from an MPP node in that a cluster node is an operational standalone

computing system. A cluster node integrates several key subsystems that include the processor,

memory, storage, external interfaces, and network interfaces.

2.5 Benefits of Clustering Technologies

Computer clusters provide several advantages including high availability, scalability, high

performance compared to single server architectures, rapid response to technology improvements,

manageability of multiple servers as a single server, transparency, and flexibility of configuration.

The following sub-sections present these advantages and benefits of clusters.

2.5.1 High Availability

High availability (HA) refers to the availability of resources in a computer system [41]. We achieve

HA through redundant hardware, specialized software, or both [41][42][43]. With clusters, we can

provide service continuity by isolating or reducing the impact of a failure in the node, resources, or

device through redundancy and failover techniques. Table 3 presents the various levels of HA, the

annual downtime and type of applications for various classes of systems [44].

9's  Availability  Downtime per year       Example Areas for Deployments
1    90.00%        36 days 12 hours        Personal clients
2    99.00%        87 hours 36 minutes     Entry-level businesses
3    99.90%        8 hours 46 minutes      ISPs, mainstream businesses
4    99.99%        52 minutes 33 seconds   Data centers
5    99.999%       5 minutes 15 seconds    Telecom systems, medical, banking
6    99.9999%      31.5 seconds            Military defense, carrier grade routers

Table 3: Expected service availability per industry type
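
The downtime figures in Table 3 follow directly from the availability percentages; the short Python sketch below reproduces them, taking a year as 365.25 days, so small rounding differences with the table are expected.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def annual_downtime_minutes(availability_percent: float) -> float:
    """Expected unavailable minutes per year for a given availability level."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR

for nines, pct in [(3, 99.9), (4, 99.99), (5, 99.999), (6, 99.9999)]:
    print(f"{nines} nines ({pct}%): {annual_downtime_minutes(pct):10.2f} minutes/year")
# Five nines, for example, allows roughly 5.26 minutes of downtime per year.
```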

It is important not only that a service be down for no more than N minutes a year, but also that the length of outages be short enough, and the frequency of outages low enough, that the end user does not perceive them as a problem. Therefore, the goal is to have a small number of failures and a prompt

recovery time. This concept is termed Service Availability, meaning whatever services the user wants

are available in a way that meets the user’s expectations.

2.5.2 Scalability

Clusters provide means to reach high levels of scalability by expanding the capacity of a cluster in

terms of processors, memory, storage, or other resources, to support users and traffic growth [1].

2.5.3 High Performance

We can achieve a better performance characterized by improved processing speed by using clusters

and the cluster-wide resources instead of the resources of a standalone server.

2.5.4 Rapid Response to Technology Improvements

Commodity clusters are most able to track technology improvements and respond rapidly to new

offerings. Clusters benefit from the latest mass-market technology, as it becomes available at low

cost. As new devices including processors, memory, disks, and network interfaces become available

in the market, we can integrate them into cluster nodes allowing clusters to be the first class of

parallel systems to benefit from such advances. Similarly, clusters benefit from the latest operating

system and networking features, as they become available.

2.5.5 Manageability

Clusters require a management layer that allows us to manage all cluster nodes as a single entity [28].

Such cluster management facilities help reduce system management costs. There exist a significant number of cluster management software packages, almost all of them originating from research projects and now adopted by commercial vendors.

2.5.6 Cost Efficient Solutions

Clusters take advantage of COTS hardware, which allows a better price to performance ratio when

compared to a dedicated parallel supercomputer. In addition, the availability of open source operating

systems, system software, development tools, and applications has contributed to the provisioning of

cost effective clusters built with open source software.

2.5.7 Expandability and Upgradeability

We can expand clusters by adding more nodes, disk storage, and memory, as necessary. As hardware,

software, operating system and network upgrades become available, we can upgrade a cluster node

independently of the others; this presents a major advantage over SMP and MPP systems.

2.5.8 Transparency

The SSI layer represents the nodes that make up the cluster as a single server. It allows users to use a

cluster easily and effectively without the knowledge of the underlying system architecture or the

number of nodes inside the cluster. This transparency frees the end-user from having to know where

an application runs.

2.5.9 Flexibility of Configuration

Clusters allow flexibility in configurations that is not available through conventional SMP and MPP

systems. The number of cluster nodes, memory capacity per node, number of processors per node,

and interconnect topology, are all parameters of the cluster structure that may be specified in fine

detail on a per system basis without incurring additional cost. Furthermore, we can modify the cluster

structure or augment it over time as need and opportunity dictates. This expanded control over the

cluster structure not only benefits the end user but the cluster vendor as well, yielding a wide array of

system capabilities and cost tradeoffs to meet customer demands.

2.6 The OSI Layer Clustering Techniques

The following sub-sections present the clustering techniques that operate at OSI layer two (data link

layer), OSI layer three (network layer), and OSI layer seven (application layer).

2.6.1 L4/2 Clustering

The level 4 web switch works at the TCP/IP level. Figure 9 illustrates the L4/2 clustering model. The

same server machine must process the packets pertaining to the same connection, and the web switch

maintains a binding table to associate each session with its assigned server.

Figure 9: The L4/2 clustering model

In L4/2 based clusters, the dispatcher and all the servers in the cluster share the cluster network-layer

address using primary and secondary IP addresses. While the primary address of the dispatcher is the

same as the cluster address, each cluster server is configured with the cluster address as a secondary

address using either interface aliasing or by changing the address of the loopback device on the

cluster servers. The nearest gateway is configured such that all packets arriving for the cluster address

are addressed to the dispatcher at layer two using a static Address Resolution Protocol (ARP) cache

entry. If the packet received corresponds to a TCP/IP connection initiation, the dispatcher selects one

of the servers in the server pool to service the request (Figure 9).

The selection of the server to respond to the incoming request relies on a traffic distribution algorithm

such as round robin. When an incoming request arrives to the dispatcher, the dispatcher creates an

entry in a connection map that includes information such as the origin of the connection and the

chosen cluster server. The layer two destination address is then rewritten to the hardware address of

the chosen cluster server, and the frame is placed back on the network. If the incoming packet is not

for a connection initiation, the dispatcher examines its connection map to determine if it belongs to a

currently established connection. If it does, the dispatcher rewrites the layer two destination address

to be the address of the cluster server previously selected, and forwards the packet to the cluster

server as before. In the event that the received packet does not correspond to an established

connection and is not a connection initiation packet, then the dispatcher drops it.
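
The dispatching decision described above can be modeled with the simplified Python sketch below. Packets are represented as dictionaries rather than real Ethernet frames, and the MAC addresses and round robin selection are placeholders; an actual L4/2 dispatcher rewrites layer-two frames in the kernel or in a switch.

```python
from typing import Dict, List, Optional, Tuple

ConnKey = Tuple[str, int]  # (client IP, client port) identifies a connection

class L42Dispatcher:
    def __init__(self, server_macs: List[str]) -> None:
        self.server_macs = server_macs              # cluster-server MAC addresses
        self.connection_map: Dict[ConnKey, str] = {}
        self._next = 0

    def _select_server(self) -> str:
        """Round robin selection; any traffic distribution algorithm could be used."""
        mac = self.server_macs[self._next % len(self.server_macs)]
        self._next += 1
        return mac

    def dispatch(self, packet: dict) -> Optional[dict]:
        """Rewrite the layer-two destination of a packet, or drop it."""
        key: ConnKey = (packet["src_ip"], packet["src_port"])
        if packet.get("syn"):                       # connection initiation
            self.connection_map[key] = self._select_server()
        if key not in self.connection_map:          # unknown, non-SYN packet
            return None                             # dropped by the dispatcher
        packet["dst_mac"] = self.connection_map[key]
        return packet                               # layer 3 and above unchanged
```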

Figure 10 illustrates the traffic flow in an L4/2 clustered environment [45]. A web client sends an

HTTP packet (1) with A as the destination IP address. The immediate router sends the packet to the

dispatcher at IP address A (2). Based on the traffic distribution algorithm and the session table, the

dispatcher decides which back-end server will handle this packet, server 2 for instance, and sends the

packet to server 2 by changing the MAC address of the packet to server 2's MAC address and

forwarding it (3). Server 2 accepts the packet and replies directly to the web client (4).

Figure 10: Traffic flow in an L4/2 based cluster

L4/2 clustering has a performance advantage over L4/3 clustering because of the downstream bias of

web transactions. Since the network address of the cluster server to which the packet is delivered is

identical to the one the web client used originally in the request packet, the cluster server handling

that connection may respond directly to the client rather than through the dispatcher. As a result, the

dispatcher processes only the incoming data stream, which is a fraction of the entire transaction.

Moreover, the dispatcher does not need to re-compute expensive integrity codes (such as the IP

checksums) in software since only layer two parameters are modified. Therefore, the two parameters

that limit the scalability of the cluster are the network bandwidth and the sustainable request rate of

the dispatcher, which is the only portion of the transaction actually processed by the dispatcher.

One restriction on L4/2 clustering is that the dispatcher must have a direct physical connection to all

network segments that house servers (due to layer two frame addressing). This contrasts with L4/3

clustering (Section 2.6.2), where the server may be anywhere on any network with the sole constraint

that all client-to-server and server-to-client traffic must pass through the dispatcher. In practice, this

restriction on L4/2 clustering has little appreciable impact since servers in a cluster are likely to be

connected via a single high-speed LAN.

Among research and commercial products implementing layer two clustering are the ONE-IP

developed at Bell Laboratories [46], IBM's eNetwork Dispatcher [47], and the LSMAC from the

University of Nebraska-Lincoln (Section 2.10.4).

2.6.2 L4/3 Clustering

In L4/3 based clusters, the dispatcher appears as a single host to the web clients. Figure 11 illustrates

the L4/3 dispatcher that appears as a gateway for the cluster servers. The web client sends traffic to

the address of the web cluster and the traffic arrives to the dispatcher. If the packet received

corresponds to a TCP/IP connection initiation, the dispatcher selects one of the servers in the server

pool to service the request.

Figure 11: The L4/3 clustering model

Similar to L4/2 clustering, the selection of the cluster server relies on a traffic distribution algorithm.

The dispatcher then creates an entry in the connection map noting the origin of the connection, the

chosen server, and other relevant information. However, unlike the L4/2 approach, the dispatcher

rewrites the destination IP address of the packet as the address of the cluster server selected to service

this request. Furthermore, the dispatcher re-calculates any integrity codes affected such as packet

checksums, cyclic redundancy checks, or error correction checks. The dispatcher then sends the

modified packet to the cluster server corresponding to the new destination address of the packet. If the

incoming web client traffic is not a connection initiation, the dispatcher examines its connection map

to determine if it belongs to a currently established connection. If it does, the dispatcher rewrites the

destination address as the server previously selected, re-computes the checksums, and forwards the

packet to the cluster server as we described earlier. In the event that the packet does not correspond to

an established connection and it is not a connection initiation packet, then the dispatcher drops the

packet.

The traffic sent from the cluster servers to the web clients travels through the dispatcher since the

source address on the response packets is the address of the particular server that serviced the request,

not the cluster address. The dispatcher rewrites the source address to the cluster address, re-computes

the integrity codes, and forwards the packet to the web client.

Figure 12: The traffic flow in an L4/3 based cluster

Figure 12 illustrates the traffic flow in an L4/3 clustered environment [45]. A web client sends an

HTTP packet with A as the destination IP address (1). The immediate router sends the packet to the

dispatcher (2), since the dispatcher machine is the owner of the IP address A. Based on the traffic

distribution algorithm and the session table, the dispatcher decides to forward this packet to the back-

end server, Server 2 (3). The dispatcher then rewrites the destination IP address as B2, recalculates

the IP and TCP checksums, and sends the packet to B2 (3). Server 2 accepts the packet and replies to

the client via the dispatcher (4), which the back-end server sees as a gateway. The dispatcher rewrites

the source IP address of the replying packet as A, recalculates the IP and TCP checksums, and sends

the packet to the web client (5).
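
A simplified model of the L4/3 rewriting is sketched below. The checksum function is the standard 16-bit ones'-complement Internet checksum computed over a byte string; a real dispatcher would update the IP and TCP checksums (including the TCP pseudo-header) incrementally and operate on actual packet buffers rather than dictionaries.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over a byte string (simplified)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

def rewrite_inbound(packet: dict, chosen_server_ip: str) -> dict:
    """Client to cluster: retarget the packet at the chosen cluster server."""
    packet["dst_ip"] = chosen_server_ip
    packet["checksum"] = internet_checksum(packet["payload"])
    return packet

def rewrite_outbound(packet: dict, cluster_ip: str) -> dict:
    """Server to client: make the reply appear to come from the cluster address."""
    packet["src_ip"] = cluster_ip
    packet["checksum"] = internet_checksum(packet["payload"])
    return packet
```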

RFC 2391, Load Sharing using IP Network Address Translation, presents the L4/3 clustering

approach [48]. The LSNAT from the University of Nebraska-Lincoln provides a non-kernel space

implementation of the L4/3 clustering approach [49]. Section 2.10.4 discusses the project and the

implementation.

L4/2 clustering theoretically outperforms L4/3 clustering because of the overhead that L4/3 clustering imposes: the necessary integrity code recalculation, coupled with the fact that all traffic must flow through the dispatcher, means that an L4/3 dispatcher processes more traffic than an L4/2 dispatcher does. Therefore, the total data throughput of the dispatcher limits the scalability of the system more than the sustainable request rate does.

2.6.3 L7 Clustering

The level 7 web switch works at the application layer. The web switch establishes a connection with the web client and inspects the HTTP request content to decide about dispatching. The L7 clustering technique is also known as content-based dispatching since it operates based on the contents of the client request. The Locality-Aware Request Distribution (LARD) dispatcher developed by researchers at Rice University is an example of L7 clustering. LARD partitions a web document tree into disjoint sub-trees. The dispatcher then allocates each server in the cluster one of these sub-trees to serve. As such, LARD provides content-based dispatching as the dispatcher receives web client requests.

Figure 13: The process of content-based dispatching – L7 clustering model

Figure 13 presents an overview of the processing with the L7 clustering approach [45]. Server 1 processes requests of type a; Server 2 processes requests of types b and c. The dispatcher separates the stream of requests into two streams: one stream for Server 1 with requests of type a, and one stream for Server 2 with requests of types b and c. As requests arrive from clients for the

web cluster, the dispatcher accepts the connection and the request. It then classifies the requested

document and dispatches the request to the appropriate server. The dispatching of requests requires

support from a modified kernel that enables the connection handoff protocol. After establishing the

connection, identifying the request, and choosing the cluster server, the dispatcher informs the cluster

server of the status of the network connection, and the cluster server takes over that connection, and

communicates directly with the web client. Following this approach, the LARD allows the file system

cache of each cluster server to cache a separate part of the web tree rather than having to cache the

entire tree, as is the case with L4/2 and L4/3 clustering. Additionally, it is possible to have

specialized server nodes, where for instance, the dynamically generated content is offloaded to special

compute servers while other requests are dispatched to servers with less processing power. The

LARD requires modifications to the operating system on the servers to support the TCP handoff

protocol.
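
The partitioning idea behind content-based dispatching can be sketched as a URL-prefix map; the prefixes and server names below are placeholders, and the TCP connection handoff itself is outside the scope of this sketch.

```python
# Each URL prefix (a disjoint sub-tree of the document tree) is owned by one
# server, so each server's file system cache holds only its own sub-tree.
PARTITION = {
    "/images/": "server1",
    "/docs/":   "server2",
    "/cgi/":    "server3",   # e.g. dynamic content offloaded to a compute node
}
DEFAULT_SERVER = "server1"

def dispatch(request_path: str) -> str:
    """Choose the cluster server responsible for the requested document."""
    for prefix, server in PARTITION.items():
        if request_path.startswith(prefix):
            return server
    return DEFAULT_SERVER

print(dispatch("/images/logo.png"))   # -> server1
print(dispatch("/docs/thesis.pdf"))   # -> server2
```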

2.6.4 Discussion of the OSI Layer Clustering Techniques

We broadly classify transparent server clustering into three categories: L4/2, L4/3, and L7. Table 4

summarizes these technologies and highlights their advantages and disadvantages.

L4/2
- Mechanism: Link-layer address translation
- Flows: Incoming only
- HA and fault tolerance: Varies; several single points of failure
- Restrictions: Incoming traffic passes through the dispatcher
- Bottleneck: Connection dispatching
- Client information: At TCP/IP level

L4/3
- Mechanism: Network address translation
- Flows: Incoming and outgoing
- HA and fault tolerance: Varies; several single points of failure
- Restrictions: Dispatcher lies between client and server; all incoming and outgoing traffic passes through the dispatcher
- Bottleneck: Integrity code calculations
- Client information: At TCP/IP level

L7
- Mechanism: Content-based routing
- Flows: Varies
- HA and fault tolerance: Varies; several single points of failure
- Restrictions: Incoming traffic passes through the dispatcher
- Bottleneck: Connection dispatching, dispatcher complexity
- Client information: TCP/IP information and HTTP header content

Table 4: Advantages and drawbacks of clustering techniques operating at the OSI layer

Each of the approaches creates bottlenecks that limit scalability and presents several single points of failure. For L4/2 dispatchers, system performance is constrained by the ability of the dispatcher to set up, look up, and tear down entries; therefore, the most telling performance metric is the sustainable request rate. The limitation of L4/3 dispatchers is their ability to rewrite and recalculate the checksums for the massive numbers of packets they process. Therefore, the most telling performance


metric is the throughput of the dispatcher. Lastly, the L7 clustering approach has limitations related to

the complexity of the content-based routing algorithm and the size of the dispatcher's cache.

2.7 Clustering Web Servers

Clustering for web servers is a technique in which we group together two or more web servers to

collectively accommodate increases in load and provide system redundancy. Figure 14 illustrates an

example of a web cluster that consists of five cluster nodes. The nodes share access to a common

network, which connects them to the public Internet. The figure does not specify whether the nodes share access to a common disk repository or have private disk storage. Web clients are

online users of the system; they connect to the Internet and request data from the web server cluster,

which consists of multiple nodes each running an independent copy of the web server software. We

accomplish clustering using software, hardware, or a combination of both.

Figure 14: A web server cluster

The following sub-sections explore the software and hardware techniques used to build web clusters.

2.7.1.1 Hardware Based Clustering Solutions

Figure 15 illustrates the hardware clustering approach using a network device called the packet router.

The packet router sits in front of a number of web servers and directs incoming HTTP requests to

available web servers in the cluster. The web server software running on the cluster nodes has access to the same storage repository. The packet router distributes traffic to the cluster servers based on a predefined distribution policy such as round robin. The router device, in conjunction with the web

servers, comprises a virtual server.


Load balancing switches, such as the Cisco LocalDirector [50], redirect TCP requests to servers

belonging to a cluster. The LocalDirector provides traffic distribution service by presenting a

common virtual IP address to the web clients and then forwarding their incoming requests to an

available node within the web cluster. Using heartbeat algorithms, these switches frequently ping the

servers in the cluster to ensure they are still available, making them a better solution than basic round robin DNS [51][83].
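As a rough illustration of this health-checking idea (a hedged sketch only; the node addresses, port, and timeout are hypothetical and unrelated to the LocalDirector implementation), a dispatcher can periodically probe each node and keep only the responsive ones in its distribution pool:

    import socket

    # Hypothetical back-end nodes; a real device tracks its configured pool.
    NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

    def is_alive(host, port=80, timeout=1.0):
        """Probe a node by attempting a TCP connection to its service port."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def healthy_nodes():
        """Return only the nodes that answer the probe; traffic is distributed
        among these, so a failed server is skipped automatically."""
        return [host for host in NODES if is_alive(host)]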

Figure 15: Using a router to hide the web cluster

Hardware-based clustering solutions use routers to provide a single IP interface to the cluster and to

distribute traffic among the cluster nodes. These solutions are a proven technology and are simple by design. However, they have certain limitations: limited intelligence, lack of awareness of the applications running on the cluster nodes, and the presence of a single point of failure (SPOF).

Limited intelligence: Packet routers can load balance in a round robin fashion, and some can detect

failures and automatically remove failed servers from a cluster and redirect traffic to other nodes.

These routers are not fully intelligent network devices. They do not provide application-aware traffic

distribution. While they can redirect requests upon discovering a failure, they do not allow

configuring redirection thresholds for individual servers in a cluster, and therefore, they are unable to

manage load to prevent failures.

Lack of Dynamism: A router cannot measure the performance of a web application server or make an

intelligent decision on where to route the request based on the load of the cluster node and its

hardware characteristics.


Single point of failure: The packet router constitutes a SPOF for the entire cluster. If the router fails, the

cluster is not accessible to end users and the service becomes unavailable.

2.7.1.2 Software Clustering Solutions

Software solutions are another approach to clustering. A number of different clustering packages exist

ranging from various platform implementations of the domain name system (DNS) to virtual server

implementations that present a cluster of servers to the outside world as a single server. Section 2.6

presented the OSI layer clustering techniques. This section presents and discusses other software

clustering techniques, in addition to some implementations of the OSI layer techniques.

One method of software clustering is the round robin DNS, which involves setting up the DNS server

of the web site to return the set of all the IP addresses of the servers in the cluster in a different order

on each successive request [51][52]. The client forwards the request to the first IP address in the list

of IP addresses returned. Consequently, the request arrives at a different server in the cluster, and the

traffic is distributed across the cluster servers. This distribution scheme has some disadvantages. First,

the DNS server has no way of knowing if one or more of the servers in the cluster is overloaded, out

of service, or if the application on that server is still up and running. Second, the servers inside the

cluster can be of different hardware configurations (processor, memory, network bandwidth, and

disk) and the DNS is not aware of the heterogeneous nature of the nodes inside the cluster. As a result, the load is not assigned to the server nodes based on their potential capacity, which leads to overloading some nodes and under-using others. While round robin DNS is a popular choice because

of its relative simplicity and low implementation cost, it does not demonstrate intelligence in

distributing the traffic load or reacting to node failures. Therefore, it does not help prevent server

overloads and eventual failures for sites with high traffic.
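The rotation itself is straightforward. The following hedged sketch (the host names and addresses are invented for illustration and do not correspond to any real DNS implementation) shows how a DNS server can rotate the list of addresses it returns on each successive query:

    from itertools import cycle

    # Hypothetical address pool for one site name; a real DNS server would
    # read these records from its zone data.
    ADDRESSES = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
    _rotation = cycle(range(len(ADDRESSES)))

    def answer_query():
        """Return the full address list, rotated one position per query, so
        successive clients start with a different first address."""
        start = next(_rotation)
        return ADDRESSES[start:] + ADDRESSES[:start]

    # Three successive queries return lists starting with .1, .2, and .3.
    # Clients typically connect to the first address returned, which spreads
    # connections across the cluster with no knowledge of server load.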

One example of a commercial clustering solution is the Microsoft Cluster Server (MSCS) [53].

MSCS is a built-in feature of Windows NT Server, Enterprise Edition. It supports the connection of

two servers into a "cluster" for higher availability and easier manageability of data and applications. It

supports the active/standby redundancy model in which two cloned systems provide redundancy for

one another. The MSCS can automatically detect and recover from server or application failures. It

can be used to move server workload to balance utilization and allow scheduled maintenance without

downtime. The initial release of MSCS supports clusters with two servers. Microsoft Corporation

announced its plans to release the MSCS Phase 2 in the third quarter of 2006. MSCS Phase 2


promises to support larger clusters and to include enhanced services to simplify the creation of highly

scalable, cluster-aware applications [54]. The current version of MSCS suffers from scalability issues

as it only supports two servers that require upgrading as the traffic increases.

The Linux Virtual Server (LVS) is an open source project that aims to provide a high performance

and highly available software clustering implementation for Linux [23]. It implements layer 4

switching in the Linux kernel, providing a virtual server layer built on a cluster of real servers and

allowing TCP and UDP sessions to be load balanced between multiple real servers. The virtual

service is defined by either an IP address or a port and protocol. The front-end of the real servers is a load balancer, which schedules requests to the different servers and makes the parallel services of the cluster appear as a single virtual service on a single IP address. The architecture of the cluster is transparent to end users, who only see the address of the virtual server. The LVS is available in three

different implementations [55]: Network Address Translation (NAT), Direct Routing (DR), and IP

tunneling. Sections 3.5.1, 3.5.2, and 3.5.3 present and discuss the NAT, DR, and IP tunneling

methods, respectively.
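As a rough, user-space approximation of what a layer-4 virtual server does (LVS itself forwards packets inside the kernel; the addresses and ports below are hypothetical), the following sketch accepts connections on a single front-end address, picks a real server for each new connection, and relays the bytes in both directions, which is conceptually what NAT-mode forwarding achieves:

    import socket
    import threading
    from itertools import cycle

    REAL_SERVERS = [("10.0.0.1", 80), ("10.0.0.2", 80)]  # hypothetical pool
    _pick = cycle(REAL_SERVERS)

    def _pipe(src, dst):
        """Copy bytes one way until the connection closes."""
        try:
            while chunk := src.recv(4096):
                dst.sendall(chunk)
        finally:
            dst.close()

    def serve(virtual_addr=("0.0.0.0", 8080)):
        """Accept client connections on the virtual address and relay each
        one to the next real server (per-connection scheduling granularity)."""
        listener = socket.create_server(virtual_addr)
        while True:
            client, _ = listener.accept()
            backend = socket.create_connection(next(_pick))
            threading.Thread(target=_pipe, args=(client, backend), daemon=True).start()
            threading.Thread(target=_pipe, args=(backend, client), daemon=True).start()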

We have experimented with both the NAT and DR methods. Section 3.5.4 presents the benchmarking

results comparing the performance of both methods. Each of these techniques for providing a virtual

interface to a web cluster has its own advantages and disadvantages. Based on our lab experiments

discussed in Chapter 3, we concluded that the common disadvantage among these schemes is their

limited scalability (Figure 31). When the traffic load increases, the load balancer becomes a

bottleneck for the whole cluster, and the director either crashes under heavy load or stops accepting new incoming requests. In both cases, the director replies very slowly to ongoing requests.

Software clustering solutions have three main advantages that make them a better alternative to hardware clustering solutions: flexibility, intelligence, and availability. First, software clustering solutions can augment existing hardware devices, thereby providing a more robust traffic distribution and failover solution. Additionally, by integrating hardware with software, an organization diminishes, if not eliminates, losses on the capital expenditures it has already made. Secondly, they provide a level of intelligence that enables preventive traffic distribution measures that minimize the chance of servers becoming unavailable. In the event that a server becomes overloaded or fails, some software can automatically detect the problem and reroute HTTP requests to other nodes in the cluster.

Thirdly, with software clustering solutions, we can support high availability capabilities to avoid


single points of failure. An individual server failure does not affect the service availability since

functionalities and failover capabilities are distributed among the cluster servers.

However, we need to consider several issues when evaluating cluster software solutions, mainly the

differences among feature sets, the platform constraints, and their HA and scaling capabilities. Software clustering solutions have different capabilities and features, such as their ability to provide automatic failure detection, notification, and recovery. Some solutions have significantly delayed

failure detection; others allow the configuration of the load thresholds to enable preventive measures.

In addition, they can support different redundancy models such as the 1+1 active/standby, 1+1

active/active, N+M and N-way. Therefore, we need to determine the needs or requirements for

scalability and failover and pick the solution accordingly. In addition, software solutions have limited

platform compatibility; they are available to run only on specific operating systems or computing

environments. Furthermore, the capability of the clustering solution to scale is important. Some

solutions have limited capabilities restricted to four, eight, or 16 nodes, and therefore have scaling

limitations.

2.8 Scalability in Internet and Web Servers

Scalability is the ability of an application server, such as a web server, to grow to support a large

number of concurrent users without performance degradation. Generally, scalability refers to how

well a hardware or software system can adapt to increased demands. Perfect scalability is linear: if we

double the number of processors in the system, we expect the system to serve double the number of

requests per second it normally serves. We consider this optimum performance. Unfortunately, this is

not the case in real systems.
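As a hedged worked example with hypothetical numbers: if a single node sustains 500 requests per second, perfectly linear scaling would let an 8-node cluster sustain 8 x 500 = 4,000 requests per second; a measured rate of, say, 3,400 requests per second would correspond to a scaling efficiency of 3,400 / 4,000 = 0.85, or 85 percent of the linear ideal.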

As the number of online services continues to grow as well as the number of Internet users, Internet

and web servers have to meet new requirements in areas of availability, scalability, performance,

reliability, and security. They have to cope with the explosive growth of the Internet [1][2],

continuous increasing traffic, and meet all expectations in terms of stringent requirements in those

areas. One additional capability includes supporting geographical mirroring for increased high

availability, especially when providing critical services such as electronic banking and stocks

brokering. However, these servers experience scalability problems – they are slow; they are

continuously hacked and attacked by malicious users, and at times they are not available for service,

making businesses liable to losing money when users are not able to access the services [56][57][58].


As such, scalability presents itself as a crucial factor for the success or failure of online services and it

is certainly one important challenge faced when designing servers that provide interactive services for

a wide clientele.

Many factors can negatively affect the scalability of systems [59]. The first common factor is the growth of the user base, which causes serious capacity problems for servers that can only serve a certain number of transactions per second. If the server is not able to cope with the increased number of users and traffic, it starts rejecting requests. A second key factor negatively affecting the scalability of servers is the number and size of data objects; in particular, the size of audio and video files strains the network and I/O capacity, causing scalability problems. The increasing amount of accessible data makes data search, access, and management more difficult, which causes processing problems and eventually leads to rejecting incoming requests. Finally, non-uniform request distribution strains the servers and network at certain times of the day or for certain requested data. These

factors can cause servers to suffer from bottlenecks, and run out of network, processing, and I/O

resources.

2.8.1 Scalability in Telecommunication Servers

Telecommunication servers provide a wide range of IP applications for mobile phones. These servers

require scaling capabilities to support an increasing number of subscribers and services, and they must satisfy some of the most rigorous requirements in terms of reliability, performance, availability, and scalability [58][59]. They aim to provide five nines (99.999%) availability, which translates into a maximum of roughly five minutes and fifteen seconds of downtime per year, including downtime associated with the operating system, software upgrades, and hardware maintenance. These expectations place an unprecedented burden on

telecommunication equipment manufacturers to ensure that all the elements needed to support a

service are functioning whenever a user requests that particular service.

While achieving 100 percent uptime is desirable, it is difficult to achieve [60]. A fault-tolerant

system needs to function correctly, given the small but practically inevitable presence of a fault in the

system. Supporting five nines service availability depends on the near-flawless interaction of

applications, operating system, management middleware, hardware, as well as on environmental and

operational factors.

The requirements of telecommunication servers are getting more complex as the telecom industry is

moving towards an always connected, always online paradigm [61] with a new suite of third


generation interactive and multimedia services. These servers suffer from scalability problems as the

number of mobile subscribers is increasing at a fast pace [1]. To cope with the increased number of

users and traffic, mobile operators are resorting to upgrading servers or buying new servers with more

processing power [59], a process that proved to be expensive and iterative. According to Ericsson

Research, the growth rate of mobile subscribers in 2004 was approximately 500,000 users per day

[62]. This raises the question of whether the Mobile Internet servers and the applications running on

those servers will be able to cope with such growth.

When servers are not able to cope with increased traffic, the result is a failure to meet the high expectations of paying customers, who expect services to be available at all times with acceptable performance levels [63][64], and a failure to meet and manage service level agreements. Service level agreements dictate the percentage of time services will be available, the number of users that can be served simultaneously, specific performance benchmarks to which actual performance will be periodically compared, and access availability. If ISPs, for instance, are not able to cope with the increasing number of users, they will break their service level agreements, causing them to lose money and potentially lose customers. Similarly, mobile operators face significant financial losses if their servers are not available to their subscribers.

2.8.2 How Users Experience Scalability

Variations in the scalability of a system are transparent to the end users. Users experience the capacity

of a system in various ways such as how fast and accurately the system responds to their requests.

The user expects to receive the response in an acceptable time with no errors [13][14][15]. Therefore,

the server needs to adapt to different numbers of users and amounts of data, without resource

problems or performance bottlenecks. We can measure these qualities via the system response time,

availability, and reliability of the server.

2.8.2.1 Response time

The response time is the interval between the moment a user gives an input, or posts a request, and the moment the user receives an answer from the server. Total response time includes the time to connect, the time to process the request on the server, and the time to transmit the response back to the client:

Total response time = connect time + process time + response transit time
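For illustration, with hypothetical values of 20 ms to connect, 50 ms of server processing, and 30 ms of response transit, the total response time is 20 + 50 + 30 = 100 ms; as the network approaches saturation, the transit term grows and comes to dominate the total.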


When throughput is low, the response transit time is insignificant. However, as throughput

approaches the limit of network bandwidth, the server has to wait for bandwidth to become available

before it can transmit the response.

The response time in a distributed system consists of all the delays created at the source site, in the

network, and at the receiver site. The possible reasons for the delays and their length depend on the

system components and the characteristics of the transport media. The response time consists of the

delays in both directions.

2.8.2.2 Availability and Reliability

The availability of a service depends on the reliability of the server and network components

providing the service, and on the system architecture. Availability is the percentage of time the system has been able to provide its service. Reliability, on the other hand, is the ability of the server to respond accurately to

requests without errors.

The system designer needs to design and build the system so that a failure of one server or network

link does not cause the service to become unavailable. We can avoid such situations by duplicating

the service to several servers and having optional redundant network routings to them. In such a

system, the failure of a server or a network link only means a loss of capacity, but the system keeps working.

Since web servers offer interactive services, we expect them to be available to service requests.

When the service is not available, the system needs to recover from the faults with minimal degradation of service.

2.8.3 Current Scalability Methods for Web Servers

Web servers need to be able to scale and maintain high availability, reliability, and performance as

the number of simultaneous users and traffic increases. To overcome the overloading problem of web

servers, there are two types of solutions. The first, called the single server, consists of upgrading the

server to a higher performance server. However, the new server will become overloaded as requests increase, so administrators will have to upgrade it again. The upgrading process is complex and

the cost is high. The second, called the multi-server, consists of building a scalable server on a cluster

of servers relying on clustering techniques. When the load increases, we can add more servers into the

cluster to meet the demands; however, it is not as easy as it sounds. In reality, as we add more servers

into the cluster, the total number of transactions per server declines.


The widely deployed scaling methods for clustered web servers are round robin DNS and packet router devices that distribute incoming traffic.

2.8.3.1 Round Robin DNS

The round robin DNS maps a single server name to the different IP addresses in a round robin manner. Following this approach, the round robin DNS distributes the traffic among the servers and maps clients to different servers in the cluster. The round robin DNS method is not very efficient.

Because of the caching nature of clients and the hierarchical DNS system, it easily leads to dynamic

load imbalance among the servers, making it unrealistic for a server to handle its peak load.

It is difficult to choose the Time-To-Live (TTL) value of a name mapping: with small values, round

robin DNS is a bottleneck, and with high values, the dynamic load imbalance gets worse. Even when

the TTL value is set to zero, the scheduling granularity is per host; different users' access patterns may

lead to dynamic load imbalance, because some people pull many pages from the site, and others just

surf a number of pages and go away.

Furthermore, round robin DNS is not a reliable mechanism. When a cluster node fails, the clients whose name mapping points to that node's IP address discover that the node is down; the problem persists even if the users

press the reload or refresh button in their web browsers.

2.8.3.2 Load Balancer Network Device

A different scaling method is to use a network device called the load balancer to distribute incoming

traffic among the servers. A load balancer can be implemented either in software, such as the LVS, or

as a hardware solution, such as the Cisco LocalDirector [50]. These solutions generally forward

HTTP requests to the web servers inside the cluster, receive the result, and forward it to the clients.

The service provided by the servers constitutes the web cluster that appears to end-users as a virtual

server made visible through a single IP address. The scheduling granularity is per connection, which

enables a sound traffic distribution among the servers. We can achieve traffic distribution at two

levels: application-level and IP-level. Chapter 3 expands on the software and hardware distribution

methods, and discusses the results of our benchmark testing.


2.8.4 Principles of Scalable Architecture

This section discusses the principles of scalable architectures. After presenting the HAS architecture

in Chapter 4, we discuss how the HAS architecture design meets the architectural scaling principles presented in this section.

We can characterize the applications by their consumption of four primary system resources:

processor, memory, file system bandwidth, and network bandwidth. We can achieve scalability by

simultaneously optimizing the consumption of these resources and designing an architecture that can

grow modularly by adding more resources.

Several design principles are required to design scalable systems. The list includes divide and

conquer, asynchrony, encapsulation, concurrency, and parsimony [65]. Each of these principles

presents a concept that is important in its own right when designing a scalable system. There are also

tensions between these principles; we can sometimes apply one principle at the cost of another. The

root of a solid system design is to strike the right balance among these principles. In the following

subsections, we present each of these principles.

2.8.4.1 Divide and Conquer

The divide and conquer principle indicates that the system needs to be partitioned into relatively small

sub-systems, each carrying out some well-focused function [65]. This permits deployments that can

leverage multiple hardware platforms or simply separate processes or threads, thereby dispersing the

load in the system and enabling various forms of load balancing (or traffic distributing) and tuning.

The divide and conquer principle varies slightly from the concept of modularization, as it addresses

the partitioning of both code and data and it can approach the problem from either side.

2.8.4.2 Asynchrony

The asynchrony principle means that the system carries out the work based on available resources

[65]. Synchronization constrains a system under load because application components cannot process

work in random order, even if resources do exist to do so. Asynchrony decouples functions and lets

the system schedule resources more freely and thus potentially more completely. This principle

allows us to implement strategies that effectively deal with stress conditions such as peak load.
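As a hedged illustration of the principle (a toy sketch only; handle_request and its delays are invented and do not belong to any system described in this chapter), an asynchronous server keeps making progress on other requests while slow operations are pending instead of blocking on each one in turn:

    import asyncio

    async def handle_request(request_id, delay):
        """Simulate a request whose I/O takes 'delay' seconds; awaiting the
        sleep yields control so other requests can proceed in the meantime."""
        await asyncio.sleep(delay)
        return f"request {request_id} done"

    async def main():
        # Three requests with different service times run concurrently; the
        # total elapsed time is close to the longest delay, not the sum.
        results = await asyncio.gather(
            handle_request(1, 0.3),
            handle_request(2, 0.1),
            handle_request(3, 0.2),
        )
        print(results)

    asyncio.run(main())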


2.8.4.3 Encapsulation

The encapsulation principle is the concept of building the system using loosely coupled components,

with little or no dependence among components [65]. This principle often, but not always, correlates

with asynchrony. Highly asynchronous systems tend to have well encapsulated components and vice

versa. Loose coupling means that components can pursue work without waiting for work from others.

2.8.4.4 Concurrency

The concurrency principle means that there are many moving parts in a system and the goal is to split

the activities across hardware, processes, and threads [65]. Concurrency aids scalability by ensuring

that the maximum possible work is active at all times and addresses system load by spawning new

resources on demand within predefined limits. Concurrency also maps directly to the ability to scale

by rolling in new hardware. The more concurrency applications exploit, the better the possibilities to

expand by adding new hardware.
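A small hypothetical sketch of the same idea using a bounded worker pool (the task function and limit are illustrative only): work is spread across threads up to a predefined limit, so exploiting more hardware is largely a matter of raising that limit or adding nodes:

    from concurrent.futures import ThreadPoolExecutor

    def serve(request_id):
        """Stand-in for the per-request work a server node would perform."""
        return f"served {request_id}"

    # Spawn workers on demand, but only within a predefined limit, so the
    # system exploits concurrency without exhausting its resources.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(serve, range(100)))
    print(len(results))  # 100 requests handled by at most 8 concurrent workers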

2.8.4.5 Parsimony

The parsimony principle indicates that the designer of the system needs to be economical in what he

or she designs [65]. Each line of code and each piece of state information has a cost, and, collectively,

the costs can increase exponentially. A developer has to ensure that the implementation is as efficient

and lightweight as possible. Paying attention to thousands of micro details in a design and

implementation can eventually pay off at the macro level with improved system throughput.

Parsimony also means that designers carefully use scarce or expensive resources. No matter what

design principle a developer applies, a parsimonious implementation is appropriate. Some examples

include algorithms, I/O, and transactions. Parsimony ensures that algorithms are optimal to the task

since several small inefficiencies can add up and kill performance. Furthermore, performing I/O is

one of the more expensive operations in a system and we need to keep I/O activities to the bare

minimum. Moreover, transactions constrain access to costly resources by imposing locks that prohibit

read or write operations. Applications should work outside of transactions whenever feasible and exit each transaction in the shortest time possible.


2.8.5 Strategies for Achieving Scalability

Section 2.8.4 presented the five principles of scalable architectures. This section presents the design

strategies to achieve a scalable architecture.

2.8.5.1 System Partitioning

A notable characteristic of a scalable system is its ability to balance the load. As the system scales up

to meet demand, it should do so by maximizing resource utilization. In the case of a clustered web

server, the web traffic is distributed among the cluster nodes.

Partitioning breaks system software into domain components with well-bounded functionality and a

clear interface. Each component defines part of the architectural conceptual model and a group of

functional blocks and connectors. Ultimately, the goal is to partition the solution space into

appropriate domain components that map onto the system topology in a scalable manner.

The principles that apply in this strategy include divide and conquer, asynchrony, encapsulation, and

concurrency.

2.8.5.2 Service Based Layered Architecture

By making services available to a client’s application layer, service-based architectures offer an

opportunity to share a single component across many different systems. Having a service front-end allows a component to offer different access rights and visibility to different clients.

The principles that apply in this strategy include divide and conquer and encapsulation.

2.8.5.3 Keeping it Simple

Donald A. Norman warns in the preface of his book “The Design of Everyday Things”, “Rule of

thumb: if you think something is clever and sophisticated, beware — it is probably self-indulgence”

[66]. The software development cycle’s elaboration phase is an iterative endeavor. In accordance with

Occam's Razor [65], when faced with two design approaches, choose the simpler one. If the simpler

solution proves inadequate to the purpose, consider the more complicated counterpart in the project’s

later stages.

2.9 Overview of Related Work

Server scalability is a recognized research area both in academia and in industry, and it is an essential factor in client/server dominated network environments. Researchers around the world


are investigating clusters and commodity hardware as an alternative to expensive specialized

hardware for building scalable Internet and web servers. Although the area of cluster computing is

relatively new, there is an abundance of research projects in the areas of cluster computing and

scalable cluster-based servers. Section 2.10 examines six projects that are close to our work in terms

of scope and goal, discusses their approaches, advantages, and drawbacks, examines their prototypes and implementations, and presents the lessons we learned from their experiences.

This section serves as a brief overview of some of the other surveyed work that helped us get a better

understanding of the areas of research related to scalable web servers.

In [63], the authors present and discuss a method for traffic distribution inside a cluster of web servers

that uses distributed packet rewriting.

In [67], the authors address strategies for designing a scalable architecture: divide and conquer, asynchrony, encapsulation, concurrency, and parsimony. The authors argue that the strategies

presented are not comprehensive; however, they represent critical strategies for server-side scaling.

In [68], the authors propose a new scheduling policy, called multi-class round robin (MC-RR), for

web switches operating at layer 7 of the Open Systems Interconnection (OSI) protocol stack to

route requests reaching the web cluster. The authors demonstrate through a set of simulation

experiments that MC-RR is more effective than round robin for web sites providing highly dynamic

services.

In [69], the authors describe their architecture for a web server designed to cope with the ongoing

increase of the Internet requirements. The proposed architecture addresses the need for a powerful

data management system to cope with the increase in the complexity of users’ requests, and a caching

mechanism to reduce the amount of redundant traffic. The authors extend the architecture with a caching system that builds up an adaptive hierarchy of caches for web servers, which allows them to keep up with the changes in the traffic generated by the applications they are running.

In [70], the authors present a comparison of different traffic distribution methods for HTTP traffic in

scalable web clusters. The authors present a classification framework for the different load balancing

methods and compare their performance. In particular, they discuss the rotating name server method

in comparison with an alternative load balancing method based on remapping requests and responses in the network. Their results demonstrate that remapping requests and responses yields better results than the rotating name server method.


The researchers at the Korea Advanced Institute of Science and Technology have developed an

adaptive load balancing method that changes the number of scheduling entities according to different

workload [71]. It behaves exactly like a dispatcher-based scheme under low or intermediate workload, taking advantage of fine-grained load balancing. When the dispatcher is overloaded, the DNS servers distribute the dispatching jobs to other entities such as the back-end servers. In this way, they relieve the hot spot at the dispatcher. Based on simulation results, they demonstrated that the adaptive

dispatching method improves the overall performance on realistic workload simulation.

In [72], the authors present and evaluate an implementation of a prototype scalable web server

consisting of a balanced cluster of hosts that collectively accept and service TCP connections. The

host IP addresses are advertised using the round robin DNS technique, allowing any host to receive requests

from a client. They use a low-overhead technique called the distributed packet rewriting (DPR) to

redirect TCP connections. Each host keeps information about the remaining hosts in the system. Their

performance measurements suggest that their prototype outperforms round robin DNS. However,

their benchmarking was limited to a five-node cluster, where each node reached a peak of 632

requests per second, compared to over 1,000 requests per second per node we achieved with our early

prototype (Section 3.7).

In [45], the authors discuss clustering as a preferred technique to build scalable web servers. The

authors examine early products and a sample of contemporary commercial offerings in the field of

transparent web server clustering. They broadly classify transparent server clustering into three

categories: L4/2, L4/3, and L7 clustering, and discuss their advantages and disadvantages.

In [73], the authors present their two implementations for traffic manipulation inside a web cluster:

MAC-based dispatching (LSMAC) and IP-based dispatching (LSNAT). The authors discuss their

results, and the advantages and disadvantages of both methods. Section 2.10.4 discusses those

approaches.

The researchers from Lucent Technologies and the University of Texas at Austin present in [74] their

architecture for a scalable web cluster. The distributed architecture consists of independent servers

sharing the load through a round robin traffic distribution mechanism.

In [75], the authors present optimizations to the NCSA HTTP server [76] to make it more scalable

and allow it to serve more requests.


2.10 Related Work: In-depth Examination

This section discusses six projects that share the common goal of increasing the performance and

scalability of web clusters. These projects had different focus areas such as traffic distribution

algorithms, new architectures, and presenting the cluster as a single server through a virtual IP layer. This

section examines these research projects, presents their respective areas of research and their architectures, highlights their status and plans, and discusses the contributions of their research to our work. The

works discussed are the following:

- “Redirectional-based Web Server Architecture” at University of Texas (Austin): The goal of this

project is to design and prototype a redirectional-based hierarchical architecture that eliminates

bottlenecks in the cluster and allows the administrator to add hardware seamlessly to handle

increased traffic [77]. Section 2.10.1 discusses this project.

- “Scalable policies for Scalable Web clusters” at the University of Roma: The goal of the project

is to provide scalable scheduling policies for web clusters [68][78]. Section 2.10.2 discusses this

project.

- “The Scalable Web Server (SWEB)” at the University of California (Santa Barbara): The project

investigates the issues involved in developing a scalable web server on a cluster of workstations.

The objective is to strengthen the processing capabilities of such servers by utilizing the power of

computers to match the huge demand in simultaneous access requests from the Internet [78].

Section 2.10.3 discusses this project.

- “LSMAC and LSNAT”: The project at the University of Nebraska-Lincoln investigates server

responsiveness and scalability in clustered systems and client/server network environments [79].

The project is focusing on different server infrastructures to provide a single entry into the cluster

and traffic distribution among the cluster nodes [73]. Section 2.10.4 examines the project and its

results.

- “Harvard Array of Clustered Computers (HACC)”: The HACC project aims to design and

prototype a cluster architecture for scalable web servers [81]. The focus of the project is on a

technology called “IP Sprayer”, a router component that sits between the Internet and the cluster

and is responsible for traffic distribution among the nodes of the cluster [82]. Section 2.10.5

discusses this project.

- “IBM Scalable and Highly Available Web Server”: This project is investigating scalable and

highly available web clusters. The goal of the project is to develop a scalable web cluster that


will host web services on IBM proprietary SP-2 and RS/6000 systems [83]. Section 2.10.6

discusses this project.

2.10.1 Hierarchical Redirection-Based Web Server Architecture

The University of Texas at Austin and Bell Labs are collaborating on a research project to design

a scalable web cluster architecture. The authors confirm that a distributed architecture consisting of

independent servers sharing the load is most appropriate for implementing a scalable web server [76].

Their study of currently available approaches, such as RR DNS, demonstrates that these approaches

do not scale effectively because they lead to bottlenecks in different parts of the system. This section

presents their redirection-based hierarchical server architecture, discusses its advantages and

drawbacks, and presents their contributions to our work.

The study proposes a redirectional-based hierarchical server architecture that includes two levels of

servers: redirectional servers and normal HTTP servers. The administrator of the cluster partitions

data and stores it on different cluster nodes. The redirectional servers distribute the requests of the web

users to the corresponding HTTP server.

Figure 16: Hierarchical redirection-based web server architecture

Figure 16 illustrates the architecture of the hierarchical redirection based web server approach. Each

HTTP server stores a portion of the data available at the site. The round robin DNS distributes the

load among the redirection servers [75]. The redirection servers in turn redirect the requests to the

HTTP servers where a subset of the data resides. The redirection mechanism is part of the HTTP


protocol and it is completely transparent to the user. The browser automatically recognizes the

redirection message, derives the new URL from it, and connects to the new server to fetch the file.

The original goal of the redirection mechanism supported in HTTP was to facilitate moving files from

one server to another. When a client uses the old URL from its cache or from a bookmark, and the file referenced by the old URL has moved to a new server, the old server returns a redirection

message, which contains the new URL. The cluster administrator partitions the documents stored at

the site among the different servers based on their content. For instance, server 1 could store stock

price data, while server 2 stores weather information and server 3 stores movie clips and reviews. All

requests for stock quotes are directed to server 1 while requests for weather information are directed

to server 2.

It is possible to implement the architecture described with server software modifications. However, in

order to provide more flexibility in load balancing and additional reliability, there is a need to

replicate contents on multiple servers. Implementing data replication requires modifying the data

structure containing the mapping information. If there is replication of data, a logical file name is

mapped to multiple URLs on different servers. In this case, the redirection server has to choose one of

the servers containing the relevant data. Intelligent strategies for choosing the servers can

be implemented to better balance the load among the HTTP servers. Many approaches are possible

including round robin and weighted round robin.

Figure 17: Redirection mechanism for HTTP requests


Figure 17 illustrates the steps a web request goes through until the client gets a response back from

the HTTP server. The web user types a web request into the web browser (1). The DNS server

resolves the address and returns the IP address of the server, which in this case is the address of the

redirectional server (2). When the request arrives at the redirectional server (3), it is examined and

forwarded to the appropriate HTTP server (4,5). The HTTP server processes the request and replies to

the web client (6).
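A minimal sketch of the redirection step (the paths, port, and host names are hypothetical; the actual prototype modifies the web server software rather than using code of this form) has the redirectional server look up which HTTP server holds the requested content and answer with an HTTP redirection message carrying the new URL:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical mapping from content categories to the HTTP servers that
    # store them, as partitioned by the cluster administrator.
    CONTENT_MAP = {
        "/stocks/": "http://server1.example.com",
        "/weather/": "http://server2.example.com",
        "/movies/": "http://server3.example.com",
    }

    class RedirectionalServer(BaseHTTPRequestHandler):
        def do_GET(self):
            for prefix, target in CONTENT_MAP.items():
                if self.path.startswith(prefix):
                    # Standard HTTP redirection: the browser follows the new
                    # URL transparently and fetches the file from that server.
                    self.send_response(302)
                    self.send_header("Location", target + self.path)
                    self.end_headers()
                    return
            self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8080), RedirectionalServer).serve_forever()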

The authors implement load balancing by having each HTTP server report its load periodically to a

load monitoring coordinator. If the load on a particular server exceeds a certain threshold, the load

balancing procedure is triggered. Some portions of the content on the overloaded server are then

moved to another server with lower load. Next, the redirection information is updated in all

redirection servers to reflect the data move.

The authors implemented a prototype of the redirection-based server architecture using one

redirectional server and three HTTP servers. Measurements using the WebStone [86] benchmark

demonstrate that the throughput scales up with the number of machines added. Measurements of

connection times to various sites on the Internet indicate that the additional connection to the redirection server accounts for roughly a 20% increase in latency [74].

This architecture is implemented using COTS hardware and server software. Web clients see a single

logical web server without knowing the actual location of the data, or the number of current servers

providing the service. The administrator of the system partitions the document store among the available cluster nodes using tedious and manual procedures; the architecture does not provide dynamic load balancing. Rather, it requires the intervention of a system administrator to move data to a different server (or servers) and to update the redirection rules manually.

One important characteristic of the implementation is the size of the mapping table. The redirection server stores the redirection information in a table that is created when the server starts and is kept in main memory. This mapping table is searched on every access to the redirection server. If the table grows too large, it increases the time spent searching on the redirection server.

The architecture assumes that all HTTP servers have disk storage, which is not very realistic as many

real deployments take advantage of diskless nodes and network storage. The maintenance and update

of all copies of the data is difficult. In addition, web requests require an additional connection

between the redirection server and the HTTP server.


Other drawbacks of the architecture include the lack of redundancy at the main redirectional server.

The authors did not focus on incorporating high availability capabilities within the architecture. In

addition, since the architecture assumes a single redirectional server, there was no effort to investigate

a single IP interface to hide all the redirectional servers. As a result, the redirectional server poses a

SPOF and limits the performance and scalability of the architecture. Furthermore, the authors did not

investigate the scaling limitation of the architecture. Overall, the architecture promises a limited level

of scalability.

We classify the main inputs from this project into four essential points. First, the research provided us

with a confirmation that a distributed architecture is the right way to proceed forward. A distributed

architecture allows us to add more servers to handle the increase in traffic in a transparent fashion.

The second input is the concept of specialization. Although very limited in this study, node

specialization can be beneficial where different nodes within the same cluster handle different traffic

depending on the application running on the cluster nodes. The third input to our work relates to load

balancing and moving data between servers. The redirectional architecture achieves load balancing by

manually moving data to different servers, and then updating the redirection information stored on the

redirectional server. This load balancing scheme is an interesting concept for small configurations;

however, it is not practical for large web clusters and we do not consider this approach for our

architecture. The fourth input to our work is the need for a dynamic traffic distribution mechanism

that is efficient and lightweight.

2.10.2 Scheduling Policies for Scalable Web Clusters

The researchers at the University of Roma are investigating the design of efficient and scalable

scheduling algorithms for web dispatchers. The research investigates clusters as a scalable web server platform. The researchers argue that one of the main goals for scalability of a distributed web system is

the availability of a mechanism that optimally balances the load over the server nodes [79]. Therefore,

the project focuses on traditional and new algorithms that allow scalability of web server farms

receiving peak traffic. The project recognizes that clustered systems are leading architectures for

building web sites that require guaranteed scalable services when the number of users grows

exponentially. The project defines a web farm as a web site that uses two or more servers housed

together in a single location to handle requests [79].


Figure 18: The web farm architecture with the dispatcher as the central component

Figure 18 presents the architecture of the web cluster with n servers connected to the same local

network and providing service to incoming requests. The dispatcher server connects to the same

network as the cluster servers, provides an entry point to the web cluster, and retains transparency of

the distributed architecture for the users [84]. The dispatcher receives the incoming HTTP requests

and distributes them to the back-end cluster servers.

Although web clusters consist of several servers, all servers use one site hostname to provide a single

interface to all users. Moreover, to have a mechanism that controls the totality of the requests

reaching the site and to mask the service distribution among multiple back-end servers, the web

server farm provides a single virtual IP address that corresponds to the address of front-end server(s).

This entity is the dispatcher that acts as a centralized global scheduler that receives incoming requests

and routes them among the back-end servers of the web cluster. To distribute the load among the web

servers, the dispatcher identifies uniquely each server in the web cluster through a private address.

The researchers argue that the dispatcher cannot use highly sophisticated algorithms for traffic

distribution because it has to take fast decisions for hundreds of requests per second. Static algorithms

are the fastest solution because they do not rely on the current state of the system at the time of

making the distribution decision. Dynamic distribution algorithms have the potential to outperform

static algorithms by using some state information to help dispatching decisions. However, they

require a mechanism that collects, transmits, and analyzes that information, thereby incurring overhead.


The research project considered three scheduling policies that the dispatcher can execute [84]:

random (RAN), round robin (RR) and weighted round robin (WRR). The project does not consider

sophisticated traffic distribution algorithms to prevent the dispatcher from becoming the primary

bottleneck of the web farm.
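For reference, a hedged sketch of the weighted round robin idea (the weights and server names are hypothetical and are not taken from [84]): each server receives a share of the requests proportional to its assigned weight, which typically reflects its capacity:

    import itertools

    # Hypothetical capacities: server-2 is assumed twice as powerful as the
    # others, so it is given twice the weight and twice the requests.
    WEIGHTS = {"server-1": 1, "server-2": 2, "server-3": 1}

    # Expand the weighted pool into a repeating schedule (one slot per unit
    # of weight) and cycle through it forever.
    _schedule = itertools.cycle(
        [name for name, weight in WEIGHTS.items() for _ in range(weight)]
    )

    def next_server():
        """Return the server that should receive the next request."""
        return next(_schedule)

    # Eight successive calls yield: server-1, server-2, server-2, server-3,
    # server-1, server-2, server-2, server-3; that is, a 1:2:1 traffic split.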

Based on modeling simulations, the project observed that bursts of arrivals and skewed service times

alone do not motivate the use of sophisticated global scheduling algorithms. Instead, an important

feature to consider for the choice of the dispatching algorithm is the type of services provided by the

web site. If the dispatcher mechanism has a full control on client requests and clients require HTML

pages or submit light queries to a database, the system scalability is achieved even without

sophisticated scheduling algorithms. In these instances, straightforward static policies are as effective

as their more complex dynamic counterparts are. Scheduling based on dynamic state information

appears to be necessary only on sites where the cost of the majority of client requests is three or more orders of magnitude higher than that of serving a static HTML page with some embedded objects.

The project observes that for web sites characterized with a large percentage of static information, a

static dispatching policy such as round robin provides a satisfactory performance and load balancing.

Their interpretation for this result is that a light-medium load is implicitly balanced by a fully

controlled circular assignment among the server nodes that is guaranteed by the dispatcher of the web

farm. When the workload characteristics change significantly, so that very long services dominate,

the system requires dynamic routing algorithms such as WRR to achieve a uniform distribution of the

workload and a more scalable web site. However, in high traffic web sites, dynamic policies become

a necessity.

The researchers did not prototype the architecture into a real system or run benchmarking tests on it

to validate the performance, scalability, and high availability. In addition, the project did not design or

prototype new traffic distribution algorithms for web servers; instead, it relied on existing distribution

algorithms such as the DNS routing and RAN, RR, and WRR distribution. The architecture presents

several single points of failure. In the event of the dispatcher failure, the cluster becomes unreachable.

Furthermore, if a cluster node becomes unavailable, there is no mechanism in place to notify the

dispatcher of the failure of individual nodes. Moreover, the dispatcher presents a bottleneck to the

cluster when under heavy load of traffic.


The main input from this project is that dynamic routing algorithms are a core technology to achieve a

uniform distribution of the workload and to reach a scalable web cluster. The key is in the simplicity of

the dynamic scheduling algorithms.

2.10.3 Scalable World Wide Web Server

The Scalable Web server (SWEB) project grew out of the needs of the Alexandria Digital Library

(ADL) project [85] at the University of California at Santa Barbara [78], whose server has the potential to become a bottleneck in delivering digitized documents over the high-speed Internet. For web-based network

information systems such as digital libraries, the servers involve much more intensive I/O and

heterogeneous processor activities. The SWEB project investigates the issues involved in developing

a scalable web server on a cluster of workstations and parallel machines [78]. The objective of the

project is to strengthen the processing capabilities of such servers by utilizing the power of computers

to match the huge demand in simultaneous access requests from the Internet. The project aims to

demonstrate how to utilize inexpensive commodity networks, heterogeneous workstations, and disks

to build a scalable web server, and to attempt to develop dynamic scheduling algorithms for

exploiting task and I/O parallelism adaptive to run time change of system resource load and

availability. The scheduling component of the system actively monitors the usages of processors, I/O

channels, and the interconnection network to distribute effectively HTTP request across cluster nodes.

Figure 19: The SWEB architecture


Figure 19 illustrates the SWEB architecture. The DNS routes user requests to the SWEB processors using round robin distribution. The DNS assigns the requests without consulting the dynamically changing system load information. Each processor in the SWEB architecture contains a scheduler, and the SWEB processors collaborate with each other to exchange system load information. After the DNS sends a request to a processor, the scheduler on that processor decides whether to process the request or assign it to another SWEB processor. The architecture uses URL redirection to achieve re-assignment. The SWEB architecture does not allow SWEB servers to redirect HTTP requests more than once, in order to avoid the ping-pong effect.
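A minimal sketch of this re-assignment step, assuming hypothetical function and marker names rather than the SWEB source code, is shown below; a node either serves the request itself or issues a single HTTP redirect, and a request that has already been redirected is always served locally:

# Minimal sketch of SWEB-style single-redirect re-assignment (assumed names, not the SWEB source).
REDIRECT_MARKER = "swebredir"     # hypothetical query parameter marking an already-redirected request

def handle_request(request, my_name, choose_server):
    """Serve the request locally, or redirect it at most once to a better-suited node."""
    target = choose_server(request)              # consult load information (hypothetical callback)
    if target == my_name or REDIRECT_MARKER in request["query"]:
        return "200 OK", "served by %s" % my_name
    # URL redirection: the client re-issues the request to the chosen node; the marker
    # prevents that node from redirecting it again (no ping-pong).
    location = "http://%s%s?%s=1" % (target, request["path"], REDIRECT_MARKER)
    return "302 Found", location

if __name__ == "__main__":
    req = {"path": "/maps/tile42.gif", "query": ""}
    print(handle_request(req, "node-1", lambda r: "node-3"))
    req_redirected = {"path": "/maps/tile42.gif", "query": "swebredir=1"}
    print(handle_request(req_redirected, "node-3", lambda r: "node-1"))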

Figure 20: The functional modules of a SWEB scheduler in a single processor

Figure 20 illustrates the functional structure of the SWEB scheduler. The SWEB scheduler contains an HTTP daemon based on the source code of the NCSA HTTP server [76] for handling HTTP requests, in addition to the broker module that determines the best possible processor to handle a given request. The broker consults two other modules, the oracle module and the loadd module. The oracle module is a miniature expert system that uses a user-supplied table characterizing the processor and disk demands of a particular task. The loadd module is responsible for periodically updating the system processor, network, and disk load information (every 2 to 3 seconds), and for marking as unavailable the processors that have not responded within the time limit. When a processor leaves or joins the resource pool, the loadd module is aware of the change as long as the processor appears in the original list of processors that was set up by the administrator of the SWEB system.
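The loadd behavior described above can be approximated with the following sketch (Python; the data structures and the unavailability threshold are assumptions, not the SWEB implementation): load reports refresh the directory, and a processor that has not reported within the time limit is treated as unavailable.

import time

REPORT_INTERVAL = 3.0                  # seconds between load exchanges (the project cites 2 to 3 seconds)
TIME_LIMIT = 3 * REPORT_INTERVAL       # assumed threshold before a processor is considered unavailable

class LoadDirectory:
    """Tracks the last load report received from each processor in the administrator-supplied list."""

    def __init__(self, processors):
        self.reports = {p: {"load": None, "last_seen": 0.0} for p in processors}

    def record(self, processor, load, now=None):
        # Called whenever a load report arrives from a processor in the original list.
        now = time.time() if now is None else now
        self.reports[processor] = {"load": load, "last_seen": now}

    def available(self, now=None):
        # Processors that reported recently enough are considered part of the resource pool.
        now = time.time() if now is None else now
        return [p for p, r in self.reports.items()
                if r["load"] is not None and now - r["last_seen"] <= TIME_LIMIT]

if __name__ == "__main__":
    loadd = LoadDirectory(["p1", "p2", "p3"])
    loadd.record("p1", load=0.2)
    loadd.record("p2", load=0.7)
    print(loadd.available())           # p3 never reported, so it is treated as unavailable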

The SWEB architecture investigates several concepts. It supports a limited form of dynamism by monitoring the processor and disk usage on each processor. The loadd module collects processor and disk


usage information and feeds it back to the broker so that it can make better distribution decisions. The drawback of this mechanism is that it does not report available memory as part of the metrics, even though memory is as important as the processor information; instead, it reports local disk information for an architecture that relies on a network file system for storage.

The SWEB architecture does not provide high availability features, making it vulnerable to single points of failure. The oracle module expects as input from the administrator a list of processors in the SWEB system and the processor and disk demands of a particular task. It is not able to collect this information automatically. As a result, the administrator of the cluster must intervene every time a processor is added or removed.

The SWEB implementation modified the source code of the web server and created two additional software modules [78]. The implementation is not flexible and does not allow using those modules outside the SWEB-specific architecture.

The researchers benchmarked the SWEB architecture, built using a maximum of four processors, with an in-house benchmarking tool rather than a standardized tool such as WebBench with a standardized workload. The results of the tests demonstrate a maximum of 76 requests per second for a 1 KB request size, and 11 requests per second for a 1.5 MB request size, which is low compared to our initial benchmarking results (Section 3.7).

The project contributes to our work by demonstrating how to actively monitor the usage of processors, I/O channels, and the network load. This information allows us to distribute HTTP requests effectively across cluster nodes. Furthermore, the concept of a web cluster without master nodes, in which the cluster nodes themselves provide the services that master nodes usually provide, is very interesting.

2.10.4 University of Nebraska-Lincoln: LSMAC and LSNAT

This research project recognizes that server responsiveness and scalability are important in client/server network environments [13]. The researchers consider clusters that use commodity hardware as an alternative to expensive specialized hardware for building scalable web servers. The project investigates two server infrastructures, namely MAC-based dispatching (LSMAC) and IP-based dispatching (LSNAT), and focuses on the single IP interface approach [44]. The resulting implementations are LSMAC and LSNAT, respectively. LSMAC dispatches each incoming packet by directly modifying its media access control (MAC) address.


Figure 21 presents the LSMAC approach [80]. A client sends an HTTP packet (1) with A as the destination IP address. The immediate router sends the packet to the dispatcher at IP address A (2). Based on the load sharing algorithm and the session table, the dispatcher decides that this packet should be handled by the back-end server, Server 2, and sends the packet to Server 2 by changing the MAC address and forwarding it (3). Server 2 accepts the packet and replies directly to the client (4).

Figure 21: The LSMAC implementation

Figure 22: The LSNAT implementation

Figure 22 illustrates the LSNAT approach [73][80]. The LSNAT implementation follows RFC 2391

[48]. A client (1) sends an HTTP packet with A as the destination IP address. The immediate router

sends the packet to the dispatcher (2) on A, since the dispatcher machine is assigned the IP address A.

Based on the load sharing algorithm and the session table, the dispatcher decides that this packet

should be handled by the back-end server, Server 2. It then rewrites the destination IP address to that of Server 2, recalculates the IP and TCP checksums, and sends the packet to Server 2 (3). Server 2


accepts the packet (4) and replies to the client via the dispatcher, which the back-end servers see as a gateway. The dispatcher rewrites the source IP address of the reply packet as A, recalculates the IP and TCP checksums, and sends the packet to the client (5).
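The per-packet work of the LSNAT dispatcher can be summarized with the sketch below (Python; the packet representation and function names are assumptions, and the checksum computation is only a placeholder): the destination is rewritten on the way in, and the source is rewritten back to the virtual address on the way out.

# Minimal sketch of LSNAT-style address rewriting (assumed packet representation, checksums elided).
VIRTUAL_IP = "A"

def recompute_checksums(packet):
    # Placeholder standing in for recalculating the IP and TCP checksums after rewriting.
    return hash(tuple(sorted((k, v) for k, v in packet.items() if k != "checksum")))

def to_server(packet, session_table, choose_server):
    """Inbound direction: rewrite the destination address to the chosen back-end server."""
    key = (packet["src_ip"], packet["src_port"])
    server = session_table.setdefault(key, choose_server())
    rewritten = dict(packet, dst_ip=server)
    rewritten["checksum"] = recompute_checksums(rewritten)
    return rewritten

def to_client(packet):
    """Outbound direction: rewrite the source address back to the virtual address A."""
    rewritten = dict(packet, src_ip=VIRTUAL_IP)
    rewritten["checksum"] = recompute_checksums(rewritten)
    return rewritten

if __name__ == "__main__":
    sessions = {}
    pkt = {"src_ip": "client", "src_port": 1234, "dst_ip": VIRTUAL_IP, "checksum": 0}
    inbound = to_server(pkt, sessions, choose_server=lambda: "server-2")
    print(inbound["dst_ip"], to_client(inbound)["src_ip"])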

The dispatcher in both approaches, LSMAC and LSNAT, is not highly available and presents a SPOF that can lead to service discontinuity. The work did not focus on providing high availability capabilities; therefore, in the event of a node failure, the failed node continues to receive traffic. Moreover, the architecture does not support scaling the number of servers. The largest setup tested was a cluster of four nodes, and the authors did not demonstrate the scaling capabilities of the proposed architecture beyond four nodes [73][80]. The performance measurements were performed using the benchmarking tool WebStone [86]. The LSMAC implementation running on a four-node cluster averaged 425 transactions per second per traffic node, and the LSNAT implementation running on a four-node cluster averaged 200 transactions per second per traffic node [79].

Furthermore, the architecture does not provide adaptive, optimized distribution. The dispatcher takes into consideration neither the load of the traffic nodes nor their heterogeneous nature to optimize its traffic distribution. It assumes that all the nodes have the same hardware characteristics, such as the same processor speed and memory capacity.

2.10.5 Harvard Array of Clustered Computers (HACC)

The goal of the HACC project is to design and develop a cluster architecture for scalable and cost-effective web servers [81][82]. In [81], the authors discuss the approach of placing a router called an IP Sprayer between the Internet and a cluster of web servers.

Figure 23 illustrates the architecture of the web cluster with the IP Sprayer. The IP Sprayer is responsible for distributing incoming web traffic evenly between the nodes of the cluster. A number of commercial products, such as the Cisco Local Director [50] and the F5 Networks Big IP [87], employ this approach to distribute web site requests to a collection of machines, typically in a round robin fashion. The HACC architecture focuses on locality enhancement, by dividing the document store among the cluster nodes, and on dynamic traffic distribution. Rather than distributing requests in a round robin fashion, the HACC smart router distributes requests so as to enhance the inherent locality of the requested documents in the server cluster.


Figure 23: The architecture of the IP sprayer

Figure 24: The architecture with the HACC smart router

Figure 24 illustrates the concept of the HACC smart router. Instead of being responsible for the entire

working set, each node in the cluster is responsible for only a fraction of the document store. The size

of the working set of each node decreases each time we add a node to the cluster, resulting in a more

efficient use of resources per node. The smart router uses an adaptive scheme to tune the load

presented to each node in the cluster based on that node’s capacity, so that it can assign each node a

fair share of the load. The idea of HACC bears some resemblance to the affinity-based scheduling schemes for shared-memory multiprocessor systems [88][89], which schedule a task on the processor where its relevant data already resides.


2.10.5.1 HACC Implementation

The main challenge in realizing the potential of the HACC design is building the Smart Router, and

within the Smart Router, designing the adaptive algorithms that direct requests to the cluster nodes based on the locality properties and capacity of the nodes [81].

The smart router implementation consists of two layers: the low smart router (LSR) and the high

smart router (HSR). The LSR corresponds to the low-level kernel resident part of the system and the

HSR implements the high-level user-mode brain of the system. The authors conceived this

partitioning to create a separation of mechanism and policy, with the mechanism implemented in the

LSR and the policy implemented in the HSR.

The Low Smart Router: The LSR encapsulates the networking functionality. It is responsible for

TCP/IP connection setup and termination, for forwarding requests to cluster nodes, and for forwarding the results back to clients. The LSR listens on the web server port for a connection request. When the LSR receives a connection request, TCP passes to the LSR a buffer containing the HTTP request. The

HSR extracts and copies the URL from the request. The LSR queues all data from this incoming

request and waits for the HSR to indicate which cluster node should handle the request. When the

HSR identifies the node, the LSR establishes a connection with it and forwards the queued data over

this connection. The LSR continues to ferry data between the client and the cluster node serving the

request until either side closes the connection.

The High Smart Router: The HSR monitors the state of the document store, the nodes in the cluster,

and properties of the documents passing through the LSR. It uses this information to decide how to

distribute requests over the HACC cluster nodes. The HSR maintains a tree that models the structure

of the document store. Leaves in the tree represent documents and nodes represent directories. As the

HSR processes requests, it annotates the tree with information about the document store to be applied in load balancing. This information could include node assignment, document sizes, request latency

for a given document, and in general, sufficient information to make an intelligent decision about

which node in the cluster should handle the next document request. When a request for a particular

file is received for the first time, the HSR adds nodes representing the file and newly reached

directories to its model of the document store, initializing the file’s node with its server assignment.

In the current prototype, new incoming documents are assigned to the least loaded server node. After the first request for a document, subsequent requests go to the same server and thus improve the locality of reference.
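The assignment policy described above can be sketched as follows (Python; the data structures are assumptions, not the HACC code): the first request for a document pins it to the least loaded node, and subsequent requests follow that assignment to preserve locality.

# Minimal sketch of HACC-style locality-aware assignment (assumed structures, not the HACC code).
class SmartRouterModel:
    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}    # simplified per-node load metric
        self.assignment = {}                 # document URL -> node responsible for it

    def route(self, url):
        node = self.assignment.get(url)
        if node is None:
            # First request for this document: assign it to the least loaded node.
            node = min(self.load, key=self.load.get)
            self.assignment[url] = node
        self.load[node] += 1                 # stand-in for the real load feedback
        return node

if __name__ == "__main__":
    router = SmartRouterModel(["node-1", "node-2", "node-3"])
    for url in ["/a.html", "/b.html", "/a.html", "/c.html", "/a.html"]:
        print(url, "->", router.route(url))   # repeated URLs keep going to the same node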


Dynamic Load Balancing: Dynamic load balancing is implemented using Windows NT’s

performance data helper (PDH) interface [90]. The PDH interface allows collecting a machine’s

performance statistics remotely. When the smart router initializes, it spawns a performance

monitoring thread that collects performance data from each cluster node at a fixed interval. The HSR

then uses the performance data for load balancing in two ways. First, it identifies the least loaded node and assigns new requests to it. Second, when a node becomes overloaded, the HSR tries to offload a portion of the documents for which the overloaded node is responsible to the least loaded node.
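A sketch of that rebalancing step, under the simplifying assumption that each node's load is a single utilization figure (the PDH collection details are omitted), might look as follows:

# Sketch of the offloading decision, assuming a single load figure per node (PDH details omitted).
OVERLOAD_THRESHOLD = 0.8       # assumed utilization level that triggers offloading

def rebalance(load_by_node, docs_by_node, fraction=0.25):
    """Move a portion of the documents owned by an overloaded node to the least loaded one."""
    busiest = max(load_by_node, key=load_by_node.get)
    if load_by_node[busiest] < OVERLOAD_THRESHOLD:
        return docs_by_node                          # no node is overloaded; nothing to do
    target = min(load_by_node, key=load_by_node.get)
    count = max(1, int(len(docs_by_node[busiest]) * fraction))
    moved, docs_by_node[busiest] = docs_by_node[busiest][:count], docs_by_node[busiest][count:]
    docs_by_node[target].extend(moved)
    return docs_by_node

if __name__ == "__main__":
    loads = {"node-1": 0.9, "node-2": 0.3, "node-3": 0.5}
    docs = {"node-1": ["/a", "/b", "/c", "/d"], "node-2": ["/e"], "node-3": ["/f", "/g"]}
    print(rebalance(loads, docs))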

2.10.5.2 HACC Evaluation and Lessons Learned

There are several drawbacks to the HACC architecture. The architecture is vulnerable to SPOF, has limited performance and scalability, and suffers from various challenges in aspects such as data distribution, in addition to specific implementation issues. Firstly, the smart router presents itself as a SPOF. In the HACC architecture, if the smart router becomes unavailable because of a software or hardware error, the HACC cluster becomes unusable. Secondly, the performance and scalability of the HACC cluster with the smart router are limited. The benchmarking tests demonstrate that the prototype of the smart router is capable of

handling between 400 and 500 requests per second (requests are of size 8 KB) [81]. These results are

half the performance that our early prototype achieved with its worst-case scenarios, using the round

robin distribution scheme (Section 3.7). The limited performance of the smart router suggests that the

HACC cluster cannot scale because of the limitation imposed by the bottleneck at the smart router

level. In addition, the prototype of the HACC architecture consists of one node acting as the smart

router and three nodes acting as cluster nodes, suggesting a limited configuration. Furthermore, the

HACC architecture relies on distributing data among the different cluster nodes; we cannot perform

the distribution while the HACC cluster is in operation. This approach is not realistic for systems that

are in operation and require an update to their data store online. In addition, the tree-structured name

space only works for the case when the structure of the document store is hierarchical. Moreover, the

Keep-Alive feature of HTTP poses some potential problems for the Smart Router. If we enable the Keep-Alive option, the browser reuses the TCP connection for subsequent requests that the smart router does not intercept, which interferes with the traffic distribution decisions of the smart router.


2.10.6 IBM Scalable and Highly Available Web Server

IBM Research is investigating the concept of a scalable and highly available web server that offers

web services via a Scalable Parallel (SP-2) system, a cluster of RS/6000 workstations. The goal is to

support a large number of concurrent users, high bandwidth, real time multimedia delivery, fine-

grained traffic distribution, and high availability. The server will provide support for large multimedia

files such as audio and video, real time access to video data with high access bandwidth, fine-grained

traffic distribution across nodes, as well as efficient back-end database access. The project is focusing

on providing efficient traffic distribution mechanisms and high availability features. The server

achieves traffic distribution by striping data objects across the back-end nodes and disks. It achieves

high availability by detecting node failures and reconfiguring the system appropriately. However,

there is no mention of the time to detect the failure and to recover.

Figure 25: The two-tier server architecture

Figure 25 illustrates the architecture of the web cluster. The architecture consists of a group of nodes connected by a fast interconnect. Each node in the cluster has a local disk array attached to it. The disks of a node can either maintain a local copy of the web documents or share them among the nodes.


The nodes of the cluster are of two types: front-end (delivery) nodes and back-end (storage) nodes. The round robin DNS is used to distribute incoming requests from the external network to the front-end nodes, which also run httpd daemons. The logical front-end node then forwards the required command to the back-end nodes that hold the data (document), using a shared file system. Next, the results are sent back to the front-end nodes through the switch and then transmitted to the user. The front-end nodes run the web daemons and connect to the external network. To balance the load among them, clients are spread across the front-end nodes using RR DNS [51]. All the front-end nodes are assigned a single logical name, and the RR DNS maps the name to multiple IP addresses.
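The round robin DNS mapping amounts to rotating the list of front-end addresses returned for the single logical name, as in this minimal sketch (Python; the addresses are placeholders for illustration):

from itertools import cycle

# Hypothetical front-end addresses behind a single logical name (illustration only).
FRONT_ENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"]
_rotation = cycle(range(len(FRONT_ENDS)))

def resolve(logical_name):
    """Return the address list rotated so that successive lookups start at different front ends."""
    start = next(_rotation)
    return FRONT_ENDS[start:] + FRONT_ENDS[:start]

if __name__ == "__main__":
    for _ in range(3):
        print(resolve("www.example.org"))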

Figure 26: The flow of the web server router

Figure 26 illustrates another approach for achieving traffic distribution. One or more nodes of the cluster serve as TCP routers, forwarding client requests to the different front-end nodes in the cluster in round robin order. The name and IP address of the router are public, while the addresses of the other nodes in the cluster are private. If there is more than one router node, a single name is used and the round robin DNS maps the name to the multiple router nodes. The flow of the web server router (Figure 26) is as follows. When a client sends requests to the router node (1), the router node forwards (2) all packets belonging to a particular TCP connection to one of the front-end server nodes. The router can use different algorithms to select which node to route to, or use a round robin scheme. The server nodes reply directly to the client (3) without going through the router. However, the server nodes change the source


address on the packets sent back to the client to be that of the router node. The back-end nodes host

the shared file system used by the front-ends to access the data.

There are several main drawbacks preventing the architecture from achieving a scalable and highly available web cluster: limited traffic distribution performance, limited scalability, lack of high availability capabilities, the presence of several SPOF, and the lack of a dynamic feedback mechanism.

The architecture relies on round robin DNS to distribute traffic among server nodes. The scheme is

static, does not adjust based on the load of the cluster nodes, and does not accommodate the

heterogeneous nature of the cluster nodes. The authors proposed an improved traffic distribution

mechanism [83] that involves changing packet headers but still relies on round robin DNS to

distribute traffic among the router nodes. The concept was prototyped with four front-end nodes and four back-end nodes. The project did not demonstrate whether the architecture is capable of scaling beyond four traffic nodes, or whether failures at the node level are detected and accommodated dynamically. The architecture does not provide features that allow service continuity. The switch as

shown in Figure 25 and Figure 26 is a SPOF. The network file system where data resides is also

vulnerable to failures and presents another SPOF. Furthermore, the architecture does not support a

dynamic feedback loop that allows the router to forward traffic depending on the capabilities of each

traffic node.

2.10.7 Discussion

The surveyed projects share some common results and conclusions. Current server architectures do

not provide the needed scalability to handle large traffic volumes and a large number of web users.

There seems to be a consensus in the surveyed literature on the need to design a new server architecture that is able to meet our vision for next-generation Internet servers. A distributed architecture that consists of independent servers sharing the load is more appropriate than single-server architectures for implementing a scalable web server.

Surveyed projects such as [63], [64], [65], [68], [72], [91], [92], [93], [94], and [95] discuss how using clustering technologies helps increase the performance and scalability of the web server. Clustering is the dominant technology that will help us achieve better scalability and higher performance. The focus is not on clustering as a technology; rather, the focus is on using this technology as a means to achieve scalability, high capacity, and high availability. Several of the


surveyed projects such as [68], [79], [81], and [82], focused on providing efficient traffic distribution

mechanisms and providing high availability features that enable continuous service availability. However, current benchmarking results show that adding high availability and fault tolerance features negatively affects the performance and scalability of the cluster. Traffic distribution is an essential aspect of achieving a highly scalable platform. Hardware-based traffic distribution solutions are not scalable; they constitute both a performance bottleneck and a SPOF.

There is a clear need to provide a single software virtual IP layer that hides the cluster nodes and makes them transparent to end users. There is also a need for an unsophisticated design and, consequently, a simple implementation. An uncomplicated design allows a smooth integration of many components into a well-defined architecture. The benefits are a lightweight design and a faster and more robust system. One interesting observation is that the surveyed works, with the exception of the HACC project [81], use the Linux operating system to prototype and implement Internet and web servers.

A highly available and scalable web cluster requires a combination of a smart and efficient traffic distribution mechanism that distributes incoming traffic from the cluster interface to the least busy nodes in the cluster based on a dynamic feedback loop. The distribution mechanism and the cluster interface should present neither a bottleneck nor a SPOF. To scale such a cluster, we would like the ability to add nodes to the cluster without disrupting the provided services, and the capability to increase the number of nodes to meet traffic demand while achieving close to linear scalability.


Chapter 3 Preparatory Work

This chapter describes the preparatory work conducted as part of the early investigations. It describes

the prototyped web server cluster, presents the benchmarking environment, and reports the

benchmarking results.

3.1 Early Work

As part of the early experimental work, I have studied the HTTP protocol [96] and the Apache web

server as case studies [97][98]. These studies resulted in two technical reports, [99] and [100]. Apache is the most popular web server software in the world, running over 60% of the web servers [101], and is licensed under the GNU General Public License (GPL) [102], which grants users full access to the source code. We divide the experimental work into two tasks: understanding and building a web

server, and implementing a performance measurement tool that is able to execute performance and

scalability tests on web servers. The first task involved understanding the working of web servers, in

addition to setting up a web server at the research lab in Concordia. This task was a necessary step

leading to the study of web server design and performance. This work resulted in three published

technical reports: [96], [99], and [100]. The second task consisted of designing and implementing the

X-Based Web Servers Performance Tool (XWPT) that generates web traffic to a web server and

collects performance metrics such as the number of successful results per second and the throughput

[103].

3.2 Description of the Prototyped Web Cluster

The goal of this exercise was to provide a test platform for a highly available and scalable web server

cluster, and to benchmark a running web cluster to learn how the cluster scales, versus simulating

traffic into a model. Figure 27 illustrates the prototyped web cluster. The cluster consists of 14

Pentium III processors running at 500 MHz with 512 MB of RAM each. The cluster processors (also

called nodes) run the Linux operating system [22]. Each server has six Ethernet ports: two onboard

ports and four ports offered on an auxiliary CompactPCI card for redundancy purposes. Each of the


eight processors has three SCSI disks with a capacity of 54 GB in total; the other eight processors are

diskless. All processors have access to a common storage volume provided by the master nodes.

Although we used NFS, we also experimented with the Parallel Virtual File System (PVFS) to provide a shared disk space among all nodes. The PVFS supports high performance I/O over the web

[91][92][105][106]. To provide high availability, we implemented redundant Ethernet connections,

redundant network file systems, in addition to software RAID [104]. The LVS [23] provides a single

IP interface for the cluster and an HTTP traffic distribution mechanism among the servers in the cluster. As for the web server software, we ran Apache release 2.08 [24] and Tomcat release 3.1 [25] on the traffic nodes. Other studies and benchmarking we performed in this area include [99], [100], [103], [107], and [108].

Figure 27: The architecture of the prototyped web cluster

As for the local network, all cluster nodes are interconnected using redundant dedicated links. Each

node connects to two networks through two Ethernet ports over two redundant network switches.

External network access is restricted to master nodes unless traffic nodes require direct access, which

is configurable for some scenarios such as direct routing. The dotted connection indicates that the

processor connects to LAN 1. The solid connection indicates that the processor connects to LAN 2.

The components and parameters of the prototyped system include two master nodes and 12 traffic

nodes, a traffic distribution mechanism, a storage sub-system, and local and external networks. The

master nodes provide a single entry point to the system through the virtual IP address receiving

incoming web traffic and distributing it to the traffic nodes using the traffic distribution algorithm.


The master nodes provide cluster-wide services such as I/O services through NFS, DHCP, and NTP

services. They respond to incoming web traffic and distribute it among the traffic nodes. The traffic

nodes run the Apache web server and their primary responsibility is to respond to web requests. They

rely on master nodes for cluster services. We used the LVS in different configurations to distribute

incoming traffic to the traffic nodes. Section 3.5 discusses the network address translation and direct

routing methods, and demonstrates their capabilities.

3.3 Benchmarking Environment

We used web server benchmarking to evaluate the performance and scalability of the prototyped web

cluster. Web benchmarking is a combination of standardized tests that pair a mechanism for generating a controlled stream of web requests to the cluster with standard metrics for reporting the results. The

following subsections describe the hardware and software components of the benchmarking

environment.

3.3.1 Hardware Benchmarking Environment

We used 16 Intel Celeron 500 MHz machines, called web client machines, to generate web traffic to

the prototyped web cluster illustrated in Figure 27. Each of the web client machines is equipped with 512 MB of RAM and runs Windows NT. In addition, the benchmarking environment requires a controller machine that is responsible for collecting and compiling the test results from all the web client machines. The controller machine has a Pentium III processor running at 500 MHz with 512 MB of RAM.

3.3.2 Software Benchmarking Environment

The web client machines generate traffic using WebBench [19], a freeware benchmarking tool that

simulates web browsers. When a web client receives a response from the web server, WebBench

records the information associated with the response and immediately sends another request to the

server.

Figure 28 presents the WebBench architecture. WebBench runs on one machine running the controller program and one or more machines, each running the client program. The controller provides the means to

set up, start, stop, and monitor the WebBench tests. It is also responsible for gathering and analyzing

the data reported from the clients. The web client machines execute the WebBench tests and send

requests to the web server.


WebBench supports up to 1,000 web clients generating traffic to a target web server. WebBench

stresses a web server by using a number of test systems to request URLs. We can configure each

WebBench test system to use multiple threads to make simultaneous web server requests. By using

multiple threads per test system, it is possible to generate a large load on a web server to stress it to its

limit with a reasonable number of test systems. Each WebBench test thread sends an HTTP request to

the web server and waits for the reply. When the reply comes back, the test thread immediately makes

a new HTTP request. WebBench makes peak performance measurements that illustrate the limitations

of a web server platform.
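The client behavior described above, where each thread issues a request and immediately issues another as soon as the reply arrives, can be approximated with the short sketch below (Python standard library only; the target URL, thread count, and duration are placeholders, and this is not the WebBench code):

import threading
import time
import urllib.request

TARGET = "http://cluster-vip/index.html"   # placeholder URL standing in for the cluster virtual IP
THREADS = 8                                # placeholder number of worker threads
DURATION = 10                              # seconds to generate load
counts = []

def worker():
    done = 0
    deadline = time.time() + DURATION
    while time.time() < deadline:
        try:
            # Send a request, wait for the reply, and immediately loop to send the next one.
            with urllib.request.urlopen(TARGET, timeout=5) as response:
                response.read()
                done += 1                  # only successful replies are counted
        except OSError:
            pass                           # errors and timeouts are not counted as successes
    counts.append(done)

if __name__ == "__main__":
    threads = [threading.Thread(target=worker) for _ in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("successful requests per second:", sum(counts) / DURATION)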

Figure 28: The architecture of the WebBench benchmarking tool

The workload tree provided by WebBench contains the test files that the WebBench clients access when we execute a test suite. The WebBench workload tree is the result of studying real-world sites such as Microsoft, USA Today, and the Internet Movie Database. The tree uses multiple directories and different directory depths. It contains over 6,200 static pages and executable test files. WebBench provides static (STATIC.TST) and dynamic (WBSSL.TST) test suites. The static suites use HTML and GIF files. The dynamic suites use applications that run on the server.

WebBench keeps all the transaction information at run time and uses this information to compute the final metrics presented when the tests are completed. The standard test suites of WebBench begin with one client and add clients incrementally until they reach a maximum of 60 clients per client machine. WebBench provides numerous standardized test suites. For our testing purposes, we executed a mix of the STATIC.TST (90%) and WBSSL.TST (10%) tests. Each test takes on average two and a half hours to run with this combination of test suites.


3.4 Web Server Performance

In considering the performance of a web server, we should pay special regard to its software,

operating system, and hardware environment, because each of these factors can dramatically

influence the results. In a distributed web server, this environment is complicated further by the

presence of multiple components, which require connection handoffs, process activations, and request

dispatching. A complete performance evaluation of all layers and components of a distributed web server system is very complex, if not impossible. Hence, a benchmarking study needs to clearly define its goals and scope. In our case, the goal is to evaluate the end-to-end performance of a web cluster. Our main interest does not extend to the hardware and operating system, which in many cases are given.

Web server performance refers to the efficiency of a server when responding to user requests

according to defined benchmarks. Many factors affect a server's performance, such as application

design and construction, database connectivity, network capacity and bandwidth, and hardware server

resources. In addition, the number of concurrent connections to the web server has a direct impact on

its performance. Therefore, the performance objectives include two dimensions: the speed of a single

user's transaction and the amount of performance degradation related to the increasing number of

concurrent connections.

Metric Name: Description
Throughput: The rate at which data is sent through the network, expressed in Kbytes per second (KB/s)
Connection rate: The number of connections per second
Request rate: The number of client requests per second
Reply rate: The number of server responses per second
Error rate: The percentage of errors of a given type
DNS lookup time: The time to translate the hostname into the IP address
Connect time: The time interval between sending the initial SYN and the last byte of a client request and the receipt of the first byte of the corresponding response
Latency time: In a network, latency, a synonym for delay, is an expression of how much time it takes for a packet of data to get from one designated point to another
Transfer time: The time interval between the receipt of the first response byte and the last response byte
Web object response time: The sum of latency time and transfer time
Web page response time: The sum of web object response time pertaining to a single web page, plus the connect time
Session time: The sum of all web page response times and user think time in a user session

Table 5: Web performance metrics


Table 5 presents the common metrics for web system performance. In reporting the results of the benchmarking tests, we report the connection rate, the number of successful transactions per second, and the throughput in KB/s.
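The derived metrics in Table 5 combine as simple sums. The small sketch below (Python, with made-up timings used only to show the arithmetic, and assuming every object on the page has the same timings) illustrates the relationships:

# Illustrative timings in seconds (made-up values, used only to show how the metrics combine).
connect_time = 0.005
latency_time = 0.020
transfer_time = 0.080

web_object_response_time = latency_time + transfer_time            # per embedded object
objects_on_page = 5                                                 # assumed page composition
web_page_response_time = objects_on_page * web_object_response_time + connect_time

print(web_object_response_time)   # approximately 0.1 seconds
print(web_page_response_time)     # approximately 0.505 seconds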

3.5 LVS Traffic Distribution Methods

The LVS is an open source project to cluster many servers together into one virtual server. It

implements layer-four switching in the Linux kernel that allows the distribution of TCP and UDP

sessions between multiple real servers. The real servers, also called traffic nodes, interconnect

through either a local area network or a geographically dispersed wide area network. The front-end of

the real servers is the LVS director. The LVS director is the load balancing engine, and it runs on the

master cluster processor(s). It provides IP-level traffic distribution to make parallel services of the

cluster appear as a virtual service on a single IP address. All requests come to the front-end LVS

director, owner of the virtual IP address, and then the LVS director distributes the traffic among the

real servers. LVS provides three implementations of its IP load-balancing techniques based on NAT,

IP tunneling, and DR. We experimented with both the NAT and DR methods. We did not test the IP

tunneling method for two reasons: the implementation of the IP tunneling method is still experimental

and it does not offer an added advantage over the DR method.

The following sub-sections discuss all three methods, with a focus on NAT and DR, and present the

results of benchmarking a cluster with a NAT and then a DR configuration.

3.5.1 LVS via NAT

Network address translation relies on manipulating the headers of Internet protocols appropriately so

that clients believe they are contacting one IP address, and servers at different IP addresses believe

the clients are contacting them directly. LVS uses this NAT feature to build a virtual server; parallel

services at the different IP addresses can appear as a virtual service on a single IP address via NAT. A

hub or a switch interconnects the master node(s), which run LVS, and the real servers. The real

servers usually run the same service and provide access to the same contents, available on a shared

storage device through a distributed file system.

Figure 29 illustrates the process of address translation. The user first accesses the service provided by

the server cluster (1). The request packet arrives at the load balancer through the external IP address.

The load balancer examines the packet's destination address and port number (2). If they match a


virtual server service (according to the virtual server rule table), then the scheduling algorithm, round

robin by default, chooses a real server from the cluster to serve the request, and adds the connection

into the hash table which records all established connections. The load balancer server rewrites the

destination address and the port of the packet to match those of the chosen real server, and forwards

the packet to the real server. The real server processes the request (3) and returns the reply to the load

balancer. When an incoming packet belongs to this connection and the established connection exists

in the hash table, the load balancer rewrites and forwards the packet to the chosen server. When the

reply packets come back from the real server to the load balancer, the load balancer rewrites the

source address and port of the packets (4) to those of the virtual service, and submits the response

back to the client (5). The LVS removes the connection record from the hash table when the

connection terminates or times out.
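The forwarding decision can be modeled with the following sketch (Python; the rule table, addresses, and data structures are illustrative assumptions, not the LVS code): only packets addressed to a virtual service are considered, a new connection is scheduled and recorded in the hash table, and subsequent packets of the same connection reuse the recorded real server.

# Minimal model of the NAT forwarding decision (assumed structures and example addresses, not the LVS code).
virtual_services = {("192.168.1.1", 80)}      # the virtual server rule table (example values)
connections = {}                              # hash table: (client ip, client port, service port) -> real server
real_servers = ["rs1", "rs2", "rs3"]
_next_index = 0

def schedule():
    """Round robin, the default LVS scheduling algorithm."""
    global _next_index
    server = real_servers[_next_index % len(real_servers)]
    _next_index += 1
    return server

def forward(packet):
    if (packet["dst_ip"], packet["dst_port"]) not in virtual_services:
        return None                           # not addressed to a virtual service
    key = (packet["src_ip"], packet["src_port"], packet["dst_port"])
    server = connections.get(key)
    if server is None:                        # new connection: schedule it and record the choice
        server = schedule()
        connections[key] = server
    return server                             # the packet's destination is rewritten to this real server

if __name__ == "__main__":
    pkt = {"src_ip": "client-1", "src_port": 40000, "dst_ip": "192.168.1.1", "dst_port": 80}
    print(forward(pkt), forward(pkt))         # packets of the same connection map to the same real server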

Figure 29: The architecture of the LVS NAT method

3.5.2 LVS via DR

The direct routing method allows the real servers and the load balancer server(s) to share the virtual

IP address. The load balancer has an interface configured with the virtual IP address. The load

balancer uses this interface to accept request packets and directly route them to the chosen real server.

Figure 30 illustrates the architecture of the LVS following the direct routing method. When a user


accesses a virtual service provided by the server cluster (1), the packet destined for the virtual IP

address arrives at the load balancer. The load balancer examines (2) the packet's destination address

and port. If it matches a virtual service, the scheduling algorithm chooses a real server (3) from the

cluster to serve the request and adds the connection into the hash table that records connections. Next,

the load balancer forwards the request to the chosen real server. If new incoming packets belong to

this connection and the chosen server is available in the hash table, the load balancer directly routes

the packets to the real server. When the real server receives the forwarded packet, the server finds that

the packet is for the address on its alias interface or for a local socket, so it processes the request (4)

and returns the result directly to the user (5). The LVS removes the connection record from the hash

table when the connection terminates or times out.

Figure 30: The architecture of the LVS DR method

3.5.3 LVS via IP Tunneling

IP Tunneling allows packets addressed to an IP address to be redirected to another address, possibly

on a different network. In the context of layer four switching, the behavior is very similar to that of

the DR method, except that when packets are forwarded they are encapsulated in an IP packet, rather

than just manipulating the Ethernet frame.


The main advantage of using tunneling is that the real servers (i.e., traffic nodes) can be on a different network. We did not experiment with the IP tunneling method because of the unstable status of the implementation, and because it does not provide additional capabilities over the DR method. However, we present it for completeness.

3.5.4 Direct Routing versus Network Address Translation

We performed a test to benchmark the same web cluster using the NAT approach for traffic distribution and then using the direct routing approach. With direct routing, the LVS load balancer distributes traffic to the cluster processors, which in turn reply directly to the web clients, without going through address translation. We ran the same test on the same cluster running Apache, using different traffic distribution mechanisms.

Figure 31 presents the results in terms of successful requests per second. Following the NAT

approach, the load balancer node saturates at approximately 2,000 requests per second compared to

4,700 requests per second with the DR method.


Figure 31: Benchmarking results of NAT versus DR


In both tests, the bottleneck occurs at the load balancer node, which was unable to accept more traffic and distribute it to the traffic servers. Instead, the LVS director was rejecting incoming connections, resulting in unsuccessful requests. This test demonstrates that the DR approach is more efficient than the NAT approach and allows better performance and scalability. In addition, it demonstrates the bottleneck at the director level of the LVS.

3.6 Benchmarking Scenarios

The goal of the benchmarking activities was to demonstrate performance and scalability issues in clustered web servers, to uncover these issues, and to demonstrate where and when bottlenecks occur and when the system stops scaling. We carried out six benchmark tests, starting with one processor and scaling up to 12 processors. The benchmark of a single server demonstrates the limitations of a single server. We also carried out benchmarking to demonstrate the scalability limitations of the NAT and DR methods.

We performed two sets of benchmarking tests on the prototyped cluster (Figure 27): a benchmark of a single processor to determine the baseline performance, and several benchmarks while scaling the prototyped cluster from two to 12 processors. We performed the first benchmarking test on a single

processor running the Apache web server software. The goal with this test is to reveal the limitations

of a single server. We use the results of this test to define the baseline performance and use it as a

basis for comparison. For the second set of testing, we scaled the cluster to two, four, six, eight, 10,

and 12 traffic nodes at a time. Each node is running its copy of the Apache web server software. The

benchmarking tool, WebBench, generates the web traffic to the LVS IP layer of the cluster that is

running on the master nodes. Since the LVS DR method is more efficient and scalable than its NAT

equivalence (Figure 31), we used it as the distribution mechanism for the incoming traffic. The LVS

load director performs traffic distribution among all the cluster nodes. The goal with this test is to

identify the performance and scaling trend as we add more processors into the cluster.

3.7 Apache Performance Test Results

We conducted benchmarking tests on one processor running Apache to determine the baseline

performance on a single processor. Figure 32 presents the number of web clients on the x-axis and the

number of successful requests per second on the y-axis.



Figure 32: Benchmarking results of the Apache web server running on a single processor


Figure 33: Apache reaching a peak of 5,903 KB/s before the Ethernet driver crashes


Apache served 1,053 requests per second before it suddenly stopped servicing incoming requests; in fact, as far as the WebBench tool could tell, Apache had crashed (Figure 32). We thought that the Apache server had crashed under heavy load. However, that was not the case. The Apache server process was still running when we logged locally into the machine. It turned out that the Ethernet device driver had crashed, causing the processor to disconnect from the network and become unreachable. Figure 33 illustrates the throughput achieved on one processor (5,903 KB/s) before the processor disconnected from the network due to the device driver crash. We investigated the device driver problem, fixed it, and made the updated source code publicly available. We did not face the driver crash problem in further testing. Figure 34 presents the benchmarking results of Apache on a single processor after fixing the device driver problem. Apache served an average of 1,043 requests per second.


Figure 34: Benchmarking results of Apache on one processor – post Ethernet driver update

Next, we set up the cluster in several configurations with two, four, six, eight, 10, and 12 processors and performed the benchmarking tests. In these benchmarks, the LVS forwarded the HTTP traffic to the traffic nodes following the DR distribution method.

Figure 35 presents the results of the benchmarking test we performed on a cluster with two processors

running Apache. The average number of requests per second per processor is 945.



Figure 35: Results of a two-processor cluster (requests per second)


Figure 36: Results of a four-processor cluster (requests per second)


Figure 36 presents the results of the benchmarking test we performed on a cluster with four

processors running Apache. The average number of requests per second per processor is 1,003.


Requests Per Second

Figure 37: Results of eight-processor cluster (requests per second)

Figure 37 presents the results of the benchmarking test we performed on a cluster with eight

processors running Apache. The average number of requests per second per processor is 892.

Table 6 presents the results of Apache benchmarking for all the cluster configurations including the

single standalone node.

Processors in the cluster    Maximum requests per second    Transactions per second per processor
 1                           1,053                          1,053
 2                           1,890                            945
 4                           4,012                          1,003
 6                           5,847                            974
 8                           7,140                            892
10                           7,640                            764
12                           8,230                            685

Table 6: The results of benchmarking with Apache


For each cluster configuration, we present the maximum performance achieved by the cluster in terms

of requests per second; the third column presents the average number of requests per second per

processor. As we add more processors into the cluster, the number of requests per second per

processor decreases (Table 6, 3rd column). Section 3.9 discusses the scalability of the prototyped web

cluster.

3.8 Tomcat Performance Test Results

Similarly, we conducted the benchmarking tests with Tomcat running on one, two, four, six, eight, 10

and 12 processors. Figure 38 presents the results of the benchmarking test we performed on a cluster

with two processors running Tomcat. The average number of requests per second per processor is 76.

Figure 38: Results of Tomcat running on two processors (requests per second)

Figure 39 presents the results of the benchmarking test we performed on a cluster with four processors running Tomcat. The average number of requests per second per processor is 75. Figure 40 presents the results of the benchmarking test we performed on a cluster with eight processors running Tomcat. The average number of requests per second per processor is 71.


Figure 39: Results of a four-processor cluster running Tomcat (requests per second)

Figure 40: Results of an eight-processor cluster running Tomcat (requests per second)


Table 7 presents the results of testing the prototyped cluster running the Tomcat application server. For each cluster configuration, we present the number of processors in the cluster, the maximum performance achieved by the cluster in terms of requests per second, and the average number of requests per second per cluster processor.

Number of processors in the cluster    Cluster maximum requests per second    Transactions per second per processor
 1                                      81                                    81
 2                                     152                                    76
 4                                     300                                    75
 6                                     438                                    73
 8                                     568                                    71
10                                     700                                    70
12                                     804                                    67

Table 7: The results of benchmarking with Tomcat

3.9 Scalability Results

We collected the benchmarking results achieved with both Apache (Table 6) and Tomcat (Table 7) for clusters with one, two, four, six, eight, 10 and 12 processors. For each testing scenario, we recorded the maximum number of requests per second that each configuration can service. When we divide this number by the number of processors, we get the maximum number of requests that each processor can process per second in each configuration.
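To make the computation explicit, the short sketch below (illustrative only; the numbers are copied from Table 6 and Table 7) derives the per-processor rate for each configuration and the relative drop between the one-processor and the 12-processor configurations.

# Illustrative sketch: per-processor throughput and scalability drop,
# using the measured cluster maxima from Table 6 (Apache) and Table 7 (Tomcat).

apache = {1: 1053, 2: 1890, 4: 4012, 6: 5847, 8: 7140, 10: 7640, 12: 8230}
tomcat = {1: 81, 2: 152, 4: 300, 6: 438, 8: 568, 10: 700, 12: 804}

def per_processor(results):
    """Divide each cluster maximum by its processor count."""
    return {n: total / n for n, total in results.items()}

def drop(results):
    """Relative drop of the per-processor rate from 1 to 12 processors."""
    rates = per_processor(results)
    return 1 - rates[12] / rates[1]

for name, results in (("Apache", apache), ("Tomcat", tomcat)):
    rates = per_processor(results)
    print(name, {n: round(r) for n, r in rates.items()},
          "drop: %.0f%%" % (100 * drop(results)))
# Apache: per-processor rate falls from 1053 to roughly 686 (about 35%);
# Tomcat: from 81 to 67 (roughly 17-18%).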

Figure 41 and Figure 42 illustrate the transaction capability per processor plotted against the cluster size. In both figures, the curve is not flat; as we add more processors into the cluster, the number of successful transactions per second per processor drops. In the case of Apache, when the cluster scales from a single processor to 12 processors, the number of successful transactions per second per processor drops by approximately 35% (from 1,053 to 685). In the case of Tomcat, when the cluster scales from a single processor to 12 processors, the number of successful transactions per second per processor drops by approximately 18% (from 81 to 67). In both cases, as we add more processors into the cluster, we experience a decrease in performance and scalability: the more processors we have in the cluster, the less performance we get per processor.


Figure 41: Scalability chart for clusters consisting of up to 12 nodes running Apache

Figure 42: Scalability chart for clusters consisting of up to 12 nodes running Tomcat



3.10 Discussion

Web servers have a limited capacity in serving incoming requests. In the case of Apache, the capacity

limit is around 1,000 requests per second when running on a single processor. Beyond this threshold,

the server starts rejecting incoming requests.

We have demonstrated non-linear scalability with clusters of up to 12 nodes running the Apache and Tomcat web servers. In the case of Apache, for instance, when we scale the cluster from a single processor to 12 processors, the number of successful requests per second per processor drops from 1,053 to 685, a decrease of approximately 35%. These results represent major performance degradation. Ideally, as we add more processors into the cluster, we would like to achieve linear scalability and maintain the baseline performance of approximately 1,000 requests per second per processor.

We experimented with the NAT and DR traffic distribution approaches. The NAT approach, although

widely used, has limited performance and scalability compared to the DR approach as demonstrated

in Section 3.5.4.

Our results demonstrate that the scalability limit originates at the master node, both because of its own capacity limits and because of inefficiencies in the traffic distribution mechanism. We plan to propose an enhanced method based on the DR approach for our highly available and scalable architecture. The planned improvements include a daemon running on all traffic nodes that reports each node's load to the distribution mechanism, allowing it to perform a more efficient and dynamic distribution.
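To make the planned mechanism concrete, the sketch below shows only the reporting side: a traffic node periodically sends its current load value to the distributor. The UDP transport, the port number, the address, and the report interval are illustrative assumptions, not the implementation described later in Chapter 4.

# Illustrative sketch only: periodic load reports from a traffic node to the
# traffic distributor. The transport (UDP), port, address and interval are assumptions.
import json, os, socket, time

DISTRIBUTOR = ("192.0.2.1", 6941)   # hypothetical master node address and port
INTERVAL = 2.0                      # seconds between reports (assumption)

def current_load():
    """Use the 1-minute load average as a simple load measure."""
    return os.getloadavg()[0]

def report_load_forever(node_name):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        message = json.dumps({"node": node_name, "load": current_load()})
        sock.sendto(message.encode(), DISTRIBUTOR)
        time.sleep(INTERVAL)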

We experienced some problems with the Ethernet device drivers crashing under high traffic load (a throughput of 5,903 KB/s – Figure 33). We improved the device driver code and, as a result, it is now able to sustain a higher throughput. We contributed the improvements back to the Ethernet card provider (ZNYX Networks) and to the open source community.

As for the benchmarking tests, it would have helped to include other metrics, such as processor utilization as well as file system and disk performance metrics, to provide more insight into the bottlenecks. These metrics are not available in WebBench; therefore, the only way to obtain them is either to implement a separate tool ourselves or to use an existing one.

We acknowledge that the performance of the network file system affects the total performance of the system, since the network file system hosts the web documents shared between all traffic nodes. However, the performance of the file system is outside the scope of this work.


3.11 Contributions of the Preparatory Work

We examined current ways of solving scalability challenges and demonstrated scalability problems in

a real system through building a web cluster and benchmarking it for performance and scalability.

Many factors differentiate our early experimental work from others. We did not rely on simulation models to define our system and benchmark it. Instead, we followed a systematic approach, building the web cluster using existing system components and following best practices. In many instances, we have contributed system software, such as Ethernet and NFS redundancy, and introduced enhancements to existing implementations. Similarly, we built our benchmarking environment from scratch and did not rely on simulation models to obtain performance and scalability results. This approach gave us much flexibility and allowed us to test many different configurations in a real-world setting. One unique aspect of our experiments, which we did not see in related work (Sections 2.9 and 2.10), is the scale of the benchmarking environment and the tests we conducted. Our early prototyped cluster consisted of 12 processors and our benchmarking environment consisted of 17 machines.

Surveyed projects were limited in their resources and were not able to demonstrate the negative scalability effects we experienced when reaching 12 processors, simply because they only tested with up to eight processors. In addition, other works relied on simulation to get a sense of how their architecture would perform. In contrast, we performed our benchmarking using an industry-standard tool and workload. The benchmarking tool, WebBench, uses standardized workloads and is capable of generating more traffic and compiling more test results and metrics than the other tools used in the surveyed work, such as SURGE [109], S-Clients [110], WebStone [111], SPECweb99 [112], and TPC-W [113]. A paper comparing these benchmarking tools is available from [114]. Furthermore, the entire server environment uses open source technologies, giving us access to the source code and granting us the freedom to modify it and introduce changes to suit our needs.


Chapter 4 The Architecture of the Highly Available and Scalable Web Server

Cluster

This chapter describes the architecture of the highly available and scalable (HAS) web server cluster. The chapter reviews the initial requirements discussed in previous chapters, and then presents the HAS cluster architecture, its tiers and its characteristics. It discusses the architecture components, presents how they interact with each other, illustrates the supported redundancy models, and discusses the various types of cluster nodes and their characteristics. Furthermore, the chapter includes examples of sample deployments of the HAS architecture as well as case studies that demonstrate how the architecture scales to support increased traffic. The chapter also addresses the subject of fault tolerance and the high availability features. It discusses the traffic management scheme responsible for dynamic traffic distribution and the cluster virtual IP interface, which presents the HAS architecture as a single entity to the outside world. The chapter concludes with the scenario view of the HAS architecture and examines several use cases.

4.1 Architectural Requirements

The architecture provides the infrastructure that allows the platform to meet the requirements of Internet and web servers running in environments that require high availability and high performance. In our context, the architecture requirements for a highly available and scalable web cluster fall into the following categories: scalability, minimal response time, reliable storage, continuous service availability, high performance, high reliability and fault tolerance, and support for the new Internet Protocol version 6 (IPv6).

4.1.1 Scalability

The expansion of the user base requires scaling the capacity of the infrastructure to be in line with

demand. However, with a single server, this means upgrading the server vertically by adding more

memory, or replacing the processor(s) with a faster one. Each upgrade brings the service down for the

duration of the upgrade, which is not desirable. In some cases, the upgrade involves a full deployment

on entirely new hardware resulting in extended downtime. Clustering an application allows us to scale

it horizontally by adding more servers into the service cluster without necessitating service downtime.

However, even when utilizing clustering techniques to scale, the performance gain is not linear.


4.1.2 Minimal Response Time

Achieving a minimal response time is a crucial factor for the success of distributed web servers. Many studies argue that 0.1 second is about the limit for the user to feel that the system is reacting instantaneously [13][14][15]. Web users expect the system to process their requests and provide responses quickly, with high data access rates. Therefore, the server needs to minimize response times to live up to the expectations of its users. The response time consists of the connect time, the processing time, and the response transit time. The goal is to minimize all three of these parameters, resulting in a faster total response time.
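Expressed as a simple equation (the symbols are ours, introduced only for clarity):

T_{\mathrm{response}} = T_{\mathrm{connect}} + T_{\mathrm{process}} + T_{\mathrm{transit}},

so reducing any of the three terms reduces the total response time observed by the user.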

4.1.3 Reliable, High Capacity and High Performance Storage

The web server has to provide reliable, high performance and highly available storage. Storage is

essential to save and maintain application data, and to provide fast data retrieval.

4.1.4 Continuous Service Availability

Web servers have to provide a connection that is highly available and reliable. The larger the number of users and the greater the volume of data the server handles, the more difficult it becomes to guarantee the availability of the service under high loads [17]. All web server components need to be highly available and redundant; otherwise, they present a SPOF. While there are many definitions of high availability, in this context it is the ability of the web cluster to provide continuous service even in the event of failures of individual components.

4.1.5 High Performance

The web server has to sustain a guaranteed number of requests per second. It needs to maximize resource utilization and serve the maximum number of transactions per second that a single web server node can handle. Section 3.7 presents the baseline performance a web server should sustain under a high traffic load.

4.1.6 High Reliability and Fault Tolerance

High reliability and fault tolerance are essential characteristics for highly available web servers that

provide continuous online services. Several challenges arise in web clusters in relation to high

reliability and fault tolerance such as preventing software or hardware failures from interrupting the

provided service.


4.1.7 Supporting the IPv6

IPv6 is the next-generation Internet protocol designed by the Internet Engineering Task Force (IETF) as a replacement for the Internet Protocol version 4 (IPv4). Despite its age, IPv4 has been remarkably resilient; however, it is beginning to demonstrate problems in various feature areas. Its most visible shortcoming is the growing shortage of IPv4 addresses needed by the growing number of devices connecting to the Internet. Other limitations include quality of service, security, auto-configuration, and mobility aspects. To maintain future competitiveness and to transition to an IPv6-only Internet, web clusters need to support IPv6 at the operating system and at the application layer.

4.2 Overview of the Challenges

There are many challenges to address in order to achieve high availability and linear scalability for

web clusters. The following subsections discuss the challenges in the specified areas, and provide a

background on the problems we need to address with the HAS architecture.

4.2.1 High Performance

High performance, in terms of high throughput and maximizing the number of requests per second, is an important characteristic of web servers. The challenges in this area are to improve response time and provide the maximum possible throughput in terms of KB/s. Section 3.7 establishes the baseline performance of a single processor, ranging between 1,000 and 1,050 requests per second. Our goal with the HAS cluster is to maintain this level of performance and throughput as we scale the number of processors in the cluster.

4.2.2 Continuous Service Availability

In an environment providing critical Internet and web services, it is necessary to provide non-stop

operations and eliminate single points of failure. A SPOF exists if a cluster component providing a

cluster function fails leading to service discontinuity. Potential single points of failure in a cluster

include the following cluster components: master and traffic nodes, cluster software components,

cluster support services, networks and network adapters, and storage nodes and disks. The HAS

architecture supports various redundancy capabilities and provides functionalities to eliminate single

points of failures. Section 4.7 discusses these capabilities. Furthermore, we need the capability of


monitoring the availability of the web server application running on the traffic nodes and ensuring

that the application is up and running. Otherwise, the master nodes will forward traffic to a node that

has an unresponsive web server application. Section 4.20 discusses this functionality.

4.2.3 Cluster Single Interface

One of the challenges with clusters that consist of multiple nodes is representing the cluster nodes as a single entity to the outside world. For instance, one common scenario is to have thousands of users requesting web pages over the network from a web cluster that consists of multiple web server nodes. This creates the need for a single cluster interface that is scalable enough to support a high number of connections and highly available so that it does not present a bottleneck or a SPOF. As a result, the challenge is to allow a virtually infinite number of clients to reach a virtually infinite number of web servers presented as a single virtual IP address. Section 4.21 addresses this challenge.

4.2.4 Efficient and Dynamic Traffic Distribution

A cluster consists of multiple nodes of heterogeneous hardware receiving thousands of requests per

second. The challenge in this area is to distribute the incoming web traffic efficiently amongst all

cluster nodes, taking into consideration the characteristics of each node. Section 4.23 addresses this

challenge, and discusses the solution of the HAS architecture.

4.2.5 Support for IPv6

The main challenge in this area is to support IPv6 across the architecture and all of its software components, such as the traffic distribution mechanism, the cluster virtual interface, application servers, and the Ethernet redundancy daemons. In addition, we need to support IPv6 at the kernel level on all cluster nodes, and support IPv6 routing as well as domain name system (DNS) name lookups. Section 4.28 presents the support for IPv6 in the HAS architecture.

4.3 The HAS Architecture

The HAS architecture consists of a collection of loosely coupled computing elements, referred to as

nodes or processors, that form what appears to users as a single highly available web cluster. There

are no shared resources between nodes with the exception of storage and access to networks. The

HAS architecture allows the addition of nodes to the cluster to accommodate increased traffic,


without performance degradation and while maintaining the level of performance for up to 16

processors (Chapter 5).

Figure 43 illustrates the conceptual model of the architecture showing the three tiers of the

architecture and the software components inside each tier. In addition, it shows the supported

redundancy models per tier. For instance, the high availability (HA) tier supports the 1+1 redundancy

model (active/active and active/standby) and can be expanded to support the N+M redundancy model,

where N nodes are active and M nodes are standby. Similarly, the scalability and service availability

(SSA) tier supports the N-way redundancy model where all traffic nodes are active and servicing

requests. This tier can be expanded as well to support the N+M redundancy model. Sections 4.9, 4.10,

and 4.11 discuss the supported redundancy models.

The HAS architecture is composed of three logical tiers: the high availability (HA) tier, the scalability

and service availability (SSA) tier, and the storage tier. This section presents the architecture tiers at a

high level.

The high availability tier: This tier consists of front-end systems called master nodes. Master nodes

provide an entry-point to the cluster acting as dispatchers, and provide cluster services for all HAS

cluster nodes. They forward incoming web traffic to the traffic nodes in the SSA tier according to the

scheduling algorithm. Section 4.5.1 covers the characteristics of this tier. Section 4.17.1 presents the

characteristics of the master nodes. Section 4.9 discusses the supported redundancy models of the HA

tier.

The scalability and service availability tier: This tier consists of traffic nodes that run application

servers. In the event that all servers are overloaded, the cluster administrator can add more nodes to

this tier to handle the increased workload. As the number of nodes increases in this tier, the cluster

throughput increases and the cluster is able to respond to more traffic. Section 4.6.2 describes the

characteristics of this tier. Section 4.17.2 presents the characteristics of the traffic nodes. Section 4.10

discusses the supported redundancy models of the SSA tier.

The storage tier: This tier consists of nodes that provide storage services for all cluster nodes so that

web servers share the same set of content. Section 4.6.3 describes the characteristics of this tier and

Section 4.17.3 describes the characteristics of the storage nodes. Section 4.11 presents the supported

redundancy models of the storage tier. The HAS cluster prototype did not utilize specialized storage

nodes. Instead, it utilized a contributed extension to the NFS to support HA storage.


Figure 43: The HAS architecture

[Figure 43 shows the three tiers (the HA tier, the SSA tier, and the optional storage tier), their supported redundancy models (1+1 for the HA tier and N-way for the SSA tier, each expandable to N+M), and the software components running on the master nodes, traffic nodes, and shared storage, together with the external components (end users, management console, and redundant image servers). Legend: CCP1/CCP2 – cluster communication paths (LAN 1 and LAN 2); CVIP – cluster virtual IP layer; TM – traffic manager; TCD – traffic client daemon; LdirectorD – Linux director daemon; EthD – Ethernet redundancy daemon; HBD – heartbeat mechanism; CCM – cluster configuration manager; RCM – redundancy configuration manager; RAD – IPv6 router advertisement daemon; NTP – network time protocol; DHCP – dynamic host configuration protocol; NFS – network file server.]


4.4 HAS Architecture Components

Each of the HAS architecture tiers consists of several nodes, and each node runs specific software components. A software component (or system software) is a stand-alone set of code that provides

service either to users or to other system software. A component can be internal to the cluster and

represents a set of resources contained on the cluster physical nodes; a component can also be

external to the cluster and represents a set of resources that are external to the cluster physical nodes.

Components can be either software or hardware components. Core components are essential to the

operation of the cluster. On the other hand, optional components are used depending on the usage and

deployment model of the HAS cluster.

The HAS architecture is flexible and allows administrators of the cluster to add their own software

components. The following sub-sections present the components of the HAS architecture, categorize

the components as internal or external, discuss their capabilities, functions, input, output, interfaces,

and describe how they interact with each other.

4.4.1 Architecture Internal Components

The internal components of the architecture include the master nodes, traffic nodes, storage nodes,

routers, and local networks.

Master nodes are members of the HA tier. They provide the key entry point into the system through the cluster virtual IP interface and act as dispatchers, forwarding incoming traffic from the cluster virtual IP interface to the traffic nodes located in the SSA tier. Master nodes also provide cluster services to the cluster nodes, such as DHCP, NTP, TFTP, and NFS. Section 4.17.1 discusses the characteristics of master nodes.

Traffic nodes reside in the SSA tier of the HAS architecture. Traffic nodes run application servers, such as a web server, and are responsible for replying to client requests. Section 4.17.2 discusses the characteristics of traffic nodes.

Storage nodes are located in the storage tier, and provide HA shared storage using multi-node access

to redundant mirrored storage. Storage nodes are optional nodes. If the administrators of the HAS

cluster choose not to deploy specialized storage nodes, they can instead use a modified

implementation of the NFS providing HA capabilities.

The architecture supports two local networks, also called cluster communication paths, to provide

connectivity between all cluster nodes. Each of the networks connects to a different router.


4.4.2 Architecture External Components

External components to the cluster represent a set of resources that are external to the cluster physical

nodes and include external networks, to which the cluster is connected, the management console

through which we administer the cluster, and web users. External networks connect the cluster nodes

to the Internet or the outside world. The management console is an external component to the cluster

through which the cluster administrator logs in and performs cluster management operations. Web

users, also called web clients, are the service requesters. A user can be a human being, an external

device, or another computer system. Image servers are optional external components to the HAS

architecture. We can use master nodes to provide the functionalities provided by the image servers

(DHCP and TFTP services for booting and installation of traffic nodes). However, we recommend

dedicating the resources on master nodes to serve incoming traffic.

4.4.3 Architecture Software Components

The architecture consists of several system software components. Some of the components are

essential to the operation of the cluster and run by default. However, we expect that across all

possible deployments of highly available web clusters, there are trade-offs between available

functions, acceptable levels of system complexity, security needs, and administrator preferences.

The HAS system software components include: traffic client daemon (TCD), cluster virtual IP

interface (the routed process), traffic manager daemon (TM), IPv6 router advertisement daemon,

DHCP daemon, NTP daemon, HA NFS server daemon, TFTP daemon, heartbeat service (HBD), the

Linux director daemon (LDirectord), the saru module (connection synchronization manager), and the

Ethernet redundancy daemon (EthD).

The traffic client daemon is a core system software that runs on all traffic nodes. It computes the

load_index of traffic nodes, and reports it to the traffic managers running on the master nodes. The

traffic client daemon allows the traffic manager running on master nodes to provide efficient traffic

distribution based on the load of each traffic node. The metrics collected by the traffic client daemon

are the processor load and memory usage. The implementation of the traffic client supports IPv6.

Section 4.23.5 discusses the traffic client daemon.
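As an illustration of how such a load index could be derived from the two metrics named above, the following sketch reads the processor load and memory usage from the Linux /proc interface and combines them into a single value. The 50/50 weighting and the helper names are our own assumptions for illustration, not the thesis implementation.

# Illustrative sketch only: compute a load_index from processor load and
# memory usage, the two metrics the traffic client daemon collects.
# The 50/50 weighting and the helper names are assumptions.

def cpu_load():
    """1-minute load average from /proc/loadavg (Linux)."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def memory_usage():
    """Fraction of memory in use, from /proc/meminfo (Linux)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            info[key] = int(value.split()[0])  # values are reported in kB
    return 1.0 - info["MemFree"] / info["MemTotal"]

def load_index(cpu_weight=0.5, mem_weight=0.5):
    """Combine the two metrics into one value; higher means more loaded."""
    return cpu_weight * cpu_load() + mem_weight * memory_usage()

if __name__ == "__main__":
    print("load_index = %.3f" % load_index())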

The cluster virtual IP interface (CVIP) is a core system component that runs on the master nodes in

the HA tier of the HAS architecture. The CVIP masks the HAS architecture internals, and makes it

appear as a single server to external users, who are not aware of the internals of the cluster, such


as how many nodes exist or where the applications run. The CVIP allows a virtually infinite number

of clients to reach a virtually infinite number of servers presented as a single virtual IP address,

without impact on client or server applications. The CVIP operates at the IP level, enabling

applications that run on top of IP to take advantage of the transparency it provides. The CVIP

supports IPv6 and is capable of handling incoming IPv6 traffic. Section 4.21 discusses the cluster

virtual IP interface.

The traffic manager daemon (TM) is a core system component that runs on the cluster master nodes.

The traffic manager receives the load index of the traffic nodes from the traffic client daemons,

maintains the list of available traffic nodes and their load index, and executes the distribution of

traffic to the traffic nodes based on the defined distribution policy in its configuration file. The current

traffic manager implementation supports round robin and the HAS distribution. However, the traffic

manager can support more policies. The traffic manager supports IPv6. Section 4.23.3 discusses the

traffic manager.
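The following sketch illustrates the kind of bookkeeping and selection the paragraph above describes: a node table updated by load reports, with a round-robin policy and a simple "pick the least-loaded node" policy. The latter is only a stand-in for illustration and is not the HAS distribution policy specified in Section 4.23; the class and method names are assumptions.

# Illustrative sketch only: a traffic-manager-style node table with two
# selection policies. "least_loaded" is a stand-in, not the HAS distribution.
import itertools

class NodeTable:
    def __init__(self, nodes):
        # node name -> last reported load_index (lower means less loaded)
        self.loads = {n: 0.0 for n in nodes}
        self._rr = itertools.cycle(sorted(self.loads))

    def report(self, node, load_index):
        """Called when a traffic client daemon reports its load."""
        self.loads[node] = load_index

    def remove(self, node):
        """Stop forwarding to an unavailable node."""
        self.loads.pop(node, None)
        self._rr = itertools.cycle(sorted(self.loads))

    def round_robin(self):
        return next(self._rr)

    def least_loaded(self):
        return min(self.loads, key=self.loads.get)

table = NodeTable(["node1", "node2", "node3"])
table.report("node1", 0.5)
table.report("node2", 0.8)
table.report("node3", 0.2)
print(table.round_robin(), table.least_loaded())  # node1 node3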

The cluster configuration manager (CCM) is a system software that manages all the configuration

files that control the operation of the HAS architecture software components. It provides a centralized

single access point for editing and managing all the configuration files. For the purpose of this work,

we did not implement the cluster configuration manager. However, it is a high priority future work.

At the time of writing, the HAS architecture prototype maintains the configuration files of the various software components on the network file system.

The redundancy configuration manager (RCM) is responsible for switching the redundancy

configuration of each cluster tier from one redundancy configuration to another, such as from the 1+1

active/standby to the 1+1 active/active. It is also responsible for switching service between

components when the cluster tiers follow the N+M redundancy model. Therefore, it should be aware of the active nodes in the cluster and their corresponding standby nodes. For the purpose of this work, we did not implement the redundancy configuration manager. Section 6.2.4 discusses the RCM as a future work item.

The IPv6 router advertisement daemon is an optional system software component that is used only

when the HAS architecture needs to support IPv6. It offers automatic IPv6 configuration for network

interfaces for all cluster server nodes. It ensures that all the HAS cluster nodes can communicate with

each other and with network elements outside the HAS architecture over IPv6.


The cluster administrator has the option of using the DHCP daemon (an optional server service) to

configure IPv4 addresses to the cluster server nodes.

The NTP service is a required system service used to synchronize the time on all cluster server nodes.

It is essential to the operation of other software components that rely on time stamps to verify if a

node is in service or not. Alternatively, we can use a time synchronization service provided by an

external server located on the Internet. However, this poses security risks and is not recommended.

As for storage, we provide an enhanced implementation of the Linux kernel NFS server

implementation that supports NFS redundancy and eliminates the NFS server as a SPOF. Section 4.16

discusses the storage models and the various available possibilities.

The TFTP service daemon is an optional software component used in collaboration with the DHCP

service to provide the functionalities of an image server. The image server provides an initial kernel

and ramdisk image for diskless server nodes within the HAS system. The TFTP daemon supports

IPv6 and is capable of receiving requests to download kernel and ramdisk images over IPv6.

The heartbeat service (HBD) runs on master nodes and sends heartbeat packets across the network to

the other instances of Heartbeat (running on other master nodes) as a keep-alive type message. When

the standby master node no longer receives heartbeat packets, it assumes that the active master node

is dead, and then the standby node becomes primary. The heartbeat mechanism is a contribution from

the Linux-HA project [20]. We have contributed enhancements to the heartbeat service to

accommodate the HAS architecture requirements. Section 4.19 discusses the heartbeat service and

its integration with the HAS architecture.
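The essence of this keep-alive exchange can be sketched as follows. The UDP transport, the port number, the one-second send interval, and the three-second timeout are illustrative assumptions only; the prototype uses the Linux-HA heartbeat implementation, with the failure-detection delay discussed in Section 4.6.1.

# Illustrative sketch only: a keep-alive exchange between master nodes.
# UDP, port 6940, the 1 s send interval and the 3 s timeout are assumptions;
# the actual mechanism is the Linux-HA heartbeat service.
import socket, time

PORT, INTERVAL, TIMEOUT = 6940, 1.0, 3.0

def send_heartbeats(peer_address):
    """Active master: periodically announce that it is alive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(b"alive", (peer_address, PORT))
        time.sleep(INTERVAL)

def watch_active():
    """Standby master: declare the active node dead after TIMEOUT seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    sock.settimeout(TIMEOUT)
    while True:
        try:
            sock.recvfrom(64)          # heartbeat received; remain standby
        except socket.timeout:
            return "take over as active master"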

The Linux director daemon (LDirectord) is responsible for monitoring the availability of the web

server application running on the traffic nodes by connecting to them, making an HTTP request, and

checking the result. If the LDirectord module discovers that the web server application is not

available on a traffic node, it communicates with the traffic manager to ensure that the traffic manager

does not forward traffic to that specific traffic node. Section 4.20 presents the functionalities of the

LDirectord.
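The check itself amounts to a periodic HTTP probe; the sketch below shows the idea. The probed URL path, the timeout, and the notification callback are illustrative assumptions rather than the LDirectord implementation.

# Illustrative sketch only: an HTTP health probe in the spirit of the check
# LDirectord performs. The URL path, timeout and callback are assumptions.
import urllib.request

def web_server_is_healthy(node_address, path="/index.html", timeout=2.0):
    """Return True if the node's web server answers the probe with HTTP 200."""
    url = "http://%s%s" % (node_address, path)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

def check_traffic_nodes(nodes, notify_traffic_manager):
    """Probe every traffic node and report the unhealthy ones."""
    for node in nodes:
        if not web_server_is_healthy(node):
            notify_traffic_manager(node)   # e.g. remove the node from the table

# Example: check_traffic_nodes(["10.0.0.11", "10.0.0.12"], print)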

4.5 HAS Architecture Tiers

The following subsections describe the three HAS architecture tiers illustrated in Figure 43: the high

availability tier, scalability and service availability tier, and the storage tier.


4.5.1 The High Availability Tier

The HA tier consists of master nodes that act as a dispatcher for the SSA tier. The role of the master

node is similar to a connection manager or a dispatcher. The HA tier does not tolerate service

downtime. If the master nodes are not available, the traffic nodes in the SSA tier become unreachable

and as a result, the HAS cluster cannot accept incoming traffic. The primary functions of the nodes in

this tier are to handle incoming traffic and distribute it to traffic nodes located in the SSA tier, and to

provide cluster infrastructure services to all cluster nodes.

The HA tier consists of two nodes configured following the 1+1 active/standby redundancy model.

The architecture supports extending the redundancy model of this tier to the 1+1 active/active redundancy model. With the 1+1 active/active redundancy model, master nodes share the servicing of incoming traffic to avoid bottlenecks at the HA tier level. Another possible extension to the redundancy model is the support of the N-way and N+M redundancy models, which allow the HA tier to scale the number of master nodes one at a time. However, this requires a complex implementation and is not yet supported. Section 4.8 describes the supported redundancy models.

Furthermore, the master nodes provide cluster wide services to nodes located in the SSA and the

storage tiers. The HA tier controls the activity in the SSA tier, since it forwards incoming requests to

the traffic nodes. Therefore, the HA tier needs to determine the status of traffic nodes and be able to

reliably communicate with each traffic node. The HA tier uses traffic managers to receive load

information from traffic nodes. Section 4.6.1 presents the characteristics of the HA tier. Section 4.8

discusses the redundancy models supported by this tier.

4.5.2 The Scalability and Service Availability Tier

The scalability and service availability (SSA) tier consists of traffic nodes that run web server

applications. The main task of the traffic nodes is to receive, process, and reply to incoming requests.

This tier provides a redundant application execution cluster. In the event that the traffic manager daemon running on the master nodes ceases to receive load notifications from the traffic client daemon, the traffic manager stops forwarding traffic to that specific traffic node. Section 4.27.9

discusses this scenario.

If the traffic node becomes unresponsive (after a timeout limit defined in the configuration file), the

traffic manager declares the traffic node unavailable and stops forwarding traffic to it. Section 4.27.9

discusses this use case and presents its sequence diagram. The supported redundancy model is the N-way model: all nodes are active and there are no standby nodes. As such, redundancy is at the node level.

Section 4.6.2 presents the characteristics of the SSA tier. Section 4.10 discusses the redundancy

models supported by this tier.

4.5.3 The Storage Tier

The storage tier consists of specialized nodes that provide shared storage for the cluster. This tier

differs from the other two tiers in that it is an optional tier, and not required if the master nodes are

providing shared storage through a distributed file system. Since storage is not the focus of this

dissertation, we do not explore this area in depth. Instead, our interest in this area is limited to

providing a highly available storage and access to storage independently from the storage techniques.

For instance, to avoid single points of failure, we provide redundant access paths from the cluster

nodes to the shared storage. Our efforts in this area include contributing an HA extension to the Network File Server (NFS). In the modified version of the NFS, the file system supports two redundant servers, where the failure of one server is transparent to the users. Master nodes can provide the storage service using the highly available NFS implementation, which requires a modified implementation of the mount program. Cluster nodes use the modified mount program to mount two network file servers simultaneously at the same mount point. Section 4.6.3 presents the

characteristics of the storage tier. Section 4.11 presents the redundancy models supported by this tier.

4.6 Characteristics of the HAS Cluster Architecture

The HAS architecture is the combination of the HA, SSA, and storage tiers. The architecture inherits

the characteristics of all tiers, discussed in Sections 4.6.1, 4.6.2, and 4.6.3. The HAS architecture does

not require specialized hardware or commercial software components; instead, the architecture is

implementable using COTS hardware and open source software components. The following sub-

sections discuss the characteristics of each architecture tier.

4.6.1 Characteristics of the HA Tier

The HA tier consists of two master nodes that provide cluster services to traffic nodes and direct

incoming requests to the SSA tier. The HA tier has the following characteristics:

- No single point of failure: If a master node fails and becomes unavailable, the standby master

node takes over in a transparent fashion. Section 4.27.8 presents the sequence diagram of this

case scenario. The HA tier provides node level redundancy. All software components are


redundant and available on both master nodes. If one master node fails, the other master node

handles incoming traffic, and provides cluster services to cluster nodes. Section 4.8 discusses the

redundancy model of the HA tier.

- Sensible repair and replacement model: The architecture allows the upgrade or replacement of

failed components without affecting the service availability. For instance, if the active master

node requires a processor upgrade, then the administrator gracefully switches control from the

active master node to the standby node; the active master becomes standby and the standby

master becomes active. The cluster continues to provide services without associated downtime

and without performance penalties, with the exception that load sharing among master nodes

(1+1 active/active) is not possible since there is only one master node available. With such a

model, upgrades do not require cluster or service downtime, unlike the case with SMP or MPP systems. Instead, only the affected node is involved, and the service continues to be

available to end users.

- Shared storage: Master nodes require private disk storage to maintain the configuration files for

the services they provide to all cluster nodes. Application data on the other hand is stored on

shared storage. If a traffic node fails, the application data is still available to the surviving traffic

nodes. Therefore, all application data is available exclusively on redundant highly available

storage and is not dependent on the serving nodes.

- Master node failure detection: The heartbeat mechanism running on master nodes ensures that

the standby master node detects the failure of the active master node within a delay of 200 ms.

Section 4.19 describes how the architecture handles the failure of a master node.

- Number of nodes in the high availability tier: The current implementation of the HA tier supports

two master nodes. In this model, the standby node is available to allow a quick transition when

the active node fails, or for load-sharing purposes. We can add more master nodes to provide

additional protection against multiple failures. The HA tier can scale from two to N master nodes.

Although the addition of nodes would provide additional protection and higher capacity, it

introduces significant complexity to the implementation of the system software. The current

HAS architecture prototype supports the 1+1 redundancy model in both active/standby and

active/active modes. Section 4.8 describes the supported redundancy models.

- Cluster Virtual IP Interface (CVIP): The cluster virtual IP interface is a transparent layer that

presents the cluster as a single entity to the outside world, whether it is users to the cluster, or


other systems. Users of the system are not aware that the web system consists of multiple nodes,

and they do not know where the applications run. Section 4.21 describes the CVIP interface.

4.6.2 Characteristics of the SSA Tier

The SSA tier consists of multiple independent traffic nodes, each running a copy of the Apache web

server software (version 2.0.35), and servicing incoming requests. Theoretically, there are no limitations on the number of traffic nodes in this tier: the number of nodes can be N, where N ≥ 2, to ensure that

traffic nodes do not constitute a SPOF. The HAS architecture prototype consisted of 2 master nodes

and 16 traffic nodes. Redundancy in the SSA tier is at the node level. Requests coming from the HA

tier are assigned to traffic nodes based on the traffic distribution algorithm (Section 4.23). Traffic

nodes reply to requests directly to the web clients, eliminating a possible bottleneck at the master nodes.

The characteristics of the SSA tier are the following:

- Node availability: Single node availability is less important in this tier, since there are N traffic

nodes available to serve requests. If one traffic node becomes unavailable, the master node

removes the failed traffic node from its list of available traffic nodes, and directs traffic to

available traffic nodes. Section 4.27.9 presents the case scenario of a failing traffic node.

- Traffic node failure detection: In the same way the heartbeat mechanism checks the health of the

master nodes in the HA tier, the HAS architecture supports mechanisms to verify that a traffic

node is up and running, and that the web software application is providing service to web clients.

We use two methods simultaneously to ensure that traffic nodes are healthy and that the web

server application is up and running. The first check is a continuous communication between the

traffic manager running on the master node and the traffic client running on the traffic node

(Section 4.23). The second check is an application check that ensures that the web server

application is up and running (Section 4.20).

- Support for diskless nodes: The SSA tier supports diskless traffic nodes. Traffic nodes are not

required to have local disks and rely on image servers for booting and downloading a kernel and

disk image into their memory.

- Seamless upgrades without service interruption: With the HAS architecture, it is possible to

upgrade the operating system and the application software without disturbing the service

availability. We provide the mechanisms to upgrade the kernel and application automatically

through an image server, by rebooting the nodes. Upon reboot, the traffic node downloads an


updated kernel and the new version of the ramdisk (if the node is diskless) or a new disk image (if the node has a disk) from the image server. Section 4.27.5 presents this upgrade scenario with the

sequence diagram.

- Hosting application servers: Traffic nodes run application servers that can be stateful or stateless.

If an application requires state information, then the application saves the state information on the

shared storage and makes it available to all cluster nodes.

4.6.3 Characteristics of the Storage Tier

Data oriented applications require that storage and access to storage is available at all times. To

ensure reliability and high availability, the architecture does not allow application data to be stored on

local node storage devices. Application servers maintain their persistent data on the shared storage

nodes located in the storage tier, or on shared storage managed by master nodes through a redundant

network file server (Section 4.16.2). Since storage techniques are outside the scope of the thesis, we

do not provide full coverage of this area. However, the architecture is flexible enough to support multiple

ways of providing storage through specialized storage nodes, software RAID and distributed file

systems. The supported storage models include:

- Specialized storage nodes use techniques such as storage area networks or network-attached

storage to provide highly available shared storage over a network to a large network of users.

- RAID techniques are established techniques for achieving data reliability through redundant copies of the data. Since we are using COTS components, software RAID is our choice to provide redundant and reliable data storage. In addition, software RAID techniques do not require specialized hardware; therefore, we can use them on master nodes when the master nodes are

providing shared storage through the NFS.

- Distributed file systems allow us to combine all the disk storage on multiple cluster nodes under

one virtual file system.

4.7 Availability and Single Points of Failures

The architecture provides facilities to enable fast error detection, and provides correction procedures to recover when possible. One of our goals is to increase service availability. Availability is a measure of the time that a server is functioning normally. We calculate the availability as


A = \left( \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \right) \times 100,

where A is the percentage of availability, MTBF is the mean time between failures, and MTTR is the mean time to repair or resolve a particular problem. According to the formula, we calculate availability A as the percentage of uptime for a given period, taking into account the time the system requires to recover from unplanned failures and planned upgrades. As MTTR approaches zero, the availability percentage A increases towards 100 percent. As the MTBF value increases, MTTR has less impact on A. Following this formula, there are two possible ways to increase availability: increasing MTBF and decreasing MTTR. Increasing MTBF involves improving the quality or robustness of the software and using redundancy to remove single points of failure. As for decreasing MTTR, our focus in the implementation of the system software is to streamline and accelerate fail-over, respond quickly to fault conditions, make faults more granular in time and scope so that we have many short faults rather than a smaller number of long ones, and limit the scope of faults to smaller components.
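As a brief numerical illustration (the figures below are hypothetical and chosen only to show how the formula behaves), an MTBF of 10,000 hours gives

A = \frac{10{,}000}{10{,}000 + 10} \times 100 \approx 99.90\% \quad (\mathrm{MTTR} = 10\ \mathrm{hours}), \qquad A = \frac{10{,}000}{10{,}000 + 1} \times 100 \approx 99.99\% \quad (\mathrm{MTTR} = 1\ \mathrm{hour}).

In other words, for a fixed MTBF, reducing the repair time by a factor of ten moves the availability from roughly "three nines" to roughly "four nines".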

To increase availability across the HAS architecture components, we need to avoid any SPOF. The following subsections discuss eliminating SPOFs at the level of master nodes, traffic nodes, application servers, networks and network interfaces, and storage nodes.

The HAS architecture supports fault tolerance through features such as the hot-standby data

replication to enable node failure recovery, storage mirroring to enable disk fault recovery, and LAN

redundancy to enable network failure recovery. The topology of the architecture enables failure

tolerance because of the various built-in redundancies within all layers of the HAS architecture.

Figure 44 illustrates the supported redundancy at the different layers of the HAS architecture. The

cluster virtual IP interface (1) provides a transparent layer that hides the internals of the cluster. We

can add or remove master nodes from the cluster without interruptions to the services (2). Each

cluster node has two connections to the network (3) ensuring network connectivity. Many factors

contribute towards achieving network and connection availability such as the availability of

redundant routers and switches, redundant network connections and redundant Ethernet cards. We

contributed an Ethernet redundancy mechanism to ensure high availability for network connections. As

for traffic nodes (4), redundancy is at the node level, allowing us to add and remove traffic nodes

transparently and without service interruption. We can guarantee service availability by providing

multiple instances of the application running on multiple redundant traffic nodes. The HAS


architecture supports storage redundancy (5) through a customized HA implementation of the NFS

server; alternatively, we can also use redundant specialized storage nodes.

Figure 44: Built-in redundancy at different layers of the HAS architecture

The following sub-sections discuss eliminating SPOF at each of the HAS architecture layers.

4.7.1 Eliminating Master Nodes as a Single Point of Failure

The HA tier of the HAS architecture consists of two master nodes that follow the 1+1 redundancy

model. In the 1+1 active/standby redundancy model (Section 4.27.1.1), if the active master node is

unavailable and does not respond to the heartbeat messages, the standby master node declares it as

dead and assumes the active state. Section 4.27.8 discusses this scenario. If the master nodes follow

the 1+1 active/active redundancy model (Section 4.27.1.2), the failures in one master node are

transparent to end users, and do not affect the service availability.

4.7.2 Eliminating Applications as a Single Point of Failure

The primary goal of the HAS architecture is to provide a highly available environment for web server

applications. Each traffic node runs a copy of the web server. These applications represent a SPOF,

since in the event that the application crashes, the service on that traffic node becomes unavailable.

To ensure the availability of these applications, the HAS architecture prototype supports two

mechanisms. The first mechanism is a health check between the traffic manager and the traffic client.

Section 4.23.5 discusses this health check mechanism, and Section 4.27.11 presents the sequence

diagram of this scenario. The second mechanism is an application health check to ensure that the


application is up and running. If the application is not available, the health check mechanism notifies

the traffic manager, which removes the traffic node from its list of available nodes; as a result, the

traffic manager stops forwarding traffic to it. Sections 4.27.9 and 4.27.12 present the sequence

diagrams of this scenario.
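The following minimal sketch illustrates the idea behind the application health check; the hostnames and the HTTP-based probe are assumptions made for illustration and do not reproduce the actual traffic manager and traffic client implementation.

import socket

def http_alive(host, port=80, timeout=2.0):
    # Return True if the web server on 'host' accepts a TCP connection and
    # answers a minimal HTTP request within 'timeout' seconds.
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            return sock.recv(12).startswith(b"HTTP/")
    except OSError:
        return False

def refresh_available_nodes(traffic_nodes):
    # Keep only the traffic nodes whose web server answers the health check;
    # the traffic manager stops forwarding traffic to the others.
    return [node for node in traffic_nodes if http_alive(node)]

# Hypothetical usage: the traffic manager periodically rebuilds its list.
available = refresh_available_nodes(["traffic-node-1", "traffic-node-2"])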

4.7.3 Eliminating Network Adapters as a Single Point of Failure

The Ethernet redundancy daemon, a system software component, handles network adapter failures.

Figure 45 illustrates the Ethernet adapter swapping process. When a network adapter (Ethernet card)

fails, the Ethernet redundancy daemon swaps the roles of the active and standby adapters on that

node. The failure of the active adapter is transparent with a delay of less than 350 ms while the

system switches to the standby Ethernet adapter.

Figure 45: The process of the network adapter swap
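As a rough sketch of the swap logic (the interface names and the carrier-based link check are illustrative assumptions; the actual Ethernet redundancy daemon is part of the contributed system software):

import time

def link_up(interface):
    # On Linux, /sys/class/net/<interface>/carrier reads "1" when the link is up.
    try:
        with open(f"/sys/class/net/{interface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        return False

def monitor_adapters(active="eth0", standby="eth1", interval=0.1):
    # Poll the active adapter and swap roles when its link goes down, which
    # mimics the role swap performed by the Ethernet redundancy daemon.
    while True:
        if not link_up(active):
            active, standby = standby, active
            print(f"active adapter is now {active}")
        time.sleep(interval)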

4.7.4 Eliminating Storage as a Single Point of Failure

Shared storage is a possible SPOF in the cluster if it relies on a single storage server. Section 4.16 presents the physical storage model, which covers eliminating storage as a SPOF using both a

custom implementation of a highly available network file system (Section 4.16.2), and redundant

specialized storage nodes (Section 4.16.3).

4.8 Overview of Redundancy Models

The following sub-sections examine the various redundancy models: the 1+1 two nodes redundancy

model, the N+M and N-way redundancy models, and the no redundancy model.


4.8.1 The 1+1 Redundancy Model

There are two types of the 1+1 redundancy model (also called two-node redundancy model): the

active/standby, which is also called the asymmetric model, and the active/active or the symmetric

redundancy model [115]. With the 1+1 active/standby redundancy model, one cluster node is active

performing critical work, while the other node is a dedicated standby, ready to take over should the active master node fail. In the 1+1 active/active redundancy model, both nodes are active and doing critical work. In the event that either node fails, the surviving node steps in to service the load of the failed node until that node is back in service.

4.8.2 The N+M Redundancy Model

In the N+M redundancy model, the cluster tier supports N active nodes and M standby nodes. If an active node fails, a standby node takes over the active role. If M=1, the model becomes the N+1 redundancy model, where the "+1" node is a standby, ready to become active when one of the active nodes fails. The N+1 redundancy model can be applied to services that do not require a high level of availability because an N+1 cluster cannot provide a standby node for every active node. N+1 cannot perform hot standby because the standby node never knows which active node will fail. One advantage of N+M (with M ≥ 2) over N+1 is that it provides higher availability should more than one node fail, without affecting the performance and throughput of the cluster.

4.8.3 The N-way Redundancy Model

In the N-way redundancy model, all N nodes are active. Redundancy is at the node level.

4.8.4 The “No Redundancy” Model

The “no redundancy” model is used with non-critical services, when the failure of a component does

not cause a severe impact on the overall system. Following this model, an active component does not

have a standby ready to take over if the active component fails. We list this model for completeness.

4.9 HA Tier Redundancy Models

The HA tier supports three redundancy models: the 1+1 active/standby, 1+1 active/active and the N-

way redundancy model. The current prototype of the HA tier supports the 1+1 redundancy model in

both configurations: active/standby and active/active. Support for the N+M and for the N-way

redundancy models is a future work item.


4.9.1 The HA Tier Active/Standby Redundancy Model

Figure 46 illustrates the 1+1 active/standby redundancy model supported by the HA tier, which consists of two master nodes: one active and one standby. The active master node hosts active processes, and the standby node hosts standby processes that are ready to take over when the primary node fails. The active/standby model enables fast recovery upon the failure of the active

node.

Figure 46: The 1+1 active/standby redundancy model

Figure 47: Illustration of the failure of the active node


Figure 47 illustrates the 1+1 active/standby pair after the failover has completed. The active/standby

redundancy model supports connection synchronization between the two master nodes. Section 4.22

discusses connection synchronization.

The HA tier can transition from the 1+1 active/standby to the 1+1 active/active redundancy model

through the redundancy configuration manager, which is responsible for switching from one

redundancy model to another. The 1+1 active/standby redundancy model provides high availability;

however, it requires a master node to sit idle waiting for the active node to fail so it can take over. The

active/standby model leads to a waste of resources and limits the capacity of the HA tier.

The 1+1 active/active redundancy model, discussed in the following section, addresses this problem

by allowing the two master nodes to be active and to serve incoming requests for the same virtual

service.

4.9.2 The HA Tier Active/Active Redundancy Model

The active/standby redundancy model offers one way of providing high availability. However, we

would like to take advantage of the standby node and not have it sitting idle. By moving to the

active/active redundancy model, both nodes in this tier are active, resulting in an increased cluster

throughput. The active/active model allows two nodes to load balance the incoming connections for

the same virtual service at the same time. It provides a solution where the master nodes are servicing

incoming traffic and distributing it to the traffic nodes; as a result, we increase the capacity of the

whole cluster, while still maintaining the high availability of master nodes.

Figure 48 illustrates the 1+1 active/active redundancy model, which supports load sharing between

the two master nodes in addition to concurrent access to the shared storage. If the HAS cluster

experiences additional traffic, we can extend the number of nodes in this tier and support the N-way

redundancy model, where all nodes in the tier are active and provide service. However, increasing the

number of nodes beyond two nodes increases the complexity of the implementation.

The proposed method of providing active/active in the HA tier of a HAS cluster is to have all master

nodes configured with the same Ethernet hardware address (MAC) and IP address. Originally, the

support of the active/active was not planned as part of the HAS architecture prototype; instead, the

HAS prototype supported the active/standby and the benchmarking tests (Chapter 5) were performed

on the HAS prototype with the HA tier configured in the active/standby redundancy model. The

support for the active/active redundancy model came at a much later stage.


Figure 48: The 1+1 active/active redundancy model

4.9.2.1 The Role of the Saru Module

When the HA tier is in the active/active redundancy model, each master node has the same hardware

address (MAC) and IP address, and the challenge becomes how to distribute incoming connections between the two active master nodes. For that purpose, we have enhanced the saru

module [116], open source system software, to run in coordination with heartbeat on each of the

master nodes and be responsible for dividing the incoming connections between the two master

nodes. The heartbeat daemon provides a mechanism to determine which master node is available and

the saru module uses this information to divide the space of all possible incoming connections

between the two available master nodes.

If each saru module divided the blocks of the result space independently, an inconsistent state could result between the master nodes, where some blocks are allocated to more than one master node, or

some blocks are not allocated at all. Therefore, one approach to solve this problem is to have a saru

master node and a saru slave node. When the master nodes boot, the saru module starts and elects a

master node to do the allocations. The elected master node divides blocks of source or destination

ports or addresses between the two active master nodes.

The notion of a master and a slave node is only an internal convenience for the saru module. Both master nodes are active and continue to service incoming requests and distribute them to the traffic nodes.
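The following sketch illustrates the idea of dividing the space of incoming connections into blocks; the hash-based mapping and the block count are assumptions made for illustration, not the saru module's actual algorithm.

import hashlib

BLOCKS = 64  # size of the result space divided between the active master nodes

def block_of(source_ip, source_port):
    # Map a connection's source address and port onto one of the BLOCKS blocks.
    digest = hashlib.md5(f"{source_ip}:{source_port}".encode()).digest()
    return digest[0] % BLOCKS

# The elected saru master allocates the blocks, e.g. even blocks to master
# node 1 and odd blocks to master node 2; each node only accepts connections
# that fall into its own blocks and ignores the rest.
node1_blocks = set(range(0, BLOCKS, 2))
node2_blocks = set(range(1, BLOCKS, 2))

def accepts(my_blocks, source_ip, source_port):
    return block_of(source_ip, source_port) in my_blocks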


4.10 SSA Tier Redundancy Models

The SSA tier supports the N+M and the N-way redundancy models. In the N+M model, N is the

number of active traffic nodes hosting the active web server application, and M is the number of

standby traffic nodes. When M=0, it is the N-way redundancy model where all traffic nodes are

active. Following the N-way redundancy model, traffic nodes operate without standby nodes. Upon

the failure of an active traffic node, the traffic manager running on the master node removes the failed

traffic node from its list of available traffic nodes (Section 4.27.9) and redirects traffic to available

traffic nodes.

Figure 49 illustrates the N+M and the N-way redundancy models. The state information of the web server

application running on the active nodes is saved (1) on the HA shared storage (2). When the

application running on the active node fails, the application on the standby node accesses the saved

state information on the shared storage (3) and provides continuous service.

Figure 49: The N+M and N-way redundancy models

In Figure 50, the state information of the application running on the active node is saved (1) on the

HA shared storage.


Figure 50: The N+M redundancy model with support for state replication

In Figure 51, when the application running on the active node fails, the application on the standby node accesses the saved state information on the shared storage and provides continuous service.

Figure 51: The N+M redundancy model, after the failure of an active node
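A minimal sketch of this state hand-off, assuming the HA shared storage is mounted at /mnt/CommonNFS on every traffic node and that the application state can be serialized to a file (the layout is hypothetical):

import json
import os

STATE_DIR = "/mnt/CommonNFS/state"  # HA shared storage visible to all nodes

def save_state(node_name, state):
    # (1) The active node writes its session state to the shared storage.
    os.makedirs(STATE_DIR, exist_ok=True)
    with open(os.path.join(STATE_DIR, f"{node_name}.json"), "w") as f:
        json.dump(state, f)

def take_over(failed_node_name):
    # (3) After a failure, another node reads the saved state from the shared
    # storage and continues to serve the sessions of the failed node.
    with open(os.path.join(STATE_DIR, f"{failed_node_name}.json")) as f:
        return json.load(f)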

Supporting applications that require maintaining state information is not within the scope of this dissertation.

However, the WebBench tool provides dynamic test suites. Therefore, in our testing, we used a

combination of both static (STATIC.TST) and dynamic test suites (WBSSL.TST). The static test

suite contains over 6,200 static pages and executable test files. The dynamic test suites use

applications that run on the server and require maintaining a state. Although supporting applications

with state is not in the scope of the work, the HAS architecture still handles them.


4.11 Storage Tier Redundancy Models

Although storage is outside the scope of our work, the redundancy models of the storage tier depend

on the physical storage model described in Section 4.16.

4.12 Redundancy Model Choices

Figure 52 identifies the various redundancy models supported for master, traffic and storage nodes.

Figure 52: The redundancy models at the physical level of the HAS architecture

Master nodes follow the 1+1 redundancy model. The HA tier hosts two master nodes that interact

with each other following the active/standby model or the active/active (load sharing) model. Traffic

nodes follow one of two redundancy models: N+M (N active and M standby) or N-way (all nodes are

active). In the N+M active/standby redundancy model, N is the number of active traffic nodes

available to service requests. We need at least two active traffic nodes, N ≥ 2. M is the number of


standby traffic nodes, available to replace an active traffic node as soon as it becomes unavailable.

The N-way redundancy model is the N+M redundancy model with M = 0. In the N-way redundancy

model, all traffic nodes are in the active mode and servicing requests with no standby traffic nodes.

When a traffic node becomes unavailable, the traffic manager stops sending traffic to the unavailable

node and redistributes incoming traffic among the remaining available traffic nodes. However, when

standby nodes are available, the throughput of the cluster does not suffer from the loss of a traffic node, since a standby node takes over for the unavailable traffic node.

As for the storage tier, the redundancy model depends on various possibilities ranging from hosting

data on the master nodes to having separate and redundant nodes that are responsible for providing

storage to the cluster. The redundancy configuration manager is responsible for switching from one

redundancy model to another. For the purposes of this work, we did not implement the redundancy

configuration manager (Section 6.2.4). Rather, we relied on re-starting the cluster nodes with

modified configuration files when we wanted to experiment with a different redundancy model.

Table 8 provides a summary of the possible redundancy models. The HAS architecture can accommodate all of these redundancy models; supporting them is an implementation issue.

                1+1 Active/Active   1+1 Active/Standby   N+M   N-Way                 No Redundancy
HA Tier         X                   X                    X     X (N+M, with M=0)     X (one master node)
SSA Tier        X                   X                    X     X                     X (one traffic node)
Storage Tier    X                   X                    X     X                     X (one storage node)

Table 8: The possible redundancy models per each tier of the HAS architecture

Table 9 illustrates the implemented redundancy models for the HAS architecture prototype. At the

HA tier, both the 1+1 active/standby and the 1+1 active/active redundancy models are supported. At

the SSA tier, the HAS architecture supports the N-way redundancy model. The storage tier supports

the 1+1 active/active redundancy model.


                1+1 Active/Active   1+1 Active/Standby   N+M   N-Way   No Redundancy
HA Tier         X                   X                                  X (one master node)
SSA Tier                                                        X      X (one traffic node)
Storage Tier    X                                                      X (one NFS server)

Table 9: The supported redundancy models per each tier in the HAS architecture prototype

4.13 The States of a HAS Cluster Node

A state transition diagram illustrates how a cluster node transitions from one defined state to another as a result of certain events. We represent the states of the nodes as circles, and we label the transitions between the states with the events or failures that change the state of the node. A node starts in an initial state, represented by the closed circle, and can end up in a final state, represented by the bordered circle. A cluster node can be in one of four HA states: active,

standby, in-transition, and out-of-cluster.

We identify two types of error conditions: recoverable and unrecoverable errors. Recoverable errors

such as the failure of an Ethernet adapter do not qualify as error conditions that require the node to

change state because such a failure is recoverable. However, other failures such as a kernel crash are

unrecoverable and as a result require a change of state.

Figure 53 presents the HA states of a node, discarding the standby state for simplicity. We assume that the initial state of a node is active (1). When in the active state, the node is ready to service incoming requests or provide cluster services. When the active node faces a hardware or software problem forcing it to stop service, it transitions (2)(3) to the out-of-cluster state. In this state, the node is not a member of the cluster; it does not receive traffic (traffic node) or provide cluster services to other nodes (master node). When the problem is fixed, the node transitions back to the active state (4)(5) and becomes active again. A node is in-transition when it is changing state. If the node

is changing to an active state (5), after the completion of the transition, the node starts receiving

incoming requests.


Figure 53: The state diagram of a HAS cluster node

Figure 54 represents the state diagram after we expand it to include the standby state, in which the node is not currently providing service but is prepared to take over the active state. This scenario is only applicable to nodes in the HA tier, which supports the 1+1 active/standby redundancy model. When the node is in the active, in-transition, or standby state and it encounters software or hardware problems, it becomes unstable and is no longer a member of the HAS cluster. Its state becomes out-of-cluster and it is no longer available to service traffic. If the transition is from active to standby, the

node stops receiving new requests and providing services, but keeps providing service to ongoing

requests until their termination, when possible; otherwise, ongoing requests are terminated.

The system software components that manage the state transitions are the traffic manager and the heartbeat daemon running on the master nodes, and the traffic client and LDirectord running on the traffic nodes.
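A simplified transition table capturing these states is sketched below; the event names are assumptions made for illustration and do not correspond one-to-one to the messages exchanged by the system software.

# Simplified transition table for the HA states of a HAS cluster node.
# Recoverable errors (such as an Ethernet adapter failure) are not listed
# because they do not change the node's state.
TRANSITIONS = {
    ("active",         "unrecoverable_error"): "out-of-cluster",
    ("standby",        "unrecoverable_error"): "out-of-cluster",
    ("in-transition",  "unrecoverable_error"): "out-of-cluster",
    ("out-of-cluster", "problem_fixed"):       "in-transition",
    ("standby",        "peer_failed"):         "in-transition",
    ("active",         "switch_to_standby"):   "in-transition",
    ("in-transition",  "transition_complete"): "active",
}

def next_state(state, event):
    # Unknown (state, event) pairs leave the node in its current state.
    return TRANSITIONS.get((state, event), state)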


Figure 54: The state diagram including the standby state

4.14 Example Deployment of a HAS Cluster

Figure 55 illustrates an example prototype of the HAS architecture using a distributed file system to

provide shared storage among all cluster nodes. In this example, two master nodes (A and B) provide

storage using the contributed implementation of the highly available NFS (Section 4.16.2). The HA

tier consists of two master nodes in the 1+1 active/standby redundancy model. Node A is the active

node and Node B is the standby node. The SSA tier consists of four traffic nodes each with its own

local storage. The SSA tier follows the N-way redundancy model, where all traffic nodes are active.

There are two redundant local area networks, LAN 1 and LAN 2, connected to two redundant routers.

The minimal prototype of the HAS architecture requires two master nodes and two traffic nodes. In

the event that the incoming web traffic increases, we add more traffic nodes into the SSA tier. If we

scale the number of traffic nodes, no other support nodes are required and no changes are required for

either the master nodes or the current traffic nodes.


Figure 55: A HAS cluster using the HA NFS implementation

Figure 56 illustrates the hardware configuration of an HA-OSCAR cluster [117]. The system consists

of a primary server, a standby server, two LAN connections, and multiple compute clients, where all

the compute clients have homogeneous hardware.

Figure 56: The HA-OSCAR prototype with dual active/standby head nodes

A server is responsible for serving users' requests and distributing the requests to specified clients. A

compute client is dedicated to computation [118]. Each server has three network interface cards; one


interface card connects to the Internet through a public network address, and the other two connect to a private LAN, which consists of a primary Ethernet LAN and a standby LAN. Each LAN consists of network interface cards, a switch, and network wiring, and provides communication between servers and clients, and between the primary server and the standby server. The primary server provides the services and processes all user requests. The standby server activates its services and waits to take over from the primary server when its failure is detected. Heartbeat messages are transmitted periodically across the Ethernet LAN between the two servers and serve as health detection for the primary server. When the primary server fails, the heartbeat detection on the standby server no longer receives response messages from the primary server. After a prescribed time, the standby server takes over the alias IP address of the primary server, and control of the cluster transfers from the primary server to the standby server. User requests are processed on the standby server from then on. From the user's point of view, the transfer is almost seamless except for the short prescribed time. The failed primary server is repaired after the standby server takes over control. Once the repair is completed, the primary server activates its services, takes back the alias IP address, and begins to process user requests. The standby server then releases the alias IP address and goes back to its initial state.

At a regular interval, the running server polls all the LAN components specified in the cluster

configuration file, including the primary LAN cards, the standby LAN cards, and the switches.

Network connection failures are detected in the following manner. The standby LAN interface is

assigned to be the poller. The polling interface sends packet messages to all other interfaces on the

LAN and receives packets back from all other interfaces on the LAN. If an interface cannot receive or

send a message, the numerical count of packets sent and received on an interface does not increment

for an amount of time. At this time, the interface is considered down. When the primary LAN goes

down, the standby LAN takes over the network connection. When the primary LAN is repaired, it

takes over the connection back from the standby LAN.

Whenever a client fails while the cluster is in operation, the cluster undergoes a reconfiguration to

remove the corresponding failing client node or to admit a new node into the cluster. This process is

referred to as a cluster state transition. The HA-OSCAR cluster uses a quorum voting scheme to maintain the system performance requirement, where the quorum Q is the minimum number of functioning clients required for an HPC system. We consider a system with N clients, and assume that each client

is assigned one vote. The minimum number of votes required for a quorum is given by (N+2)/2 [119].


Whenever the total number of the votes contributed by all the functioning clients falls below the

quorum value, the system suspends operation. Upon the availability of a sufficient number of clients to satisfy the quorum, a cluster resumption process takes place and brings the system back to an operational state.
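For instance, in a cluster with N = 8 compute clients, the quorum is Q = (8 + 2)/2 = 5; as soon as fewer than five clients are functioning, the system suspends operation until enough clients rejoin the cluster.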

4.15 The Physical View of the HAS Architecture

The physical model specifies the architecture topology in terms of hardware architecture and

organization of physical resources (processors, network and storage elements). It also describes the

hardware layers and takes into account the system's non-functional requirements such as system

connection availability, reliability and fault-tolerance, performance (in terms of throughput) and

scalability.

The HAS architecture consists of a collection of inter-connected nodes presented to users as a single,

unified computing resource. Nodes are independent compute entities that physically reside on a

common network. The software executes on a network of computers. We map the various elements

identified in the logical, process, and development views onto the various cluster nodes that constitute

the HAS architecture. The mapping of the software components to the nodes is highly flexible and

has minimal impact on the source code itself. We can deploy different physical configurations

depending on deployment and testing purposes.

Figure 57 illustrates the physical view of the architecture, which consists of the following resources:

master nodes, traffic nodes, redundant storage nodes, redundant LANs, one router per LAN, and two

redundant image servers. The HA tier consists of at least two redundant master nodes that accept

incoming traffic from the Internet through a virtual network interface that presents the cluster as a

single entity and provides cluster infrastructure services to all HAS cluster nodes. The HA tier does

not tolerate service downtime because if the master node (acting as dispatcher) goes down, the traffic

nodes are unable to receive traffic to service web clients. The HA tier controls the activity in the SSA

tier and therefore needs to be able to determine the health of traffic nodes and their load.

The SSA tier consists of a farm of independent and possibly diskless traffic nodes. The number of

nodes in this tier scales from two to N nodes. If a node in this tier fails, the dispatcher stops

forwarding traffic to the failed node, and forwards incoming traffic to the remaining available traffic

nodes.

The storage tier consists of at least two redundant specialized storage nodes.


The HAS architecture requires the availability of two routers (or switches) to provide a highly

available and reliable communication path.

An image server is a machine that holds the operating system and ramdisk images of the cluster

nodes. This machine, two for redundancy purposes, is responsible for propagating the images over the

network to the cluster nodes every time there is an upgrade or a new node joins the cluster. Master nodes can provide the functionality of the image server; however, for large deployments this might slow down the performance of the master nodes. Image servers are external, optional components of the architecture.

Figure 57: The physical view of the HAS architecture

We divide the cluster components into the following functional units: master nodes, traffic nodes,

storage nodes, local networks, external networks, network paths, and routers.

The m master nodes form the HA tier in a HAS architecture and implement the 1+1 redundancy

model. The number of master nodes is m = 2. These nodes can be in the active/standby or

active/active mode. When m > 2, the redundancy model becomes the N+M model; however, we did not implement this redundancy model in the HAS prototype. The t traffic nodes are located in the SSA

tier, where t ≥ 2. If t = 1, then there is a single traffic node that constitutes a SPOF. Let s indicate the

number of storage nodes. If s = 0, then the cluster does not include specialized storage nodes; instead

master nodes in the HA tier provide shared storage using a highly available distributed file system.

The HA file system uses the disk space available on the master nodes to host application data. When s


≥ 2, it indicates that at least two specialized nodes are providing storage. When s ≥ 2, we introduce

the notion of d shared disks, where d ≥ 2 x s. The d shared disks are the total number of shared disks

in the cluster. The l local networks provide connectivity between cluster nodes. For redundancy

purposes, l ≥ 2 to provide redundant network paths. However, this is dependent on two parameters:

the number of routers r available (r ≥ 2, one router per network path) and the number of network

interfaces eth available on each node (one eth interface per network path). The HAS architecture

requires a minimum of two Ethernet cards per cluster node; therefore eth ≥ 2. The cluster can be

connected to outside networks, identified as e, where e ≥ 1 to recognize that the cluster is connected

to at least one external network.
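The constraints above can be summarized in a small sketch; the function below is illustrative only (it is not part of the prototype) and simply mirrors the notation m, t, s, d, l, r, eth, and e used in this section.

def valid_has_topology(m, t, s, l, r, eth, e, d=0):
    # Check the minimal HAS topology constraints described in this section.
    checks = [
        m == 2,                # two master nodes in the 1+1 redundancy model
        t >= 2,                # a single traffic node would be a SPOF
        s == 0 or s >= 2,      # either HA NFS on the masters or redundant storage nodes
        s == 0 or d >= 2 * s,  # shared disks only exist with specialized storage nodes
        l >= 2,                # redundant local networks
        r >= 2,                # one router per network path
        eth >= 2,              # two Ethernet cards per cluster node
        e >= 1,                # at least one external network
    ]
    return all(checks)

# Example: the deployment of Figure 57 (two masters, four traffic nodes,
# two specialized storage nodes, two LANs, two routers).
assert valid_has_topology(m=2, t=4, s=2, l=2, r=2, eth=2, e=1, d=4)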

4.16 The Physical Storage Model of the HAS Architecture

The storage model of the HAS architecture aims to meet two essential requirements. The first

requirement is for the storage to be highly available, ensuring that data is always available to cluster

users and applications. This requirement enables tolerance of single storage unit failure as mirrored

data is hosted on at least two physical units; it also allows node failure tolerance, as mirrored data can

be accessed form different nodes. The second requirement is to have high throughput and minimum

access delay. Based on these requirements, we propose three possible physical storage access models

for the HAS architecture: using the HA implementation of the NFS, using specialized shared storage,

or using local node storage. The following sub-sections explore these storage access methods and

discuss their architectures and redundancy models.

4.16.1 Without Shared Storage

Figure 58 illustrates the no-shared storage model. This model offers a limited number of advantages;

however, we list it for completeness. Following this model, cluster nodes use their local storage to

store configuration and application data. While the no-shared model offers low cost, it comes at the

expense of not being able to use diskless nodes. In addition, application data is not replicated since

state information is saved on each node where the transaction took place. If a node becomes

unavailable, the data that resides on this node becomes unavailable too. As a result, the no-shared

storage model is limited and poses restrictions on cluster nodes and their applications.


Figure 58: The no-shared storage model

However, it is worth mentioning that other research projects (Section 2.10.5) have adopted this

model as their preferred way of handling data and dividing it across multiple traffic nodes [82].

Following their architectures, a traffic node receives a connection only if it has the data stored locally.

4.16.2 Shared Storage with Distributed File Systems

Distributed file systems enable file system access from a client through the network. This model

enables cluster node location transparency as each cluster node has direct access to the file system

data and it is suitable for applications with high storage capacity and access performance

requirements.

Figure 59 illustrates the physical model of the HAS architecture using shared storage. In this example,

cluster nodes rely on the networked storage available through the master nodes. Master nodes run the

modified implementation of the NFS server (Section 6.1.9) in which two NFS servers provide

redundant shared storage to cluster nodes. In the event of a failure of one of the NFS servers, the other

NFS server continues to provide access to data. We contributed this mechanism with a modified

implementation of the mount program (Section 6.1.9) that allows us to mount two NFS servers at the same mount point, where the client data resides.


Figure 59: The HAS storage model using a distributed file system

Figure 60 illustrates how the HAS architecture achieves NFS server redundancy. In Figure 60 (A), master-a is the name of the Master Node A server, and master-b is the name of the Master Node B server. Both master nodes run the modified HA version of the network file system server. Using the modified mount program, we mount a common storage repository on both master nodes:

% mount -t nfs master-a,master-b:/mnt/CommonNFS

Figure 60: The NFS server redundancy mechanism


(Figure 60 (A) shows the storage view from outside the cluster: a single storage repository, /mnt/CommonNFS, available through the two NFS servers. Figure 60 (B) shows the primary NFS server on Master Node A and the secondary NFS server on Master Node B, with changes in contents synchronized by the rsync utility.)

When the rsync utility detects a change in the contents, it performs the synchronization to ensure that

both repositories are identical. If the NFS server on master-a becomes unavailable, data requests to

/mnt/CommonNFS will not be disturbed because the secondary NFS server on master-b is still running

and hosting the /mnt/CommonNFS network file system.

4.16.2.1 Synchronization of Shared Storage

The rsync utility is open source software that provides incremental file transfer between two sets of

files across the network connection, using an efficient checksum-search algorithm [120]. It provides a

method for bringing remote files into sync by sending just the differences in the files across the

network link [121].

The rsync utility can update whole directory trees and file systems, preserves symbolic links, hard

links, file ownership, permissions, devices and times, and uses pipelining of file transfers to minimize

latency costs. It uses ssh or rsh for communication, and can run in daemon mode, listening on a

socket, which is used for public file distribution.

We used the rsync utility to synchronize data on both NFS servers running on the two master nodes in

the HA tier of the HAS architecture.
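As an illustration of how such a synchronization can be driven, the call below pushes incremental changes from the primary repository to the secondary one over ssh; the options and paths are assumptions made for illustration and are not the exact configuration used in the prototype.

import subprocess

def sync_nfs_repositories(source="/mnt/CommonNFS/",
                          destination="master-b:/mnt/CommonNFS/"):
    # -a preserves permissions, ownership, links and timestamps; --delete
    # removes files that no longer exist on the source; -e ssh tunnels the
    # transfer over ssh.
    subprocess.run(["rsync", "-a", "--delete", "-e", "ssh",
                    source, destination], check=True)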

4.16.2.2 Distributed Replicated Block Device

DRBD (Distributed Replicated Block Device) is an alternative to rsync for providing highly available storage. DRBD is system software that acts as a block device. It operates by mirroring a whole block device via the network and provides the synchronization between the data storage on both master nodes [122]. DRBD takes the data, writes it to the local disk, and sends it to the other storage server. DRBD is thus responsible for carrying out the synchronization between the two independently running NFS servers on two separate nodes.

Figure 61 illustrates data replication with DRBD. If the active node fails, the heartbeat switches the secondary device into the primary state and starts the application there. If the failed node becomes available again, it becomes the new secondary node and synchronizes its NFS content with the primary

primary master node. This synchronization takes place as a background process and does not interrupt

the service.


Figure 61: DRBD disk replication for two nodes in the 1+1 active/standby redundancy model

The DRBD utility provides intelligent resynchronization as it only resynchronizes those parts of the

device that have changed, which results in less synchronization time. It grants read-write access only

to one node at a time, which is sufficient for the usual fail-over HA cluster.

The drawback of the DRBD approach is that it does not work when we have two active nodes

because of possibly concurrent writes to the same block. If more than one node concurrently modifies the distributed devices, we face the problem of deciding which part of the device is up-to-date on which node, and which blocks we need to resynchronize in which direction.

4.16.3 Storage with Specialized Storage Nodes

The HAS architecture allows the addition of specialized storage nodes that provide shared storage to

the cluster nodes. Figure 62 illustrates the physical model of the architecture using a storage area

network provided by two redundant specialized storage nodes. Following this model, storage for

application data is hosted on both of the specialized storage nodes. The traffic nodes can use their

private disks to maintain configuration and system files. We did not experiment with providing storage using specialized storage nodes.


Figure 62: A HAS cluster with two specialized storage nodes

4.17 Types and Characteristics of the HAS Cluster Nodes

A cluster is a dynamic entity consisting of a number of nodes configured to be part of the cluster.

Nodes can join or leave the cluster at any time. Sections 4.27.3 and 4.27.8 describe the sequence

diagrams for a node joining and leaving our HAS architecture prototype. A cluster node is the logical

representation of a physical node. A node is an independent compute entity that is a member of a

cluster and resides on a common network. It communicates with all the cluster nodes through two

local networks, enabling tolerance of a single local network failure. A cluster node can be a single

or dual processor machine or an SMP machine. An independent copy of the operating system

environment usually characterizes each cluster node. However, some cluster nodes share a single boot

image from the central, shared disk storage unit or an image server.

There are three types of nodes in the HAS architecture: master, traffic and storage nodes. The

following sub-sections present the characteristics of each node type and discuss its responsibilities

and roles within the cluster.

4.17.1 Master Nodes

Master nodes reside in the HA tier of the HAS architecture. They have access to persistent storage,

and depending on the specific deployment, they provide shared disk access to traffic nodes and host

cluster configuration, management and application data. Master nodes have local disk storage, unlike

traffic nodes that can be diskless.


Figure 63 illustrates the software and hardware stack of a master node in the HAS architecture.

Figure 63: The master node stack

Master nodes provide an IP layer abstraction hiding all cluster nodes and provide transparency

towards the end user. Master nodes have a direct connection to external networks. They do not run

server applications; instead, they receive incoming traffic through the cluster virtual IP interface and

distribute it to the traffic nodes using the traffic manager and a dynamic distribution mechanism

(Section 4.23). Master nodes provide cluster-wide services for the traffic nodes such as DHCP server,

IPv6 router advertisement, time synchronization, image server, and network file server. Master nodes

run a redundant and synchronized copy of the DHCP server, which implements a communications protocol that allows network administrators to centrally manage and automate the assignment of IP addresses. The

configuration files of this service are available on the HA shared storage. The router advertisement

daemon (radvd) [123] runs on the master nodes and sends router advertisement messages to the local

Ethernet LANs periodically and when requested by a node sending a router solicitation message.

These messages are specified by RFC 2461 [124], Neighbor Discovery for IP Version 6, and are

required for IPv6 stateless autoconfiguration. The time synchronization server, running on the master

nodes, is responsible for maintaining a synchronized system time. In addition, master nodes provide

the functionalities of an image server.
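As an illustration, a minimal radvd configuration for such router advertisements could look like the following sketch; the interface name and the IPv6 prefix are hypothetical and would be adapted to the cluster LANs.

# /etc/radvd.conf -- illustrative sketch only
interface eth0
{
    AdvSendAdvert on;          # enable periodic router advertisements
    MinRtrAdvInterval 3;
    MaxRtrAdvInterval 10;
    prefix 2001:db8:1::/64     # hypothetical prefix advertised to the LAN
    {
        AdvOnLink on;
        AdvAutonomous on;      # allow stateless autoconfiguration
    };
};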

When the SSA tier consists of diskless traffic nodes, there is a need for an image server to provide

operating system images, application images, and configuration files. The image server propagates

this data to each node in the cluster and solves the problem of coordinating operating system and

application patches by putting in place and enforcing policies that allow operating system and

software installation and upgrade on multiple machines in a synchronized and coordinated fashion.

Master nodes can optionally provide this service. In addition, master nodes provide shared storage via


a modified, highly available version of the network file server. We also prototyped a modified mount

program to allow master nodes to mount multiple servers over the same mounting point. Master

nodes can optionally provide this service.
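For example, with a standard (unmodified) NFS setup, a master node could export a shared directory and a traffic node could mount it along the following lines; the path, host name, and subnet are illustrative only.

# /etc/exports on a master node -- illustrative values
/cluster/data    192.168.1.0/24(rw,sync,no_root_squash)

# mounting the export from a traffic node
mount -t nfs master-a:/cluster/data /mnt/cluster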

4.17.2 Traffic Nodes

Traffic nodes reside in the SSA tier of the HAS architecture. Traffic nodes can be of two types: either

with disk or diskless. Traffic nodes rely on cluster storage facilities for application data. Diskless

traffic nodes rely on the image server to provide them with the needed operating system and

application images to run and all necessary configuration data.

Figure 64 illustrates the software and hardware stack of a traffic node in the HAS architecture.

Figure 64: The traffic node stack (Ethernet interconnect over TCP/UDP/IP, IPv4 and IPv6; the Linux operating system; the TCD, LDirectord, and Ethd system software; and the Apache web server software)

Traffic nodes run the Apache web server application. They reply to incoming requests, forwarded to

them by the traffic manager running on the master nodes. Each traffic node runs a copy of the traffic

client, LDirectord, and the Ethernet redundancy daemon.

Traffic nodes rely on cluster storage to access application data and configuration files, as well as for

cluster services such as DHCP, FTP, NTP, and NFS services. Traffic nodes have the option to boot

from the local disk (available for nodes with disks), the network (two networks for redundancy

purposes), flash disk (for CompactPCI architectures), CD-ROM, DVD-ROM, or floppy. The

default booting method is through the network. Traffic nodes also run the NTP client daemon, which

continually keeps the system time in step with the master nodes. With the HAS prototype, we

experimented with booting traffic nodes from the local disk, the network, and from a flash disk, which we mostly used for troubleshooting purposes.
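As a sketch of the default network boot path, the DHCP service on the master nodes could point diskless traffic nodes to the image server with directives similar to the following; all addresses and the boot file name are hypothetical.

# dhcpd.conf excerpt -- illustrative sketch
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.50 192.168.1.100;
    option routers 192.168.1.1;
    next-server 192.168.1.10;       # hypothetical image server address
    filename "pxelinux.0";          # network boot loader served to traffic nodes
}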


4.17.3 Storage Nodes

Cluster storage nodes provide storage that is accessible to all cluster nodes. Section 4.16 presents the

physical storage model of the HAS architecture.

4.18 Local Network Access

Cluster nodes are interconnected using redundant dedicated links or local area networks (LAN). All

nodes in the cluster are visible to each other. We assume, following the physical model, that all

cluster nodes are one routing hop from each other. We achieve redundancy by using two local

networks to interconnect all the nodes of the cluster. All nodes have two Ethernet adapters, with each

connected to a separate LAN.
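For instance, each node could be configured with one interface per LAN on two separate subnets, as in the following sketch; the device names and addresses are illustrative only.

# /etc/sysconfig/network-scripts/ifcfg-eth0 -- LAN 1 (illustrative)
DEVICE=eth0
IPADDR=192.168.1.21
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1 -- LAN 2 (illustrative)
DEVICE=eth1
IPADDR=192.168.2.21
NETMASK=255.255.255.0
ONBOOT=yes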

Figure 65 illustrates the local network access model where each cluster node connects to both LAN1

and LAN2; LAN1 and LAN2 are separate local area networks with their own subnet or domain.

Figure 65: The redundant LAN connections within the HAS architecture

This connectivity model ensures highly available access to the network and prevents the network from being a SPOF. The HAS architecture supports both Internet Protocols, IPv4 and IPv6. Supporting IPv4 does not imply additional implementation considerations. However, supporting IPv6 requires a router advertisement daemon that is responsible for the automatic configuration of IPv6

Ethernet interfaces. The router advertisement daemon also acts as an IPv6 router: sending router

advertisement messages, specified by RFC 2461 [124] to a local Ethernet LAN periodically and when


requested by a node sending a router solicitation message. These messages are required for IPv6

stateless autoconfiguration. As a result, in the event we need to reconfigure networking addressing for

cluster nodes, this is achievable in a transparent fashion and without disturbance to the service

provided to end users.

4.19 Master Nodes Heartbeat

It is necessary to detect when master nodes fail and when they become available again. Heartbeat is

the system software that provides this functionality in the HAS architecture prototype, where it is used as an internal component. Heartbeat is an open source project [125] that provides system software running on both master nodes and communicating over multiple network paths for redundancy. Its goal is to ensure that master nodes are alive by sending heartbeat packets to each other

as defined in its configuration file. The heartbeat component is a low-level component that monitors

the presence and health of master nodes in the HA tier of a HAS architecture by sending heartbeat

packets across the network to the other instances of heartbeat running on other master nodes as a sort

of keep-alive message.

The heartbeat program takes the approach that the keep-alive messages, which it sends, are a specific

case of the more general cluster communications service [126]. In this sense, it treats cluster

membership as joining the communication channel, and leaving the cluster communication channel as

leaving the cluster. Heartbeat itself acts similarly to a cluster-wide init daemon, making sure each of

the services it manages is running at all times. When a master node stops receiving the heartbeat

packets, it assumes that the other master node died; as a result, the services the primary master node

was providing are failed over to the standby master node.

Figure 66 illustrates the heartbeat topology. The HA tier consists of two master nodes following the

1+1 redundancy model. Heartbeat supports both configurations of the 1+1 redundancy model: the

active/standby and active/active configurations. Each master node sends heartbeat and administrative

messages to the other master node as broadcasts.


Figure 66: The topology of the heartbeat Ethernet broadcast

With heartbeat, master nodes are able to coordinate their role (active and standby) and track their

availability. Heartbeat discussions are presented in [20], [127], and [128].
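Heartbeat itself is configured through a small set of files; a minimal sketch, assuming hypothetical node names master-a and master-b and reusing the illustrative virtual address from Section 4.20.1, could look as follows.

# /etc/ha.d/ha.cf -- illustrative sketch
keepalive 2            # send a heartbeat packet every 2 seconds
deadtime 30            # declare the peer dead after 30 seconds of silence
bcast eth0 eth1        # broadcast heartbeats over both redundant LANs
auto_failback on
node master-a
node master-b

# /etc/ha.d/haresources -- illustrative sketch
master-a 192.68.69.30 ldirectord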

4.20 Traffic Nodes Heartbeat using the LDirectord Module

In the same way that the heartbeat mechanism (Section 4.19) checks the availability of the master nodes

in the HA tier, we need to provide a mechanism to verify that traffic nodes in the SSA tier are up and

running, and providing service to web clients. We use two methods to ensure that traffic nodes are

available and providing services. The first method is the keep-alive mechanism integrated in the

traffic distribution scheme discussed in Section 4.23. Each traffic node reports its load to the traffic

managers running on the master nodes every x seconds (x is a configuration parameter). This

continuous communication ensures that the traffic managers are aware of all available traffic nodes.

In the event that this communication is disrupted, after a pre-defined time out the traffic manager

removes the traffic node from its list of available traffic nodes. This communication ensures that the

traffic client is alive, and that it is reporting the load index of the traffic node to the traffic manager.

Section 4.23 discusses this mechanism.

The second method for traffic nodes heartbeat is an application check whose goal is to ensure that the

application server running on the traffic node is available and running. The application check relies


on the Linux Director daemon (LDirectord) to monitor the health of the applications running on the

traffic nodes. Each traffic node runs a copy of the LDirectord daemon.

The LDirectord daemon performs a connect check of the services on the traffic nodes by connecting

to them and making an HTTP request to the communication port where the service is running. This

check ensures that it can open a connection to the web server application. When the application check

fails, the LDirectord connects to the traffic manager and sets the load index of that specific traffic

node to zero. As a result, existing connections to the traffic node may continue; however, the traffic manager stops forwarding new connections to it. Section 4.27.12 discusses this scenario. This method

is also useful for gracefully taking a traffic node offline.

4.20.1 Sample LDirectord Configuration

The LDirectord module loads its configuration from the ldirectord.cf configuration file, which

contains the configuration options. An example configuration file is presented below. It corresponds

to a virtual web server available at address 192.68.69.30 on port 80, with round robin distribution

between the two nodes: 142.133.69.33 and 142.133.69.34.

1 # Global Directives
2 Checktimeout = 10
3 Checkinterval = 2
4 Autoreload = no
5 Logfile = "local0"
6 Quiescent = yes
7
8 # Virtual Server for HTTP
9 Virtual = 192.68.69.30:80
10 Fallback = 127.0.0.1:80
11 Real = 142.133.69.33:80 masq
12 Real = 142.133.69.34:80 masq
13 Service = http
14 Request = "index.html"
15 Receive = "Home Page"
16 Scheduler = rr
17 Protocol = tcp
18 Checktype = negotiate

Once the LDirectord module starts, the virtual server table in the kernel is populated. The capture below shows the output of the ipvsadm command-line tool, which is used to set up, maintain, or inspect the virtual server table in the Linux kernel. The listing illustrates the


virtual server session, with the virtual address on port 80, and the two hosts providing this virtual

service.

% ipvsadm -L -n
IP Virtual Server version 1.0.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward  Weight  ActiveConn  InActConn
TCP  192.68.69.30:80 rr
  -> 142.133.69.33:80             Masq     1       0           0
  -> 142.133.69.34:80             Masq     1       0           0
  -> 127.0.0.1:80                 Local    0       0           0

By default, the LDirectord module uses the quiescent feature to add and remove traffic nodes. When a

traffic node is to be removed from the virtual service, its weight is set to zero and it remains part of

the virtual service. As such, existing connections to the traffic node may continue, but the traffic node

is not allocated any new connections. This mechanism is particularly useful for gracefully taking real

servers offline. This behavior can be changed to remove the real server from the virtual service by

setting the global configuration option quiescent=no.
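In terms of the virtual server table, the quiescent behavior amounts to a weight change; for example, a command of the following form (reusing the illustrative addresses above) sets the weight of one real server to zero without removing it from the table.

% ipvsadm -e -t 192.68.69.30:80 -r 142.133.69.33:80 -m -w 0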

4.21 CVIP: A Cluster Virtual IP Interface for the HAS Architecture

One of the challenges is to present the cluster as a single entity to the end users. To overcome the

disadvantages of existing solutions presented in Sections 2.9 and 2.10, the HAS architecture provides a

transparent and scalable interface between the Internet and the cluster, called the cluster virtual IP interface. It is a fault-tolerant and scalable interface between the Internet and the HAS architecture that provides a single entry point to the HAS cluster. It should not have an impact on web clients,

applications and the existing network infrastructure. It should not present a SPOF, have minimal

impact on performance, and have minimal impact in case of failure. In addition, it should be scalable

to allow a large number of transactions (virtually an unlimited number) without posing a bottleneck,

and to allow the increase of the number of master nodes and traffic nodes independently.

Furthermore, it should support IPv4 and IPv6 and be application independent. One essential requirement is that the CVIP interface should provide a single entry point to the cluster while supporting multiple VIP addresses.

As a result, the CVIP is a fault-tolerant and scalable method of interfacing a plurality of application

servers running on the HAS cluster. The CVIP framework includes a plurality of network

terminations (master servers in the HAS architecture) that receive incoming data packets from the

Internet, and a plurality of forwarding processes (traffic managers) that are associated with the


network terminations. The following sub-sections present the CVIP framework and discuss the

architecture and the various concepts.

4.21.1 The Basic Architecture

Figure 67 illustrates the generic configuration of the CVIP interface and the CVIP framework. It

presents a three layer architecture that consists of server application software (i.e. the web server

software) running on the traffic nodes, the master nodes that provide IP services, and the network

cards on the master nodes connecting the CVIP interface towards external networks and the Internet.

We achieve network bandwidth scalability by increasing the number of network terminations. We

achieve capacity scalability by increasing the number of servers in the SSA tier.

Figure 67: The CVIP generic configuration

Figure 68 illustrates the level of distribution within the HAS cluster using the CVIP as the interface

towards the outside networks. There exist two distribution points: the network termination, which

distributes packets based on IP address to the correct master (or front-end) node, and the traffic manager

that distributes network connections to the applications on the traffic node. Section 4.21.1.1 discusses

the network termination concept.


Figure 68: Level of distribution

4.21.1.1 The Concept of Network Termination

Figure 69 illustrates the concept of a network termination. A network termination is the last stop on

the network for a given connection (data packet) before it is forwarded for processing inside the HAS

architecture. The network termination of a web request packet is the network card interface on the

master node in the HA tier of a HAS cluster. Each network termination carries its own IP address and

can be addressed directly. These addresses are published via RIP/OSPF/BGP to the routers indicating

them as gateway addresses for the CVIP address; this means that routers see each termination just as

another router.
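As an illustration only, and assuming the Quagga/Zebra routing suite on a network termination, the announcement could be configured along these lines; the router ID, subnet, and area are hypothetical.

! /etc/quagga/ospfd.conf -- illustrative sketch
! "redistribute connected" announces locally configured addresses, including the CVIP
router ospf
 ospf router-id 192.168.1.21
 redistribute connected
 network 192.168.1.0/24 area 0.0.0.0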


Figure 69: Network termination concept

4.21.2 The CVIP as a Framework

The CVIP is an interfacing method and a framework that provides high fault tolerance and close to

linear scalability of the servers and the network interfaces. The framework is transparent to the web

clients and to the HAS cluster servers, and has minimal impact on the surrounding network

infrastructure. The CVIP operates with applications that run on top of IP.

Figure 70 illustrates the CVIP framework. The CVIP interface receives incoming data packets from

the Internet or a packet data network and passes packets to the traffic manager (TMD) running on an

IP server (i.e. a master node) in the HAS architecture. The IP server then forwards the request to an

already existing connection or a traffic distribution mechanism is invoked to determine the next

traffic node where a request will be sent. On a master node, the traffic manager is the entity

responsible of forwarding traffic to application servers running on traffic nodes in the SSA tier of the

HAS architecture based on a defined traffic distribution mechanism that is executed by the traffic

manager. The outgoing traffic takes a different path. The reply to the request goes directly from the

application server running on the traffic node to the web client, issuer of the original request.


Figure 70: The CVIP framework

4.21.3 Advantages of the CVIP

The following sub-sections present the CVIP advantages such as a single point of entry to the cluster,

scalability and high availability advantages, transparency and support for multiple application servers.

4.21.3.1 Transparency and single entry point to the cluster

CVIP is transparent to the applications running on traffic nodes and web clients are not aware of it.

CVIP supports multiple protocols; applications that use TCP, UDP, and raw IP sockets are able to

use it transparently. CVIP hides the cluster internals from the users and makes the cluster visible to

the outside world as a single entity through a virtual IP address. For the outside world, service is

available through a certain web address, which is the virtual cluster IP address that masks behind it

the IP addresses of the master nodes. It allows access to a cluster of processors via a single IP address.

We can also define a number of CVIP addresses in a system and achieve communication between

web clients and applications using different CVIP addresses in the cluster.


4.21.3.2 Scalable

The CVIP offers a unique scalability advantage. We can increase network terminations, master nodes,

or traffic nodes independently, without affecting how the cluster is presented to the outside

world.

With CVIP, we can cluster multiple servers to use the same virtual IP address and port numbers over

a number of processors to share the load. As we add new nodes in the HA and SSA tiers, we increase

the capacity of the system and its scalability. The number of clients or servers using the virtual IP

address is not limited. The framework is scalable and we can add more servers to increase the system

capacity. In addition, although we have only presented HTTP servers, the applications on top may

include any server application that runs on IP, such as an FTP server for file transfer.

4.21.3.3 Fault tolerance

The interface hides errors that take place on the HAS nodes and it is always available. Any crash in

the application server is transparent to end users. In the event the web server software crashes, only

1/N of the ongoing transactions are lost (only if there is no connection synchronization between

master nodes). All ongoing transactions can be saved and the state information can be preserved.

4.21.3.4 Availability

Since CVIP supports multiple servers, it does not introduce a SPOF. In the HAS architecture prototype,

CVIP was provided on two master nodes in the HA tier. If one master node crashes, the web clients

and web servers are not affected.

4.21.3.5 Dynamic connection distribution

The framework supports a traffic distribution mechanism discussed in Section 4.23 that provides web

clients with access to application servers running on traffic nodes in a transparent way. When the

CVIP receives incoming traffic, it is possible to choose between two different distribution algorithms:

the round robin distribution mechanism or the HAS distribution mechanism that provides dynamic

and high performance distribution (discussed in 4.23).


4.21.3.6 Support for multiple application servers

Since the CVIP interface operates at the IP level and is transparent to application servers running on traffic nodes, it is independent of the type of traffic that it accepts and forwards. As a result,

with CVIP, the HAS architecture supports all types of application servers that work at IP level.

4.21.4 CVIP Discussion

When the HA tier is configured for the 1+1 active/standby redundancy model, incoming connections

to the HAS cluster arrive at the active master node, owner of the CVIP address. When the HA tier is

configured for the 1+1 active/active redundancy model, incoming connections arrive at the elected

master node owner of the CVIP address, and then are distributed among the master nodes using the

saru module.

With the CVIP framework, we can start as many web servers as needed to handle increased traffic

with the entire set of web servers serving the same virtual IP address. The routed process includes a

local routing table that contains a list of IP addresses that can be used to reach specific client IP

addresses. The routed process contains information that is global to all processors, but it is also

available on each processor through local instances of the routed process.

The cluster IP interface operates at the IP level, enabling applications that run on top of IP to access

transparently application servers running on traffic nodes and gain scalability and high fault tolerance.

Whenever we add a new network card to a master node, we define its own local IP address. Through

the OSPF routing protocol, the rest of the IP network is informed that this network interface can be

used to receive CVIP traffic.

4.22 Connection Synchronization

When the HA tier follows the 1+1 active/standby redundancy model, the cluster achieves a higher

availability than a single node tier, since there are two master nodes, one is active and the second is

standby. When the active master node fails, the standby master node automatically takes over the IP

address of the virtual service, and the cluster continues to function. However, when a fail-over occurs

at the active master node, ongoing connections that are in progress terminate because the standby

master node does not know anything about them. To improve this situation and to prevent the loss of

ongoing connections, there is a need to provide the capabilities of synchronizing the ongoing

connections information between the active master node and the standby master node. As a result, the


cluster can minimize and eliminate the situation of lost connections caused by the failure of an active

master node. When the information of ongoing connections is synchronized between master nodes,

then if the standby master node becomes the active master node, it retains the information about the

currently established and active connections, and as a result, the new active master node continues to

forward their packets to the traffic nodes in the SSA tier.

4.22.1 The Challenge of Connection Synchronization

When a master node receives a packet for a new connection, it allocates the connection to a traffic

node. This allocation is effected by allocating an ip_vs_conn structure in the Linux kernel, which

stores the source address and port, the address and port of the virtual service, and the traffic node

address and port of the connection. Each time a subsequent packet for this connection is received, this

structure is looked up, and the packet is forwarded accordingly. This structure is also used to reverse

the translation process, which occurs when NAT is used to forward a packet and to store persistence

information. Persistence is a feature whereby subsequent connections from the same web user are

forwarded to the same traffic node.

When fail-over occurs, the new master node does not have the ip_vs_conn structures for the active

connections. Therefore, when a packet is received for one of these connections, the new master node

does not know which real server to send it to. As a result, the connection breaks and the web user

needs to reconnect. By synchronizing the ip_vs_conn structures between the master nodes, this situation can be avoided, and connections can continue after a master node fails over.

4.22.2 The Master/Slave Approach to Connection Synchronization

The connection synchronization code relies on a sync-master/sync-slave setup where the sync-master

sends synchronization information and the sync-slave listens. The following example illustrates this

scenario [129]. There are two master nodes, master-A and master-B. Master-A is the sync-master and

the active master node. Master-B is the sync-slave and the standby master node.

In step 1 (Figure 71), the web user opens connection-1. Master node A receives this connection,

forwards it to a traffic node, and synchronizes it to the sync-slave, master node B.


Figure 71: Step 1 - Connection Synchronization

In step 2 (Figure 72), a fail-over occurs and the master node B becomes the active master node.

Connection-1 is able to continue because the connection synchronization took place in step 1.

Figure 72: Step 2 - Connection Synchronization

The master/slave implementation of the connection synchronization works with two master nodes: the

active master node sends synchronization information for connections to the standby master node,

and the standby master node receives the information and updates its connection table accordingly.

The synchronization of a connection takes place when the number of packets passes a predefined

threshold and then at a certain configurable frequency of packets. The synchronization information

for the connections is added to a queue and periodically flushed. The synchronization information for

up to 50 connections can be packed into a single packet that is sent to the standby master node using

multicast. A kernel thread, started through an init script, is responsible for sending and receiving

synchronization information between the active and standby master nodes.
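In a standard LVS setup, these synchronization daemons can be started from an init script with ipvsadm, for example as follows; the multicast interface name is illustrative, and LVS refers to the sync-slave role as "backup".

# on the active master node (sync-master)
ipvsadm --start-daemon master --mcast-interface eth0

# on the standby master node (sync-slave)
ipvsadm --start-daemon backup --mcast-interface eth0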

4.22.3 Drawbacks of the Master/Slave Approach

After experimenting with the master/slave approach for synchronizing connections, we realized that it

suffers from a drawback that is discoverable when the new master node (previously standby) fails and


the current standby node (previously active) becomes active again. To illustrate this drawback, we

continue discussing the example of connection synchronization from the previous section.

In Step 3 (Figure 73), a web user opens connection-2. Master node B receives this connection, and

forwards it to a traffic node. Connection synchronization does not take place because master node B

is a sync-slave.

Figure 73: Step 3 - Connection Synchronization

In step 4 (Figure 74), another fail-over takes place and master node A is again the active master node.

Connection-2 is unable to continue because it was not synchronized.

Figure 74: Step 4 - Connection Synchronization

4.22.4 Alternative Approach: Peer-to-Peer Connection Synchronization

The master/slave approach to connection synchronization is not optimum and therefore there is a need

for a different approach that eliminates the drawback in the existing implementation. Figure 75

illustrates the peer-to-peer approach. In this approach, each master node sends synchronization

information for connections that it is handling to the other master node. Therefore, in the scenario

illustrated in Figure 74, connections are synchronized from master node B to master node A;

connection-2 is able to continue after the second fail-over when master node A becomes the active

master node again. The implementation of this approach is a future work item.


Figure 75: Peer-to-peer approach

4.23 Traffic Management

Traffic management is an important aspect of the HAS architecture that contributes to building a

highly available and scalable web cluster. This section presents the requirements for a flexible and

scalable traffic distribution inside a web cluster. We discuss the needs of the HAS architecture for a

traffic management scheme and present the HAS architecture solution, which consists of a traffic manager running on master nodes enforcing a distribution policy and a traffic client that runs on traffic nodes and reports the load index of its node, providing a dynamic feedback cycle.

4.23.1 Background and Requirements

Traffic distribution is the process of distributing network traffic across a set of server nodes to

achieve better resource utilization, greater scalability, and high availability. Scalability is an important

factor because it ensures a rapid response to each network request regardless of the load. Availability

ensures the service continues to run despite failure of individual server nodes. Traffic distribution can

be a passive or an active technology depending on the specific implementation. Some traffic distribution

schemes do not modify network requests, but pass them verbatim to one of the cluster nodes and

return the response verbatim to the client. Other schemes, such as NAT, change the request headers and force the reply to pass through an intermediate server before it reaches the end user.

There exist many interesting aspects of a traffic distribution implementation that we considered when

designing and prototyping the HAS architecture traffic distribution scheme. These aspects include

optimizing response time, coping with failures of individual traffic nodes, supporting heterogeneous traffic nodes, ensuring the distribution service remains available, supporting session persistence, and ensuring transparency so that users are not aware of where the application is hosted or whether the server is a cluster or a single server.


Our survey of similar works (Sections 2.9 and 2.10) has identified that the added performance from

complex algorithms is negligible. The recommendations were to focus on a distribution algorithm that

is uncomplicated, has a low overhead, and that minimizes serialized computing steps to allow for

faster execution.

Scalable web server clusters require three core components: a scheduling mechanism, a scheduling

algorithm, and an executor. The scheduling mechanism directs clients’ requests to the best web

server. The scheduling algorithm defines the best web server to handle the specific request. The

executor carries out the scheduling algorithm using the scheduling mechanism. The following sub-

sections present these three core components in the HAS architecture.

4.23.2 Traffic Management in the HAS Architecture

Traffic management is a core technology that enables better scalability in the HAS architecture. It

consists of three elements that work together to achieve efficient and dynamic traffic distribution

among cluster traffic nodes. These three elements are the traffic manager running on the master

nodes, the traffic client daemons running on traffic nodes, and the traffic distribution policy. Traffic

nodes continuously report to the traffic managers their availability and their load index. The traffic

manager distributes incoming traffic based on the distribution policy.

When a request, such as that for a particular page of a website, arrives at a master node, depending on

the policy defined in the configuration file of the traffic manager, the traffic manager forwards the

request to the appropriate traffic node. This scheme follows the LVS direct routing method, where the

traffic nodes and the master nodes have one of their interfaces physically linked by a hub or a

switch. It is also the same approach as the L4/2 clustering method (Section 2.6.1).

When a user accesses a virtual service provided by the server cluster and a packet destined for the virtual IP address arrives, the traffic manager forwards it to the appropriate traffic node, which services the request and replies directly to the user.
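For illustration, a direct-routing virtual service of this kind could be set up with ipvsadm as sketched below, reusing the illustrative addresses from Section 4.20.1; the -g option selects the direct routing (gatewaying) forwarding method.

% ipvsadm -A -t 192.68.69.30:80 -s rr
% ipvsadm -a -t 192.68.69.30:80 -r 142.133.69.33:80 -g -w 1
% ipvsadm -a -t 192.68.69.30:80 -r 142.133.69.34:80 -g -w 1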

4.23.3 Traffic Manager

The traffic manager runs on master nodes. It receives load announcements from the traffic client

daemons running on traffic nodes, and updates its internal list of traffic nodes and their load. It

maintains a list of available traffic nodes and their load index. The traffic manager distributes

incoming traffic to the available traffic nodes based on their load index. The traffic manager requires


a configuration file that lists the addresses of all traffic nodes, the traffic distribution policy, the communication port, the timeout limit, and the addresses of the master nodes.

4.23.3.1 Sample Traffic Manager Configuration File

This section presents a sample configuration file of a traffic manager daemon.

1 # TMD Configuration File

2 # List of master nodes

3 master1 <IP Address of Master Node 1>

4 master2 <IP Address of Master Node 2>

5

6 # Active traffic nodes

7 # List the IP addresses of active traffic nodes

8 traffic1 <IP Address of Traffic Node 1>

9 traffic2 <IP Address of Traffic Node 2>

10 traffic3 <IP Address of Traffic Node 3>

11 traffic4 <IP Address of Traffic Node 4>

12

13 # Distribution mechanism (either HAS or RR)

14 distribution HAS

15

16 # Standby traffic nodes - Used when in the NxM redundancy model

17 # If redundancy model is not NxM, then leave empty

18 # List the IP addresses of standby traffic nodes

19 <IP Address of Traffic Node 1>

20 <IP Address of Traffic Node 2>

21 <IP Address of Traffic Node M>

22

23 # Port number

24 port <port_number>

25

26 # Reporting errors – needed for troubleshooting purposes

27 ErrorLog <full_path_to_error_log_file>

28

29 # The Timeout option specifies the amount of time in ms the TMD

30 # waits to receive load info from TCD. Timeout should be > than

31 # the update frequency of the TCD

32 # See TCD configuration file for TCD specific information


33 timeout <timeout_value>

4.23.4 The Proc File System

The /proc file system is a real time, memory resident file system that tracks the processes running

on the machine and the state of the system, and maintains highly dynamic data on the state of the

operating system. The information in the /proc file system is continuously updated to match the

current state of the operating system. The contents of the /proc file system are used by many

utilities which read the data from the particular /proc directory and display it.

The traffic client uses two parameters from /proc to compute the load_index of the traffic node:

the processor speed and free memory. The /proc/cpuinfo file provides information about the

processor, such as its type, make, model, cache size, and processor speed in BogoMIPS [128]. The

BogoMIPS parameter is an internal representation of the processor speed in the Linux kernel.

Figure 76 illustrates the contents of the /proc/cpuinfo file at a given moment in time and

highlights the BogoMIPS parameter used to compute the load_index of the traffic node. The

processor speed is a constant parameter; therefore, we only read the /proc/cpuinfo file once when

the TC starts.

Figure 76: The CPU information available in /proc/cpuinfo

% more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.70GHz
stepping        : 6
cpu MHz         : 598.186
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2
bogomips        : 1185.43


The /proc/meminfo file reports a large amount of valuable information about the RAM usage. The

/proc/meminfo file contains information about the system's memory usage such as current state of

physical RAM in the system, including a full breakdown of total, used, free, shared, buffered, and

cached memory utilization in bytes, in addition to information on swap space.

Figure 77 illustrates the contents of the /proc/meminfo file at a given moment in time and

highlights the MemFree parameter used to compute the load_index of the traffic node. Since the

MemFree is a dynamic parameter, it is read from the /proc/meminfo file every time the TC

calculates the load_index.
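For instance, the two values can be read from /proc with simple shell commands such as the following; this is only a sketch of the lookup, not the actual traffic client code, and the output values match the captures in Figures 76 and 77.

% grep -i bogomips /proc/cpuinfo
bogomips        : 1185.43

% grep MemFree /proc/meminfo
MemFree:          6880 kB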

Figure 77: The memory information available in /proc/meminfo

% more /proc/meminfo
MemTotal:        775116 kB
MemFree:           6880 kB
Buffers:          98748 kB
Cached:          305572 kB
SwapCached:        2780 kB
Active:          300348 kB
Inactive:        286064 kB
HighTotal:            0 kB
HighFree:             0 kB
LowTotal:        775116 kB
LowFree:           6880 kB
SwapTotal:      1044184 kB
SwapFree:       1040300 kB
Dirty:               16 kB
Writeback:            0 kB
Mapped:          237756 kB
Slab:            171064 kB
Committed_AS:    403120 kB
PageTables:        1768 kB
VmallocTotal:    245752 kB
VmallocUsed:      11892 kB
VmallocChunk:    232352 kB
HugePages_Total:      0
HugePages_Free:       0

4.23.5 Traffic Client

The traffic client is a daemon process that runs on each traffic node. It collects processor and memory information from the /proc file system and reports it to the traffic managers running on master nodes. The implementation of the traffic client requires a configuration file (Section 4.23.5.1) that lists the addresses of master nodes, the communication port, timeout limits, and the logging directive.


In the event of failure of the traffic client daemon, the traffic manager does not receive a load

notification and after a timeout, it removes the traffic node from its list of available traffic nodes. As a

result, no disruption of service results from the failure of the traffic client daemon. Section 4.27.9

discusses this scenario.

4.23.5.1 Sample Traffic Client Configuration File

This section presents a sample configuration file of a traffic client daemon.

1 # TC Configuration File

2 # List of master nodes to which the TC daemon reports load

3 master1 <IP Address of Master Node 1>

4 master2 <IP Address of Master Node 2>

5

6 # Port # to connect to at master - this port number can be anywhere

7 # between 1024 and 49151. We can also use ports 49152 through 65535

8 port <port_number>

9

10 # Frequency of load updates in ms.

11 updates <frequency_of_updates>

12

13 # Reporting errors -- needed for troubleshooting purposes

14 ErrorLog <full_path_to_error_log_file>

15

16 # Amount of RAM (in MB) in the node with the least RAM in the cluster

17 RAM <num_of_ram>

4.23.6 Characteristics of the Traffic Management Scheme

The traffic management scheme supports static distribution with round robin without taking into

consideration the load of traffic nodes. It also supports dynamic distribution that is configurable to

support various intervals of load updates. The HAS cluster can include computers with heterogeneous

hardware, such as different processor speeds and memory capacities. The traffic distribution

mechanism is aware of the variation in computing power since the processor speed, reported in

BogoMIPS, and the free memory, reported in MB, are factored in by the traffic manager when forwarding traffic

to the nodes. As a result, the master node assigns traffic to the traffic nodes based on their load index.

The traffic manager uses the following formula to calculate the load index of each traffic node:


Load_a = CPU_a * ( RAM_a / RAM_x )

The formula consists of the following variables:

• CPU_a is the BogoMIPS representation of the processor speed of traffic node a

• RAM_a is the amount of free RAM in MB of traffic node a, and

• RAM_x is the total amount of RAM in MB of traffic node x, where traffic node x is the node with

the least amount of RAM among all traffic nodes. The configuration file of the traffic client

specifies this parameter (Section 4.23.5.1).

The result is the relative load of traffic node a.

Let us consider an example of how the formula is applied. Consider a traffic node with a Pentium

Mobile processor running at a speed of 1.7 GHz with 748 MB of RAM. The BogoMIPS processor

speed is 3358.72. We assume that the node has 512 MB of free RAM. Let us consider as well that the

machine with the lowest amount of RAM has 256 MB of RAM. Therefore, the load of that machine

as reported by the traffic client daemon to the traffic manager is:

Load_a = 3358.72 * ( 524,288 / 262,144 ) ≈ 6,717

The traffic manager maintains a list of nodes and their loads. Figure 78 illustrates a case example of a

HAS cluster that consists of eight traffic nodes.

Figure 78: Example list of traffic nodes and their load index

Node IP Address     Load Index
192.168.1.100       4484    (least busy traffic node)
192.168.1.101       4397
192.168.1.102       4353
192.168.1.103       4295
192.168.1.104       4235
192.168.1.105       4190
192.168.1.106       4097
192.168.1.107       3973    (most busy traffic node)

When the traffic manager receives an incoming request, it examines the list of nodes and forwards the request to the least busy node on the list. The list of traffic nodes is a sorted linked list that allows us to maintain an ordered list of nodes without having to know ahead of time how many nodes we will


be adding. To build this data structure, we used two class modules: one for the list head and another

for the items in the list. The list is a sorted linked list; as we add nodes into the list, the code finds the

correct place to insert them and adjusts the links around the new nodes accordingly.

4.23.7 Example of Operation

This section describes a normal scenario of the traffic client reporting its load to the traffic

manager. Figure 79 illustrates this scenario.

Figure 79: Illustration of the interaction between the traffic client and the traffic manager

(1) The traffic client reads the /proc file system and retrieves the BogoMIPS processor speed from

/proc/cpuinfo, and then retrieves the amount of free memory from /proc/meminfo.

(2) The traffic client computes the node load index based on the formula in Section 4.23.6.

(3) The traffic client reports the load index to the traffic manager as a string that consists of a pair of

parameters: the traffic node IP address and the load index (traffic_node_IP, load_index).

(4) The traffic manager receives the load index and updates its internal list of traffic nodes to reflect

the new load_index of the traffic_node_IP.

(5) For illustration purposes, we assume that an incoming request reaches the virtual interface.


(6) The routed daemon forwards the request to the traffic manager. The traffic manager examines the

list of traffic nodes and chooses a traffic node as the target for this request.

(7) The traffic manager forwards the request to the traffic node.

(8) The web server running on the traffic node receives the request and retrieves the requested document from the distributed storage.

(9) The web server sends the requested document directly to the web user.

This scenario also illustrates several drawbacks, such as the number of steps involved and the communication overhead between the system software components. One of the future work items is to reduce the number of system software components, resulting in less communication overhead.

4.23.8 Support for Other Distribution Algorithms

The current traffic manager prototype supports the basic RR distribution and the HAS distribution.

The distribution method is specified in the configuration file of the traffic manager (Section 4.23.3.1):

# Distribution mechanism (HAS or RR, or others)
distribution <distribution-scheme>

As a result, when the traffic manager starts, it enforces the specified traffic distribution method. If the

distribution method is RR, then the traffic clients running on traffic nodes continue to report their

load; however, the list of traffic nodes will not be kept sorted based on the load of the traffic nodes.

Instead, the table remains static, and reflects the list of traffic nodes available in the HAS cluster. The

traffic manager rotates through the table and distributes incoming traffic in a RR fashion to the list of

traffic nodes it maintains. Furthermore, the current model can support additional distribution methods.

4.23.9 Discussion of the HAS Architecture Traffic Management

The traffic management scheme is lightweight and configurable, and it is able to cope with failures of individual traffic nodes. It supports a cluster that consists of

heterogeneous cluster nodes with different processor speed and memory capacity.

In case the master node fails, the traffic manager becomes unavailable. In this case, the standby

master node takes over and the standby traffic manager becomes responsible for distributing

incoming traffic. The scheme is transparent to web clients who are unaware that the applications run

on a cluster of nodes.


4.24 Access to External Networks and the Internet

The two classical methods to access external networks are the direct access method and the restricted method. In the direct access method, traffic nodes reply to web clients directly. In the restricted method, traffic nodes forward their responses to one of the master nodes, which then rewrites the response header and forwards it to the web client. With the latter method, access to external networks is restricted by the master nodes in the HA tier, which is achieved using forwarding, filtering, and masquerading mechanisms. As a result, master nodes monitor and filter all accesses to the outside world. The HAS architecture supports both methods; the choice is independent of the architecture and depends on the configuration.

Based on the survey of similar work (Sections 2.9 and 2.10), the direct access method is the most efficient traffic distribution method and helps improve system scalability. The HAS cluster prototype supports this model, and we performed our benchmarking tests with it.

Figure 80 illustrates the direct access scenario. When a request arrives at the cluster (1), the master node examines it, decides where the request should be forwarded, and forwards it (3) to the appropriate traffic node. The traffic node processes the request and replies directly (4) to the web client. The HAS architecture also supports the restricted access method, as this choice is implementation specific.

Figure 80: The direct routing approach – traffic nodes reply directly to web clients

Figure 81 illustrates the restricted access scenario, which requires re-writing of IP packets. When a request arrives at the cluster (1), the master node examines it, decides where the request should be



forwarded, rewrites the packets, and forwards it (3) to the appropriate traffic node. The request reaches the traffic node, which processes it and replies (4) to the master node. The master node then re-writes the packets (5) and sends the final reply to the web client (6).

Figure 81: The restricted access approach – traffic nodes reply to master nodes, which in turn reply to the web clients

4.25 Ethernet Redundancy

The Ethernet redundancy daemon (erd) runs after the "route" configuration data has been installed, following boot-up. In order to provide link redundancy, it is necessary to configure the paired ports (eth0 and eth1) with the same IP and MAC addresses.

The "erd" program starts by fetching the configuration of all routes and storing them in a table (loaded from /proc/net/route). The table is used to compose route deletion commands for all routes bound to the secondary link (eth1). The route configuration for the primary link (eth0) is then copied to the secondary link (eth1) with a higher metric (meaning a lower priority). The primary link (eth0) is then polled every x milliseconds (x is a configurable parameter in the configuration file). When the primary link fails, all routing information for that link is deleted, causing traffic to be switched to the secondary link. Upon link restoration, the cached routing information for the primary link is restored, forcing a traffic switchback. A simplified sketch of this polling loop follows.
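The sketch below is a rough approximation of the behaviour described above, written for illustration only; the use of the carrier file under /sys, the "ip route" commands, and the polling interval are assumptions, since the actual daemon works directly with /proc/net/route and ioctl.

import subprocess
import time

PRIMARY, SECONDARY = "eth0", "eth1"
POLL_MS = 500   # illustrative value of the configurable polling interval

def link_up(iface):
    """Return True if the link reports a carrier (stand-in for the RUNNING bit)."""
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        return False

def routes_for(iface):
    """Cache the routes currently bound to an interface."""
    out = subprocess.run(["ip", "route", "show", "dev", iface],
                         capture_output=True, text=True).stdout
    return [line.split() for line in out.splitlines()]

cached_routes = routes_for(PRIMARY)
primary_active = True
while True:
    if primary_active and not link_up(PRIMARY):
        # Delete the primary routes so traffic falls back to the secondary link.
        for route in cached_routes:
            subprocess.run(["ip", "route", "del"] + route + ["dev", PRIMARY])
        primary_active = False
    elif not primary_active and link_up(PRIMARY):
        # Link restored: re-install the cached routes to force a switchback.
        for route in cached_routes:
            subprocess.run(["ip", "route", "add"] + route + ["dev", PRIMARY])
        primary_active = True
    time.sleep(POLL_MS / 1000.0)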

4.25.1 Command Line Usage

The command has specific usage syntax:

% erd [eth0 eth1]



The example shown above indicates that eth0 has eth1 as its backup link. If we do not specify parameters in the command, it defaults to the equivalent of "erd eth0 eth1". We automated the execution of this command at system startup.

4.25.2 Encountered Issues

The servers in our prototype use Tulip Ethernet cards. We patched the tulip.c driver to make the

MAC addresses for ports 0 and 1 identical. Alternatively, we also were able to get the same result by

issuing the following commands (on Linux) to set the MAC address for an Ethernet port:

% ifconfig eth[X] down

% ifconfig eth[X] hw ether AA:BB:CC:DD:EE:FF

% ifconfig eth[X] up

We also modified the source code of the Ethernet device driver tulip.c to toggle the RUNNING bit in the device flags variable, which allows ifconfig to present the state of an Ethernet link. The erd daemon accesses the state of the RUNNING bit for the primary link via the ioctl system call, as sketched below.
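As an illustration of this ioctl-based check, the following sketch reads the interface flags with SIOCGIFFLAGS and tests the RUNNING bit; the interface name is an assumption, and the constants are the standard Linux values.

import fcntl
import socket
import struct

SIOCGIFFLAGS = 0x8913   # ioctl request: get interface flags
IFF_RUNNING = 0x40      # the RUNNING bit exposed by the patched driver

def is_running(ifname):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # The ifreq structure starts with the interface name (16 bytes),
        # followed by the flags as a short integer.
        ifreq = struct.pack("16sH14s", ifname.encode(), 0, b"\x00" * 14)
        result = fcntl.ioctl(sock.fileno(), SIOCGIFFLAGS, ifreq)
        flags = struct.unpack("16sH14s", result)[1]
        return bool(flags & IFF_RUNNING)
    finally:
        sock.close()

print(is_running("eth0"))   # True when the primary link is up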

4.26 Dependencies and Interactions between Software Components

The HAS architecture consists of many software components (Section 4.4) that depend on each other to provide service. When the work initially started, our goal was to minimize the number of dependencies so as to reduce the possibility of cascading failures. However, as the work progressed and the architecture grew to support additional functionality, the number of system components grew as well, and their relationships became more complex.

Figure 82 illustrates the dependencies and interconnections of the various system software components within the HAS architecture. As a future work item, we plan to consolidate the functionalities of these system software components into fewer components. The following subsections discuss the dependencies among these software components and describe the nature of their interactions.


Figure 82: The dependencies and interconnections of the HAS architecture system software

4.26.1 Traffic Manager and the CVIP

The CVIP interface is the point of entry to the HAS cluster. All incoming connections arrive at the CVIP and are forwarded to the traffic manager running on the master node in the HA tier, which decides which traffic node is to service the incoming request. Therefore, the traffic manager depends on the CVIP to receive incoming requests.



4.26.2 The Saru Module and the Heartbeat Daemon

When the HA tier is in the active/active redundancy model, the saru module runs in coordination with heartbeat on each of the master nodes. The saru module is responsible for dividing the incoming connections between the two master nodes. The heartbeat daemon provides a mechanism to determine which master nodes are available, and the saru module uses this information to divide the space of all possible incoming connections between the active master nodes, as sketched below.
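The sketch below is only an illustration of this interaction, not the saru module itself: heartbeat updates the set of available master nodes, and incoming connections are divided between them in a round-robin fashion (the division strategy, names, and interfaces are assumptions).

import itertools

class SaruSketch:
    def __init__(self, masters):
        self.masters = list(masters)
        self._cycle = itertools.cycle(self.masters)

    def heartbeat_update(self, available_masters):
        """Heartbeat tells saru which master nodes are currently alive."""
        self.masters = list(available_masters)
        self._cycle = itertools.cycle(self.masters)

    def assign(self, connection):
        """Pick the master node whose traffic manager handles this connection."""
        return next(self._cycle)

saru = SaruSketch(["master-node-1", "master-node-2"])
print(saru.assign(("192.0.2.10", 51234)))   # -> master-node-1
print(saru.assign(("192.0.2.11", 40000)))   # -> master-node-2
saru.heartbeat_update(["master-node-2"])    # master node 1 reported as failed
print(saru.assign(("192.0.2.12", 40001)))   # -> master-node-2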

4.26.3 Traffic Manager and Traffic Client

Traffic managers running on master nodes depend on the traffic client daemons running on traffic nodes to provide traffic distribution based on the load index of the traffic nodes. If a traffic client daemon does not send the load index to the traffic managers, the traffic managers are not able to distribute traffic accordingly. Therefore, we need to ensure that traffic managers are aware when a traffic node becomes unavailable, when a traffic client daemon fails, and when the application fails on the traffic node.

4.26.4 Heartbeat and CVIP

The active master node owns the address of the virtual service. If the active master node fails, the standby master node becomes the active node (notified by the heartbeat mechanism) and the owner of the virtual service. Therefore, there is a dependency between the heartbeat mechanism and the cluster virtual IP interface.

4.26.5 Heartbeat Instances

Each master node runs a copy of the heartbeat daemon. The heartbeat instances running on the master nodes communicate with each other to ensure that they are available. If the heartbeat on the standby master node does not receive a reply from the heartbeat instance running on the active master node, it declares the active master node dead and assumes the active state. The cron utility monitors the heartbeat daemon locally on each master node, and when the heartbeat daemon fails, the cron utility re-starts it automatically; a minimal sketch of such a watchdog follows.
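The following minimal sketch shows what a cron-invoked watchdog could do; the process name and the init script path are assumptions rather than the thesis configuration.

import subprocess

def heartbeat_running():
    # pidof exits with status 0 when at least one matching process exists.
    return subprocess.run(["pidof", "heartbeat"],
                          capture_output=True).returncode == 0

if not heartbeat_running():
    # Restart the daemon; cron will invoke this script again on the next cycle.
    subprocess.run(["/etc/init.d/heartbeat", "start"])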

4.26.6 LDirectord and Traffic Manager

The LDirectord module communicates with the traffic manager to ensure that the traffic manager

does not send web traffic to a traffic node that hosts a failing web server application. The LDirectord


module reports to the traffic manager the IP address of the traffic node it is running on and the

load_index pre-set to the value of 0 (zero). Section 4.27.12 presents this scenario.

4.26.7 The LDirectord Module and the Traffic Client

The traffic client reports the load of the traffic node to the traffic manager. We also use this mechanism as a health check between the traffic client and the traffic manager. If the traffic client fails to report within a specific time and the timeout is exceeded, the traffic manager considers the traffic node unavailable and removes it from its list of active nodes. However, there is a scenario in which the traffic client reports the load index to the traffic manager while being unaware that the application it is hosting is unavailable (has failed). The role of the LDirectord module is to ensure that the traffic client does not update the load_index while the application is not responsive. Therefore, the LDirectord sets the traffic client load_index_report_flag to 0. When the load_index_report_flag = 0, the traffic client stops reporting its load to the traffic manager. When the application check performed by LDirectord returns positive (the application is up and running), the LDirectord sets the load_index_report_flag to 1. As a result, the traffic client resumes reporting its load index to the traffic manager, and the traffic manager then adds the traffic node to its list of available nodes.

4.26.8 The Saru Module and Traffic Manager

The traffic manager can be a SPOF in the 1+1 active/active redundancy model, and therefore we need a mechanism to ensure continuous availability of the functions of a traffic manager. The saru module is the entity that ensures that, in the event a traffic manager fails, the cluster virtual interface stops sending incoming traffic to it and instead sends traffic to the available traffic manager on the other master node.

The saru module is active on one master node when the nodes in the HA tier follow the 1+1 active/active redundancy model. It is responsible for evenly dividing traffic between the active master nodes, using the round-robin algorithm. Let us assume that the saru module is running on active master node 1. When a web request arrives at the saru module from the routed process, the request is within master node 1. The saru module forwards it to the appropriate traffic manager according to the round-robin algorithm. Based on this, the traffic manager depends on the saru module to receive traffic.


4.26.9 The saru Module and routed Process

The saru module relies on the routed process to receive incoming traffic from the cluster virtual IP

interface.

4.26.10 Traffic Client and the /proc File System

The traffic client daemon depends on the /proc file system to retrieve memory and processor usage, which are the metrics needed to compute the load index of a traffic node.

4.27 Scenario View of the Architecture

The scenario view consists of a small subset of important scenarios – instances of use cases – to

illustrate how the elements of the HAS architecture work together. For each scenario, we describe the

corresponding sequences of interactions between objects and between processes. This view acts as a

driver to help us validate and illustrate the architecture design, both on paper and as the starting point

for the benchmarking tests of the architectural prototype.

The following sub-sections examine these scenarios:

- Normal scenario: The request arrives at the CVIP. The traffic manager forwards it to a traffic node, and the traffic node replies to the web client.

- Traffic client daemon reporting its load: This scenario presents how a traffic node computes and

reports its load index to the traffic managers running on master nodes.

- Adding a traffic node to the SSA tier: This scenario examines the steps involved in adding a traffic node to the SSA tier.

- Boot process of a traffic node: This scenario examines the boot process of a traffic node. There are two variations of this scenario depending on whether the traffic node is diskless or has a local disk.

- Upgrading the operating system and application server software: This scenario describes the generic process of upgrading the kernel and applications on a HAS cluster node.

- Upgrading the hardware of a master node: This scenario presents the process of upgrading a

hardware component on a master node.

- Boot process of a master node: This scenario examines the boot process of a master node.

- Master node becomes unavailable: This scenario illustrates what happens when a master node

becomes unavailable.


- Traffic node becomes unavailable: In some cases, the traffic node can become unavailable

because of a hardware or software error. This scenario illustrates how the cluster reacts to the

unresponsiveness of a traffic node.

- Ethernet port becomes unavailable: A cluster node can face networking problems because of Ethernet card or driver issues. This scenario examines how a HAS cluster node reacts when it faces Ethernet problems.

- Traffic node leaving the cluster: When a traffic node is not available to serve traffic, the traffic

manager disconnects it from the cluster. This scenario illustrates how a traffic node leaves the

cluster.

- Application server process dies on a traffic node: When the application becomes unresponsive, it

stops serving traffic. This scenario examines how to recover from such a situation.

- Network becomes unavailable: This scenario presents the chain of events that take place when the network to which the cluster is connected becomes unavailable.

4.27.1 Normal Scenario

The normal scenario describes how the HAS cluster successfully processes an incoming web request. The processing of the request varies depending on the redundancy model at the HA tier. Therefore, this scenario examines two variations: the HA tier configured for the 1+1 active/standby redundancy model, and then for the 1+1 active/active redundancy model.

4.27.1.1 Normal Scenario, 1+1 Active/Standby

Figure 83 illustrates the sequence diagram of a successful request. This scenario assumes that the nodes in the HA tier follow the 1+1 active/standby redundancy model. The web user issues a request for a specific service that the cluster provides through its virtual interface (1). The CVIP interface receives the request and forwards it to the traffic manager running on the active master node (2). The traffic manager locates the target node from its list of available traffic nodes (3) and forwards the request to the appropriate traffic node based on the traffic distribution algorithm (4). The traffic node receives the request, processes it, and returns the reply directly to the web user (5).


Figure 83: The sequence diagram of a successful request with one active master node

4.27.1.2 Normal Scenario, 1+1 Active/Active

Figure 84 illustrates the sequence diagram of an incoming request that arrives at the HA tier of the HAS cluster with two active master nodes. This scenario assumes that the nodes in the HA tier follow the 1+1 active/active redundancy model. The web user accesses a service hosted on the web server (1). The request arrives at the cluster IP interface (2). Master node 1 is elected by the saru module to own the CVIP interface and receive incoming requests from the CVIP. Therefore, the request arrives at master node 1 (3). The saru module forwards it to the traffic manager running on master node 2 (4); the saru module is responsible for evenly dividing traffic between the active master nodes, using the round-robin algorithm. The traffic manager on master node 2 checks (5) for the best available traffic node from its list of available traffic nodes based on the distribution algorithm, and then forwards the request to it (6). The web server application receives the request, processes it, and replies directly to the user (7).



Figure 84: The sequence diagram of a successful request with two active master nodes

4.27.2 Traffic Node Reporting Load Status

Figure 85 illustrates the sequence diagram of a traffic node reporting its load information to the traffic

managers running on master nodes. This scenario assumes that each traffic node runs a copy of the

traffic client daemon. When the traffic client daemon starts, it loads its configuration from the

/etc/tcd.conf configuration file. As a result, it obtains the IP addresses of the master nodes to

which it reports its load, the port number, the frequency of the load index update, and the location

where it logs errors. The traffic client daemon then reads the processor and memory information from

the /proc file system (1), computes the load index (Section 4.23.5), and reports it (2) to traffic

managers running on the master nodes. The traffic managers receive load index of the traffic node,

and update their list of available traffic nodes and their load indexes (4)(5). The frequency of

reporting the load of the traffic node is defined in the traffic client configuration file. Due to the

repetitive nature of this activity, and since the processor information is static, unlike memory

information, the processor information is read from the /proc file system on the first time the traffic

client is started.
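The following sketch shows, for illustration only, the periodic loop a traffic client daemon could run; the master node addresses, the port, the UDP message format, and the exact load-index formula are assumptions standing in for the values read from /etc/tcd.conf and the formula of Section 4.23.5.

import re
import socket
import time

MASTERS = [("10.0.0.1", 5000), ("10.0.0.2", 5000)]   # assumed values from /etc/tcd.conf
REPORT_PERIOD = 2.0                                  # seconds, also a configuration value

def read_meminfo(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(re.search(r"\d+", line).group())   # value in kB
    raise KeyError(field)

def cpu_mhz():
    # Processor speed is static, so it is read only once at start-up.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.lower().startswith("cpu mhz"):
                return float(line.split(":")[1])
    return 1000.0

def main():
    my_ip = socket.gethostbyname(socket.gethostname())
    mhz = cpu_mhz()
    total = read_meminfo("MemTotal")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        free = read_meminfo("MemFree")
        load_index = mhz * (free / total)      # illustrative load-index formula
        message = f"{my_ip} {load_index:.2f}".encode()
        for master in MASTERS:
            sock.sendto(message, master)       # report to both traffic managers
        time.sleep(REPORT_PERIOD)

if __name__ == "__main__":
    main()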



Figure 85: A traffic node reporting its load index to the traffic manager

4.27.3 Adding a Traffic Node to the HAS Cluster

The HAS architecture allows the addition of traffic nodes to the cluster in response to increased

traffic transparently and dynamically. The HAS cluster administrator configures the traffic node to

boot from LAN. The traffic node boots and downloads a kernel image and a ramdisk (Section

4.27.4.1). The ramdisk includes three software components that are started automatically: the

Ethernet redundancy daemon, the LDirectord, and the traffic client.

Figure 86 illustrates the scenario of a traffic node joining the HAS cluster. When the traffic client

starts, it reads the processor speed and available memory from the /proc file system and computes

the load index (1). The traffic client then reports its load index to the traffic managers running on the

master nodes in the HA tier as a pair of traffic node IP address and the load index (2)(3). The traffic

managers receive the notification and update their list of available traffic nodes (4)(5). Next, the

traffic node is ready to service incoming requests and the traffic manager starts forwarding traffic to

the new traffic node.

[Figure 85 content: the traffic client daemon (TCD) on traffic node B, which knows the traffic managers from its configuration file, retrieves the CPU load and free memory and computes the load index as Load_index = CPU * (free_RAM / lowest_RAM) (1); it reports the load index over the communication port to the traffic manager daemons on master nodes 1 and 2 (2)(3), which update their lists of available traffic nodes (4)(5). At this point, traffic node B is on the list of available traffic nodes and the traffic manager starts forwarding incoming traffic to it.]


Figure 86: A traffic node joining the HAS cluster

4.27.4 Boot Process of a Traffic Node

There are two types of traffic nodes, diskless nodes and nodes with a local disk, and each boots differently.

The following sub-sections describe both booting methods, but first, we examine how the cluster

operates. The nodes in the cluster boot from the network. When the nodes boot, they broadcast their

Ethernet MAC address looking for a DHCP server. The DHCP server, running on the master nodes

and configured to listen for specific MAC addresses, responds with the correct IP address for the

nodes. Alternately, the DHCP server responds to any broadcast on its physical network with IP

information from a designated range of IP addresses. The nodes receive the network information they

need to configure their interfaces: IP addresses, gateway, netmask, domain name, the IP address of

the image and boot server, and the name of the boot file. Next, the nodes download a kernel image

from the boot server, which can also be the master node. The boot server responds by sending a

network loader to the client node, which loads the network boot kernel. The network boot loader

mounts the root file system as read-only; the network loader reads the network boot kernel sent from

the boot server into local memory and transfers control to it. The kernel mounts root as read/write,



mounts other file systems, and starts the init process. The init process brings up the customized Linux

services for the node, and the node is now fully booted and all initial processes are started.

4.27.4.1 Boot Process of a Diskless Traffic Node

Figure 87 illustrates the process of adding a diskless node to a running HAS cluster.

Figure 87: The boot process of a diskless node

1. Ensure that the MAC address of the NIC on that diskless node is associated with a traffic node and configured on the master nodes as a diskless traffic node. The notion of diskless is important since the traffic node will download a kernel and a ramdisk image from the image server. Traffic nodes have their BIOS configured to boot from the network. When the administrator starts a traffic node, the PXE client that resides in the NIC ROM sends a DHCP_DISCOVER message.

2. The DHCP server, running on the master node, sends the IP address for the node with the address

of the TFTP server and the name of the PXE bootloader file that the diskless traffic node should

download.

3. The NIC PXE client then uses TFTP to download the PXE bootloader.

4. The diskless traffic node receives the kernel image (diskless_node) and boots with it.

5. Next, the diskless traffic node sends a TFTP request to download a ramdisk.

6. The image server sends the ramdisk to the diskless traffic node. The diskless traffic node

downloads the ramdisk and executes it.



When the diskless traffic node executes the ramdisk, the traffic client daemon starts and reports the load of the node periodically to the master nodes. The traffic manager, running on the master nodes, adds the traffic node to its list of available traffic nodes and starts forwarding traffic to it.

4.27.4.2 Boot Process of a Traffic Node with Disk

There are two scenarios to add a node with disk to the HAS cluster. The scenarios differ if the traffic

node has the latest versions of the operating system and the application server or if the traffic node

requires the upgrade of either the operating system or the application server. Figure 88 illustrates the

sequence diagram of a traffic node with disk that is booting from the network. If the traffic node

requires the upgrade of either the operating system or the application server, the traffic node

downloads the updated operating system image and then a ramdisk image including the newer version

of the web server application.

Figure 88: The boot process of a traffic node with disk – no software upgrades are performed

Figure 89 illustrates the process of upgrading the ramdisk on a traffic node. To rebuild a traffic node

or upgrade the operating system and/or the ramdisk image, we re-point the symbolic link in the

DHCP configuration to execute a specific script, which results in the desired upgrade. At boot time,

the DHCP server checks if the traffic node requires an upgrade and if so, it executes the

corresponding script.

[Figure 88 content: the traffic node with disk sends a DHCP_DISCOVER from its PXE client (1) and receives its IP address and network configuration from the DHCP/image server (2).]


Figure 89: The process of rebuilding a node with disk

4.27.5 Upgrading Operating System and Application Server

The architecture promises service continuity even in the event of upgrading the operating system and/or the application servers running on the traffic nodes. Figure 90 illustrates the events that take place when we initiate an upgrade of both the kernel and the application server on a specific traffic node. While a traffic node is undergoing a software upgrade, traffic arrives at the cluster and the traffic manager forwards it to the available traffic nodes. Following this model, we upgrade the software stack on the cluster nodes without service downtime.

[Figure 89 content: the traffic node with disk exchanges the following steps with the DHCP/image server – (1) DHCP_DISCOVER from the PXE client; (2) IP address, TFTP server, rsync servers, and name of the PXE bootloader; (3) TFTP request; (4) PXE bootloader; (5) TFTP request; (6) IP_addr configuration file (linux_install); (7) TFTP (node_with_disk); (8) boot kernel and initrd; (9) FTP (node_disk_ramdisk); (10) boot kernel with full ramdisk image.]


Figure 90: The process of upgrading the kernel and application server on a traffic node

4.27.6 Upgrading Hardware on Master Node

Figure 91 illustrates the events that take place when a master node undergoes a hardware upgrade. The administrator shuts down (1) master node 1, which goes offline (2). The heartbeat service on master node 1 is no longer available (3) to reply to the heartbeat messages sent to it from the heartbeat instance running on master node 2. After a timeout (specified in the heartbeat configuration file), master node 2 becomes the active master node (4) and the owner of the virtual services offered by the cluster. As a result, all incoming traffic arrives at the only available master node (5). The traffic manager on that master node forwards traffic to the appropriate traffic nodes, and responses are sent back to the web users.

[Figure 90 content: the system administrator reboots the traffic node, which is flagged for a restage; the node requests a restage with a new kernel and/or a new version of the application server; the image server uploads the new kernel image to the traffic node; the node boots with the new kernel and signals the new kernel version; the image server then uploads the new version of the application server, and the node starts it.]


Figure 91: The sequence diagram of upgrading the hardware on a master node

4.27.7 Master Node Boot Process

The process of booting a master node is different from that of a traffic node: the cluster administrator configures the master nodes, while the master nodes manage the traffic nodes. The cluster administrator installs all the software packages and tunes the configuration files for all system software components running on the master nodes. The required components include the network time protocol server, the cluster IP interface, the Ethernet redundancy daemon, the traffic manager, the IPv6 router advertisement daemon, the heartbeat daemon, DHCP, and the NFS redundant server daemon.

4.27.8 Master Node Becomes Unavailable

Master nodes supervise the availability of each other using heartbeat. With heartbeat, the failure of a

master node is detected within a delay 200 ms. One common scenario of a failure is when a master

node becomes unavailable because of a hardware problem or an operating system crash. It is critical

to have a mechanism in place to deal with such a challenge.

Figure 92 illustrates the sequence of events when master node 1 becomes unavailable. When the

master node 1 becomes unavailable (1), the heartbeat instance running on the master node 2 does not



receive the keep-alive message from master node 1; after a configurable timeout (2), master node 2 becomes the active master node (3) and the owner of the cluster virtual interface (4)(5). There is no

impact on storage because the content of the storage system is duplicated and synchronized on both

master nodes (Section 4.7.4).

Figure 92: The sequence diagram of a master node becoming unavailable

Figure 93 illustrates the sequence diagram of synchronizing storage when one of the master nodes

fails.

Figure 93: The NFS synchronization occurs when a master node becomes unavailable

[Figure 93 content: (1) master node 1 dies; (2) the HA NFS daemon on master node 2 detects the failure of the NFS server on master node 1; (3) master node 1 comes back online; (4) the NFS implementation detects that the NFS server on master node 1 is back online; (5) the two master nodes synchronize with rsync. The redundant NFS servers were mounted on the traffic nodes using the modified mount command, which allows mounting two servers towards the same mount target.]



When the initial active master node becomes available again for service, there is no need for a switchback of the active status between the two master nodes. The recovered master node acts as a hot standby for the currently active master node. As future work, we would like the standby master node to switch to the load-sharing mode (1+1 active/active), helping the active master node direct traffic to the traffic nodes once the active master node reaches a pre-defined threshold. When master node 1 becomes available again (3), its NFS server is re-started; it mounts the storage and re-synchronizes its local content with master node 2 using the rsync utility.

4.27.9 Traffic Node becomes Unavailable

Traffic nodes, similarly to master nodes, can face problems and become unavailable to serve

incoming traffic. The HAS cluster architecture overcomes this challenge by supporting node level

redundancy. Figure 94 illustrates the scenario when a traffic node becomes unavailable.

Figure 94: The sequence diagram of a traffic node becoming unavailable

When a traffic node becomes unavailable (1), the traffic client daemon (running on that node)

becomes unavailable and does not report the load index to the master nodes. As a result, the traffic

manager daemons do not receive the load index from the traffic node (2). After a timeout, the traffic



managers remove the traffic node from their list of available traffic nodes (3). However, if the traffic node becomes available again (4), the traffic client daemon reports the load index to the traffic manager running on the master node (5). When the traffic manager receives the load index from the traffic node, it is an indication that the node is up and ready to provide service. The traffic manager then adds the traffic node (6) to the list of available traffic nodes. A traffic node is declared unavailable if it does not send its load statistics to the master nodes within a specific configurable time, as illustrated by the sketch below.
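The following sketch illustrates, under assumed data structures and a hypothetical timeout value, the bookkeeping a traffic manager could use to expire traffic nodes whose load reports stop arriving.

import time

REPORT_TIMEOUT = 5.0   # seconds without a load report before removal (assumed value)

class NodeTable:
    def __init__(self):
        self.nodes = {}   # node IP -> (load_index, time of last report)

    def on_report(self, node_ip, load_index):
        # A report both updates the load and re-adds a previously removed node.
        self.nodes[node_ip] = (load_index, time.monotonic())

    def expire(self):
        now = time.monotonic()
        for node_ip in list(self.nodes):
            _, last = self.nodes[node_ip]
            if now - last > REPORT_TIMEOUT:
                # Declare the node unavailable and stop forwarding traffic to it.
                del self.nodes[node_ip]

    def available_nodes(self):
        self.expire()
        return list(self.nodes)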

Figure 95: The scenario assumes that node C has lost network connectivity

Figure 95 illustrates a traffic node losing network connectivity. The scenario assumes that traffic node

C lost network connectivity, and as a result, it is not a member of the HAS cluster. The traffic

manager now forwards incoming traffic to the other remaining traffic nodes.

4.27.10 Ethernet Port becomes Unavailable

All cluster nodes are equipped with two Ethernet cards. The Ethernet redundancy daemon, running on

each cluster node, is responsible for detecting when an Ethernet interface becomes unavailable and

activating the second Ethernet interface. Figure 96 illustrates this scenario. When the Ethernet port 1

becomes unavailable (1), the Ethernet redundancy daemon detects the failure after a timeout (2) and

activates Ethernet port 2 with the same MAC address and IP address as Ethernet port 1 (3). From this

point further, all communication goes through Ethernet port 2.



Figure 96: The scenario of an Ethernet port becoming unavailable

4.27.11 Traffic Node Leaving the Cluster

If a traffic node does not report its load index to the traffic managers running on the master nodes, the latter remove the traffic node from their lists of available traffic nodes and stop forwarding traffic to it. Figure 97 illustrates how a traffic node leaves the cluster.

Figure 97: The sequence diagram of a traffic node leaving the HAS cluster




When the traffic managers stop receiving the messages in which the traffic node reports its load index (1)(2), they remove the node, after a defined timeout, from their lists of available traffic nodes (3). The scenario of a traffic node leaving the HAS cluster is similar to the scenario Traffic Node Becomes Unavailable presented in Section 4.27.9.

4.27.12 Application Server Process Dies

The availability of the service depends on the availability of the application process (in our case, the Apache web server). If the application crashes and becomes unavailable, we need a mechanism to detect the failure and restart the application. This is important because a master node cannot detect such a failure in the traffic nodes, and as a result the traffic manager continues to forward incoming traffic to a traffic node that hosts a failing application. In addition, we need to maintain consistency between the traffic manager and the traffic client.

Our approach to overcome this challenge requires two pieces of system software: the cron operating system facility and the LDirectord module. The cron utility is a UNIX system daemon that executes commands or scripts as scheduled by the system administrator. As such, the operating system monitors the application and re-starts it when it crashes. However, if the application crashes in between cron checks, the traffic manager continues to forward requests to the web server running on the traffic node, and new and ongoing requests fail because the web server is down.

Figure 98 illustrates our initial approach to addressing this challenge; a minimal sketch of the corresponding check loop follows the numbered steps below.

(1) The LDirectord performs an application check by making an HTTP request to the application and

checking the result. The check, performed every x milliseconds (x is a configurable parameter),

ensures that we can open an HTTP connection to the service on the traffic node.

(2) The LDirectord receives the result of the check. A positive result indicates that the application is

up and running; no action is required from LDirectord. A negative result indicates that the web

server did not respond as expected.

(3) If the check is negative, then the LDirectord connects to the traffic manager and reports that the

application on the traffic node is not available. Otherwise, the traffic manager continues to

forward requests to the traffic node on which the unresponsive application is located. The

LDirectord delivers the pair of (traffic_node_IP, load_index) to the traffic manager, with

the load_index = 0. A load_index = 0 ensures that the traffic managers stop forwarding

traffic to the specific traffic node.


(4) The traffic manager updates its list of available traffic nodes and stops forwarding traffic to the

traffic node.

(5) The LDirectord needs to ensure that the traffic client does not update the load_index while the

application is not responsive. The LDirectord sets the load_index_report_flag to 0.

(6) When the load_index_report_flag = 0, the traffic client stops reporting its load to the

traffic managers.

(7) On the next loop cycle, the LDirectord checks if the application is still unresponsive. If the

application is still not available, then no action is required from LDirectord.

(8) If the application check returns positive, then the LDirectord connects to the traffic client and resets the load_index_report_flag to 1. When the load_index_report_flag = 1, the

traffic client resumes reporting its load to the traffic manager.

(9) The traffic client reports the new load_index that overwrites the 0 value.

(10) The traffic manager updates its list of available traffic nodes and starts forwarding traffic to

the traffic node.
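The following sketch, written for illustration only, captures the check loop of steps (1) to (10); the check URL, the interval, and the report interfaces toward the traffic manager and the traffic client are assumptions, not the LDirectord or prototype code.

import time
import urllib.request

CHECK_INTERVAL = 2.0                   # seconds between application checks (assumed)
CHECK_URL = "http://127.0.0.1:80/"     # web server on the local traffic node (assumed)

def application_alive():
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=1) as reply:
            return reply.status == 200
    except OSError:
        return False

def run_checks(traffic_manager, traffic_client, node_ip):
    app_was_alive = True
    while True:
        alive = application_alive()
        if not alive and app_was_alive:
            # Steps (3)-(6): report a load_index of 0 so the TM stops forwarding
            # traffic, and freeze the traffic client's load reports.
            traffic_manager.report(node_ip, load_index=0)
            traffic_client.load_index_report_flag = 0
        elif alive and not app_was_alive:
            # Step (8): the application is back; let the TC resume reporting.
            traffic_client.load_index_report_flag = 1
        app_was_alive = alive
        time.sleep(CHECK_INTERVAL)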

Figure 98: The LDirectord restarting an application process



4.27.13 Network Becomes Unavailable

In the event that one network becomes unavailable, the HAS cluster needs to survive such a failure

and switch traffic to the redundant available network. Figure 99 illustrates this scenario.

Figure 99: The network becomes unavailable

When the router becomes unavailable (3), the reply sent through Ethernet port 1 times out (5). At this point, Ethernet port 1 uses its secondary route through router 2. We could use the heartbeat mechanism to monitor the availability of routers; however, since routers are outside our scope, we do not pursue how to use heartbeat to discover and recover from router failures.

4.28 Network Configuration with IPv6

By default, cluster administrators configure the network settings on all cluster nodes with IPv4, through either static configuration or a DHCP server. However, with clusters consisting of tens or hundreds of nodes, these methods are labor intensive and error prone. Designers of network protocols recognized the difficulty of installing and configuring TCP/IP networks and, over the years, have come up with solutions to overcome these pitfalls. Their latest outcome is a newly designed version of the IP protocol, IPv6. One of IPv6's useful features is its auto-configuration ability: it does not require a stateful configuration protocol such as DHCP. Hosts, in our case cluster nodes, can use router discovery to determine the addresses of routers and other configuration parameters. The router



advertisement message also includes an indication of whether the host should use a stateful address

configuration protocol.

There are two types of auto-configuration. Stateless configuration requires the receipt of router

advertisement messages. These messages include stateless address prefixes and preclude the use of a

stateful address configuration protocol. Stateful configuration uses a stateful address configuration

protocol, such as DHCPv6, to obtain addresses and other configuration options. A host uses stateful

address configuration when it receives router advertisement messages that do not include address

prefixes and require that the host use a stateful address configuration protocol. A host also uses a

stateful address configuration protocol when there are no routers present on the local link. By default,

an IPv6 host can configure a link-local address for each interface. The main idea behind IPv6

autoconfiguration is the ability of a host to auto-configure its network settings without manual

intervention.

Autoconfiguration requires routers on the local network to run a program that answers the autoconfiguration requests of the hosts. The radvd (Router ADVertisement Daemon) provides this functionality: it listens for router solicitations and answers with router advertisements.

Figure 100: The sequence diagram of the IPv6 autoconfiguration process

Figure 100 illustrates the process of auto-configuration. This scenario assumes that the router

advertisement daemon is started on at least one master node, and that cluster nodes support the IPv6

protocol at the operating system level, including its auto-configuration feature. The node starts (1). As

the node is booting, it generates its link local address (2). The node sends a router solicitation

message (3). The router advertisement daemon receives the router solicitation message from the

cluster node (4); it replies with the router advertisement, specifying subnet prefix, lifetimes, default

router, and all other configuration parameters. Based on the received information, the cluster node

generates its IP address (5). In the last step, the cluster node verifies the usability of the address by performing the Duplicate Address Detection procedure. As a result, the cluster node has fully configured its Ethernet interfaces for IPv6.
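To make step (2) of this process concrete, the short sketch below (an illustration only, not part of the prototype; the MAC address shown is hypothetical) derives the link-local address a node would generate from its MAC address using the modified EUI-64 method:

    import ipaddress

    def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
        """Derive the IPv6 link-local address a node generates from its MAC
        address using the modified EUI-64 method."""
        octets = [int(part, 16) for part in mac.split(":")]
        octets[0] ^= 0x02                                  # flip the universal/local bit
        eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]     # insert FF:FE in the middle
        groups = [(eui64[i] << 8) | eui64[i + 1] for i in range(0, 8, 2)]
        return ipaddress.IPv6Address("fe80::" + ":".join(f"{g:x}" for g in groups))

    # Hypothetical MAC address of a traffic node interface:
    print(eui64_link_local("00:0c:29:3e:1a:7b"))           # fe80::20c:29ff:fe3e:1a7b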

4.28.1 Changes Required to Support IPv6

Many changes are required on the cluster nodes to support IPv6. Some of the changes are common to

all cluster nodes; others are limited to the master nodes. In addition to supporting IPv6, we need to

add some missing IPv6 network functionalities. The following subsections describe the changes to the

HAS architecture prototype to support a dual IPv4 and IPv6 environment (Figure 101). We have

conducted many experiments and tests that are documented in [131].

4.28.1.1 Changes Needed on All Nodes

The single update required on all cluster nodes is to enable support for the IPv6 protocol in the Linux kernel. We demonstrate how to enable IPv6 in the kernel in [132].

4.28.1.2 Specific Changes to Master Nodes

There are several changes needed on master nodes in order to fully support IPv6 and provide all

functionalities and services over IPv6. The core services that need to support IPv6 on these nodes

include NTP, DHCP, the cluster virtual IP interface, and all other IP services provided by the master nodes.

4.28.1.3 Added Functionalities to the Cluster

For general cluster infrastructure, we need to provide support for IPv6 DNS and IPv6 firewalling.

Figure 101 presents our prototyped cluster in an environment that supports IPv4 and IPv6. With the

illustrated environment, incoming traffic to the cluster can be over either IPv4 or IPv6. The cluster

virtual IP interface receives the traffic and forwards it within the cluster. The upstream provider is the authority that provides an IPv6 address range to the router, which manages the assignment of IP addresses within the cluster. The IPv6 DNS server is responsible for IPv6 address resolution. The

IPv6 in IPv4 tunnel in Figure 101 enables IPv6 hosts and routers to connect with other IPv6 hosts and

routers over the existing IPv4 Internet. IPv6 tunneling encapsulates IPv6 datagrams within IPv4

packets. The encapsulated packets travel across the IPv4 Internet until they reach their destination host

or router. The IPv6-aware host or router decapsulates the IPv6 datagrams, forwarding them as needed.

IPv6 tunneling eases IPv6 deployment by maintaining compatibility with the large existing base of

IPv4 hosts and routers.
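As a minimal illustration of this encapsulation (an assumption for illustration only; the prototype relies on the operating system's tunneling support rather than on application code), the sketch below wraps an IPv6 datagram in an IPv4 header carrying protocol number 41, the value registered for IPv6-in-IPv4 tunneling:

    import socket
    import struct

    def ipv4_encapsulate(ipv6_datagram: bytes, src: str, dst: str) -> bytes:
        """Wrap an IPv6 datagram in a minimal IPv4 header with protocol 41
        (IPv6-in-IPv4). The header checksum is left at zero for brevity."""
        total_length = 20 + len(ipv6_datagram)
        header = struct.pack("!BBHHHBBH4s4s",
                             0x45, 0, total_length,   # version/IHL, TOS, total length
                             0, 0,                    # identification, flags/fragment offset
                             64, 41, 0,               # TTL, protocol 41, checksum placeholder
                             socket.inet_aton(src), socket.inet_aton(dst))
        return header + ipv6_datagram

    # Hypothetical tunnel endpoints and a mostly-zero 40-byte IPv6 header:
    packet = ipv4_encapsulate(b"\x60" + b"\x00" * 39, "192.0.2.1", "198.51.100.1")
    print(len(packet))   # 60 bytes: 20-byte IPv4 header + 40-byte IPv6 header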

Figure 101: A functional HAS cluster supporting IPv4 and IPv6

Chapter 5 Architecture Validation

5.1 Introduction

The initial goal of this work was to propose an architecture that allows web clusters to scale up to

16 nodes while maintaining the baseline performance of each individual cluster node. The validation

of the architecture is an important activity that allows us to determine and verify if the architecture

meets our initial requirements. Network and telecom equipment providers use professional services of

specialized validation test centers to test and validate their products.

This chapter presents three types of validation for the HAS architecture. The first is the validation of

scalability and high availability. It presents the benchmarking results that demonstrate the ability to

scale the HAS architecture for up to 18 nodes (2 master nodes and 16 traffic nodes) while maintaining

the baseline performance across all traffic nodes. In addition, this chapter presents the results of the

HA testing to validate the HA capabilities. The second validation is the external validation by open

source projects. It describes the impact of the work on the HA-OSCAR project. The third validation is

the adoption of the architecture by the industry as the base architecture for communication platforms

that run telecom applications providing mission critical services.

5.2 Validation of Performance and Scalability

The benchmarking environment we built to measure the performance and scalability of the HAS architecture prototype is similar to the benchmarking environment presented in Section 3.3; however, in testing the HAS architecture prototype, we used more client machines to generate traffic and tested with a newer version of the WebBench software, version 5.0. To generate web traffic, we used 31 web client machines powered by Intel Pentium III and Celeron processors. Each client machine runs a copy of the WebBench benchmarking software. In addition to the 31 web client machines, we used one Pentium IV machine as the WebBench controller to manage the testing and to collect and compile test results from all the web client machines.

WebBench is a benchmark program that measures the performance of web servers. WebBench uses

PC clients to send requests for standardized workloads to the web server. The workload is a

combination of static files and dynamic executables that run in order to produce the data the server

returns to the client. These client machines simulate web browsers. When the server replies to a client

request, the client records information such as how long the server took and how much data it

returned and then sends a new request. When the test ends, WebBench calculates two overall server

scores, requests per second and throughput in bytes per second, as well as individual client scores.

WebBench maintains at run-time all the transaction information and uses this information to compute

the final metrics presented when the tests are completed.
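As a rough illustration of how these two overall scores relate to the per-client data (a sketch with made-up numbers, not WebBench's actual implementation), the per-client request and byte counts can be aggregated over the test duration as follows:

    # Hypothetical per-client totals over one test mix: (requests completed, bytes returned).
    client_results = [(14_300, 82_560_900), (14_250, 82_275_000), (14_310, 82_618_600)]
    test_duration_s = 100   # assumed length of the mix, in seconds

    requests_per_second = sum(requests for requests, _ in client_results) / test_duration_s
    throughput_bps = sum(received for _, received in client_results) / test_duration_s

    print(f"{requests_per_second:.0f} requests/s, {throughput_bps / 1024:.0f} KB/s")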

Figure 102: A screen capture of the WebBench software showing 379 connected clients

Figure 102 is a screen capture from the WebBench controller that shows 379 connected clients from

the client machines that are ready to generate traffic.

The benchmarking tests took place at the Ericsson Research lab in Montréal, Canada. Although the

lab connects to the Ericsson Intranet, our LAN segment is isolated from the rest of the Ericsson

network and therefore our measurement conditions are under well-defined control.

Figure 103 illustrates the network setup in the lab. The client computers run WebBench to generate web traffic, with one computer running WebBench as the test manager. These computers connect to a fiber-capable Cisco switch (2) through 100 Mbps links. The Cisco switch connects to the HAS cluster (3) through a 1 Gbps fiber link. We conducted most benchmarking tests over IPv4, with some

additional tests conducted over IPv6. The results demonstrate that we are able to achieve results similar to those with IPv4, however with a slight decrease in performance [131].

Figure 103: The network setup inside the benchmarking lab

We experienced a decrease in the number of successful transactions per second per processor ranging between 2% and 4% [133]. We believe that this is a direct result of the relative immaturity of the IPv6 networking stack compared to the mature IPv4 networking stack.

5.3 The Benchmarked HAS Architecture Configurations

The HAS architecture prototype consists of 18 nodes; each has a Pentium III processor and 512 MB

of RAM. We configured the HAS cluster prototype as follows:

- Two master nodes providing a single IP interface, cluster services, and storage service through a

redundant HA NFS implementation. The master nodes run following the 1+1 active/standby

redundancy model. They run redundant copies of NTP, RAD, DHCP, and TM. In addition, each

master node runs an instance of heartbeat, the CVIP, and the Ethernet redundancy daemon.

- Sixteen traffic nodes that provide service to web clients. Each traffic node runs a copy of the Apache web server, version 2.0.35. The Apache server serves documents available on

the shared storage. Each traffic node runs a copy of the traffic client daemon, which reports the

node load to the traffic manager daemons running on master nodes.

Figure 104 illustrates the benchmarked configurations of the HAS architecture prototype. In between

each of the benchmarking tests, all nodes in the HAS cluster were rebooted and initialized to ensure

that we were starting from a fresh install. The same applies to all the web client machines and the

WebBench controller machine.

Figure 104: The benchmarked HAS cluster configurations showing Test-[1..4]

I performed the following four benchmarking tests:

- Test-1 generates traffic to a HAS cluster that consists of two master nodes and two traffic nodes. This is the minimal deployment of a HAS cluster.

- Test-2 generates traffic to a HAS cluster that consists of two master nodes and four traffic nodes. This configuration has twice as many traffic nodes as Test-1.

- Test-3 generates traffic to a 10-processor HAS cluster that consists of two master nodes and eight traffic nodes. This configuration has twice as many traffic nodes as Test-2.

- Test-4 generates traffic to a HAS cluster that consists of two master nodes and 16 traffic nodes. This configuration has twice as many traffic nodes as Test-3.

In addition to testing the HAS architecture prototype, I performed a test on a single machine with a Pentium III processor and 512 MB of RAM. The goal of this test, called Test-0, is to establish the baseline performance and to reveal the performance limitation of a single node.

5.4 Test-0: Experiments with One Standalone Traffic Node

This test consists of generating web traffic to a single standalone server running the Apache web

server software. This test reveals the performance limitation of a single node. We use the results of

this test to define the baseline performance. Apache 2.0.35 was running on this node, and the NFS server hosting the document repository was running on the same network segment.

Table 10 presents the results of the benchmark with a single server node. The results of Test-0 are

consistent with the tests conducted in 2002 and 2003 with an older version of Apache (Section 3.7).

The main lesson to learn from this benchmark is that the maximum capacity for a standalone server is

an average of 1,033 requests per second. If the server receives requests over its baseline capacity, it

becomes overloaded and unable to respond to all of them. Hence, the high number of failed requests

as illustrated in the table below. Table 10 presents the number of clients generating web traffic, the

number of requests per second the server has completed, and the throughput. WebBench generates

this table automatically as it collects the results of the benchmarking test.

Number of Clients    Requests Per Second    Throughput (Bytes/Sec)    Throughput (KBytes/Sec)    Errors (Connection + Transfer)
1_client             143                    825609                    806                        0
4_clients            567                    3249304                   3173                       0
8_clients            917                    5240358                   5118                       0
12_clients           1012                   5769305                   5634                       0
16_clients           1036                   5924397                   5786                       212
20_clients           1036                   6044917                   5903                       456
24_clients           1038                   6058103                   5916                       692
28_clients           1040                   6063940                   5922                       932
32_clients           1037                   6046431                   5905                       1012
36_clients           1037                   6052267                   5910                       1252
40_clients           1037                   6049699                   5908                       1484
44_clients           1029                   6008823                   5868                       1736
48_clients           1032                   6020502                   5879                       1983
52_clients           1033                   6032181                   5891                       2237

Table 10: The performance results of one standalone processor running the Apache web server

Figure 105: The results of benchmarking a standalone processor -- transactions per second

In Figure 105, we plot the results from Table 10: the number of clients versus the number of requests per second. We notice that as we reach 16 clients, Apache is unable to process additional incoming web requests and the scalability curve levels off. Even though we are increasing the number of clients generating traffic, the application server has reached its maximum capacity and is unable to process more requests. With this exercise, we conclude that the maximum number of requests per second we can achieve with a single processor is approximately 1,035. We use this number to measure how our cluster scales as we add more processors.

Figure 106 presents the throughput achieved with one processor. We plot the results from Table 10,

the number of clients versus the throughput in terms of KB/s. The maximum throughput possible with

a single processor averages around 5,800 KB/s. In addition, WebBench provides statistics about

failed requests. Table 10 presents the number of clients generating traffic and the number of failed

requests. Apache starts rejecting incoming requests when we reach 16 simultaneous WebBench

clients generating over 1,300 requests per second.

Figure 106: The throughput benchmarking results of a standalone processor

Figure 107: The number of failed requests per second on a standalone processor

Figure 107 illustrates the curve of successful requests per second combined with the curve of failed

requests per second. As we increase the number of clients generating traffic to the processor, the

number of failed requests increases. Based on the benchmarks with a single node, we can draw two

main conclusions. The first is that a single processor can process up to one thousand requests per

second before it reaches its threshold. The second conclusion is that after reaching the threshold, the

application server starts rejecting incoming requests.
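The short calculation below restates how the baseline is derived from Table 10 (a sketch only; the numbers are re-typed from the table): once the curve levels off at 16 clients and beyond, the requests-per-second column is averaged to obtain the single-node capacity.

    # Requests per second from Table 10, indexed by the number of WebBench clients.
    results = {1: 143, 4: 567, 8: 917, 12: 1012, 16: 1036, 20: 1036, 24: 1038,
               28: 1040, 32: 1037, 36: 1037, 40: 1037, 44: 1029, 48: 1032, 52: 1033}

    # Average the plateau region (16 clients and above), where Apache is saturated.
    plateau = [rps for clients, rps in results.items() if clients >= 16]
    baseline = sum(plateau) / len(plateau)
    print(f"baseline = {baseline:.1f} requests/s")   # 1035.5 with the values above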

5.5 Test-1: Experiments with a 4-node HAS Cluster

This test consists of generating web traffic to the cluster consisting of two master nodes (configured

in the active/standby redundancy model) and two traffic nodes. Table 11 presents the results of the

benchmarking test. The table presents the number of clients generating web traffic, the number of

requests per second the servers have completed, and the resulting throughput. The WebBench

controller machine generates this table automatically as it collects the results of the benchmarking

tests from all the WebBench client machines.

Number of Clients    Requests Per Second    Throughput (Bytes/Sec)    Throughput (KBytes/Sec)    Errors (Connection + Transfer)
1_client             178                    1102472                   1077                       0
4_clients            723                    4388988                   4286                       0
8_clients            1458                   9269198                   9052                       0
12_clients           1744                   10986836                  10729                      0
16_clients           1976                   12455528                  12164                      97
20_clients           2030                   12795912                  12496                      317
24_clients           2034                   12821126                  12521                      547
28_clients           2043                   12871553                  12570                      792
32_clients           2056                   12953497                  12650                      991
36_clients           2062                   12997621                  12693                      1203
40_clients           2087                   13155206                  12847                      1466
44_clients           2089                   13167813                  12859                      1689
48_clients           2086                   13146174                  12838                      1973
52_clients           2089                   13165089                  12857                      2217

Table 11: The results of benchmarking a four-node HAS cluster

The traffic nodes in the HAS cluster start rejecting new incoming requests once WebBench reaches 2,073 generated requests per second (1,976 successful requests plus 97 failed requests). As WebBench adds more web clients to generate traffic, we notice an increase in failed requests, with the number of successful requests remaining almost constant, ranging between 2,030 and 2,089 requests per second.

Figure 108: The number of successful requests per second on a HAS cluster with four nodes

Figure 108 presents the results with a 4-processor cluster, plotting the number of transactions per second versus the number of clients. Figure 109 shows the throughput achieved with a 4-processor cluster as we increase the number of client machines generating traffic to the cluster.

Figure 109: The throughput results (KB/s) on a HAS cluster with four nodes

Figure 110: The number of failed requests per second on a HAS cluster with four nodes

Figure 110 shows the curve of successful requests per second combined with the curve of failed

requests per second. As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.

5.6 Test-2: Experiments with a 6-node HAS Cluster

This test consists of generating web traffic to a HAS cluster made up of two master nodes and four

traffic nodes. Table 12 presents the results of the benchmarking test. This table also lists the number

of client machines generating traffic, the number of successful requests per second, the throughput

achieved, and the number of failed requests per second because of either connection or transfer errors.

Number of Clients    Requests Per Second    Throughput (Bytes/Sec)    Throughput (KBytes/Sec)    Errors (Connection + Transfer)
1_client             176                    1101321                   1075                       0
4_clients            719                    4588988                   4481                       0
8_clients            1460                   8693081                   8489                       0
12_clients           1752                   10986836                  10729                      0
16_clients           2014                   12695058                  12398                      0
20_clients           2350                   14813003                  14466                      0
24_clients           2711                   17088532                  16688                      0
28_clients           3013                   18985857                  18541                      0
32_clients           3405                   21463095                  20960                      0
36_clients           3725                   23473882                  22924                      0
40_clients           4031                   25421634                  24826                      0
44_clients           4101                   26329324                  25712                      189
48_clients           4152                   26662688                  26038                      243
52_clients           4163                   26694742                  26069                      473
56_clients           4167                   26790905                  26163                      784
60_clients           4171                   26810137                  26182                      981
64_clients           4171                   26803711                  26175                      1193
68_clients           4170                   26805003                  26177                      1390
72_clients           4168                   26793795                  26166                      1667
74_clients           4194                   26959538                  26328                      1983
78_clients           4189                   26927397                  26296                      2312
82_clients           4220                   27126669                  26491                      2633
86_clients           4168                   26792407                  26164                      2987
90_clients           4170                   26805263                  26177                      3271
94_clients           4168                   26792407                  26164                      3592

Table 12: The results of benchmarking a HAS cluster with six nodes

The maximum number of successful requests per second is 4,220, and the maximum throughput

reached is 26,491 KB/s.

Figure 111: The number of successful requests per second on a HAS cluster with six nodes

Figure 112: The throughput results (KB/s) on a HAS cluster with six nodes

Figure 113: The number of failed requests per second on a HAS cluster with six nodes

Figure 113 presents the curve of successful requests per second combined with the curve of failed

requests per second. As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.

5.7 Test-3: Experiments with a 10-node HAS Cluster

This test consists of generating web traffic to a HAS cluster made up of two master nodes and eight

traffic nodes. Table 13 presents the results of the benchmarking test. The maximum number of

successful transactions per second is 8,316. The maximum throughput achieved is 51,073 KB/s.

Number of Clients    Requests Per Second    Throughput (Bytes/Sec)    Throughput (KBytes/Sec)
1_client             173                    1125072                   1099
4_clients            725                    4377974                   4275
8_clients            1440                   9258378                   9041
12_clients           1753                   10978510                  10721
16_clients           2015                   12694844                  12397
20_clients           2349                   14797572                  14451
24_clients           2729                   17194114                  16791
28_clients           3022                   19007587                  18562
32_clients           3339                   21372529                  20872
36_clients           3660                   23498460                  22948
40_clients           4042                   25416830                  24821
44_clients           4340                   26272234                  25656
48_clients           4560                   27561631                  26916
52_clients           4800                   29637244                  28943
56_clients           5090                   31580773                  30841
60_clients           5352                   33656387                  32868
64_clients           5674                   35681682                  34845
68_clients           5930                   37298145                  36424
72_clients           6324                   39770012                  38838
76_clients           6641                   41770148                  40791
80_clients           6910                   43462088                  42443
84_clients           7211                   45550281                  44483
88_clients           7460                   46789359                  45693
92_clients           7680                   48292606                  47161
96_clients           7871                   49192039                  48039
100_clients          8052                   50833660                  49642
104_clients          8158                   51116698                  49919
100_clients          8209                   51311680                  50109
104_clients          8278                   52060159                  50840
92_clients           8293                   52154505                  50932
96_clients           8310                   52267720                  51043
100_clients          8312                   52280300                  51055
104_clients          8307                   52248851                  51024
108_clients          8316                   52299169                  51073
112_clients          8313                   52280300                  51055
114_clients          8311                   52274010                  51049
118_clients          8310                   52261431                  51037
122_clients          8306                   52242561                  51018
126_clients          8319                   52318038                  51092
130_clients          8302                   52217403                  50994
134_clients          8308                   52255141                  51030
138_clients          8311                   52267720                  51043
142_clients          8312                   52280300                  51055

Table 13: The results of benchmarking a HAS cluster with 10 nodes

Figure 114 presents the performance curve illustrating the number of successful requests per second achieved with a HAS cluster that consists of two master nodes and eight traffic nodes. The

master nodes are in the 1+1 active/standby model and the traffic nodes follow the N-way redundancy

model, where all traffic nodes are active. Figure 115 shows the throughput curve of the 10-processor

HAS cluster.

Figure 114: The number of successful requests per second on a HAS cluster with 10 nodes

Figure 115: The throughput results (KB/s) on a HAS cluster with 10 nodes

5.8 Test-4: Experiments with an 18-node HAS Cluster

This test consists of generating web traffic to a HAS cluster made up of two master nodes and 16

traffic nodes. This test is the largest we conducted and consists of 18 nodes in the HAS cluster and 32

machines in the benchmarking environment, 31 of which generate traffic. Figure 116 presents the number of successful transactions per second achieved with the 18-processor HAS cluster.

Figure 116: The number of successful requests per second on a HAS cluster with 18 nodes

In this test, the HAS cluster with 18 nodes achieved 16,001 successful requests per second, an

average of 1,000 successful requests per second per traffic node in the HAS cluster.

Figure 117: The throughput results (KB/s) on a HAS cluster with 18 nodes

5.9 Scalability Charts

Table 14 presents the summary of the benchmarking results we collected from testing HAS clusters

consisting of two, four, eight and 16 traffic nodes, respectively; the table includes the results of

benchmarking a single standalone traffic node, which illustrates the baseline performance data.

Traffic Cluster Nodes    Total Transactions    Average Transactions per Traffic Node
1                        1032                  1032
2                        2068                  1034
4                        4143                  1036
8                        8143                  1017
16                       16001                 1000

Table 14: The summary of the benchmarking results of the HAS architecture prototype

For each testing scenario, we recorded the maximum number of requests per second that each

configuration supported. When we divide this number by the number of processors, we get the

maximum number of requests that each processor can process per second in each configuration. Table

14 presents the number of successful transactions per traffic node. The total transactions is the total

number of successful transactions of the full HAS cluster as reported by WebBench. The average

transactions per traffic node is the average number of successful transactions served by a single traffic node in the HAS cluster.

Figure 118: The results of benchmarking the HAS architecture prototype

Figure 118 presents the scalability of the prototyped HAS cluster architecture. Starting with one processor, we established the baseline performance to be 1,032 requests per second. Next, we set up the HAS prototype and performed benchmarking tests as we scaled the number of traffic nodes from two to 16. The HAS cluster maintained approximately 1,000 requests per second per traffic node. As we scaled by adding more traffic nodes to the SSA tier, we lost 3.1% of the baseline performance per traffic node

as defined in Section 5.4. These results represent a considerable improvement compared to the 40% decrease in performance experienced with a cluster built using existing software that uses traditional methods of scaling (Section 3.9).

Figure 119: The scalability chart of the HAS architecture prototype

Figure 119 illustrates the scalability chart of the HAS architecture prototype. The results demonstrate close-to-linear scalability as we increased the number of traffic nodes up to 16. We were able to scale from a standalone single node serving an average maximum of 1,032 requests per second to 18 nodes in the HAS cluster (16 of them serving traffic) with an average of 1,000 requests per second per traffic node. The scaling is achieved with a 3.11% decrease in performance compared to the baseline.
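The per-node averages and the quoted performance loss can be reproduced with the short calculation below (a sketch re-using the numbers from Table 14; rounding the 16-node average down to 1,000 requests per second yields the 3.1% figure quoted above):

    baseline = 1032                                     # Test-0: requests/s for one standalone node
    totals = {2: 2068, 4: 4143, 8: 8143, 16: 16001}     # Table 14: total requests/s per configuration

    for nodes, total in totals.items():
        per_node = total / nodes
        loss = (per_node - baseline) / baseline * 100
        print(f"{nodes:2d} traffic nodes: {per_node:7.1f} requests/s per node ({loss:+.2f}% vs. baseline)")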

5.10 Validation of High Availability

The validation of HA requires specialized testing and poses new challenges. In a real-world situation, validating HA requires special monitoring and reporting tools, running the system for months, and monitoring its usage with a controlled stream of traffic while invoking error cases. The monitoring and reporting tools would capture the details of the system under test and produce

a result matrix. However, since we do not have access to specialized HA testing tools, we performed basic test scenarios to ensure that the claimed HA support is provided and works as described. An

important feature of a highly available system is its ability to continue providing service even when a cluster sub-system fails. Our testing strategy was based on provoking common faults and

observing how this affects the service and if it leads to a service downtime. In our case, the system is

the HAS architecture prototype, and the cluster nodes run web servers that provide service to web

users. If there is a service downtime, users do not get replies to their web requests.

The following sub-sections discuss the high availability testing for the HAS cluster and cover testing

the connectivity (Ethernet connection and routers), data availability (redundant NFS server), master

node, and traffic node availability.

5.10.1 Experiments with Connectivity Availability

Figure 120 illustrates the experiments we performed to test the high availability features of the HAS

architecture.

Figure 120: The possible connectivity failure points

The failures tested include Ethernet daemon failure, failure of Ethernet adapter, router failure,

discontinued communication between the traffic manager and the traffic client daemon and

discontinued communication between the heartbeat instances running on the master nodes. The

experiments included provoking a failure to monitor how the HAS cluster reacts to the failure, how

the failure affects the service provided, and how the cluster recovers from the failure.

5.10.1.1 Ethernet Daemon Becomes Unavailable

For this type of failure, it makes no difference whether the failure takes place on a master node or a traffic node. If we shut down the Ethernet daemon on a cluster node (Figure 120 – experiment 1), there is no direct effect on the service provided.

The one negative impact occurs if the primary Ethernet card (Ethernet Card 1) fails while the Ethernet redundancy daemon is not running to detect the failure and switch to the backup Ethernet card (Ethernet Card 2). As a result, the node becomes isolated from the cluster and the traffic client daemon is not able to connect to the traffic manager. The traffic manager declares the traffic node unavailable and removes it from its list of available traffic nodes. However, this is an unusual failure scenario in which both the primary Ethernet adapter and the Ethernet redundancy daemon become unavailable on the same node.

5.10.1.2 Ethernet Card is Unresponsive

Similar to the use case above, for this type of failure it makes no difference whether it occurs on a master node or a traffic node. If the Ethernet card becomes unresponsive and unavailable (Figure 120 –

experiment 2), the Ethernet redundancy daemon detects the failure (within a range of 350 ms to 400

ms), considers the Ethernet card out of order, and starts the process of the network adapter swap

(Section 4.7.3). As a result, the Ethernet redundancy daemon designates the standby Ethernet card

(Ethernet Card 2) as the primary adapter. Section 6.1.9.3 covers the workings of this scenario. If the

unresponsive Ethernet card receives traffic while it is down, or while the transition to the second

active Ethernet card is not yet complete, new incoming connections will stall, and ongoing

connections are lost.

5.10.1.3 Router Failure

If the router fails (Figure 120 – experiment 3), we consider this failure a network problem and it is

beyond the scope of the HAS cluster architecture.

5.10.1.4 Discontinued Communication between the TM and the TC

This test case examines the scenario where we disrupt the communication between the TM and the

TC (Figure 120 – experiment 4). As a result, the TM stops receiving load alerts from the TC. After a

predefined timeout, the TM removes the traffic node from its list of available traffic nodes and stops

forwarding traffic to it. When we restore communication between the TM and the TC, the TC starts

sending its load messages to the TM. The TM then adds the traffic node to its list of available nodes

and starts forwarding traffic to it. Section 4.27.9 examines a similar scenario.

5.10.1.5 Discontinued Communication between the Heartbeat Instances

This test case examines what happens when we disconnect the communication between the heartbeat

instances running on the master nodes (Figure 120 – experiment 5). This failure scenario depends on

whether the disconnected node is the active master node or the standby master node.

If the heartbeat instance running on the active master node becomes unavailable, then the heartbeat

instance running on the standby node does not receive the keep-alive message. As a result, the

heartbeat instance on the standby node declares the primary master node unavailable, thereby making

the standby master node the active master node and owner of the virtual service. The new master

node starts receiving traffic within a delay of 200 ms.

If the heartbeat instance running on the standby node becomes unavailable, then the heartbeat

instance running on the master node does not receive the keep-alive message. The cluster does not

undergo a reconfiguration. However, the HA tier of the HAS cluster becomes vulnerable to a single point of failure (SPOF), since only one master node remains available, and it is in the active state.

5.10.2 Experiments with Data Availability

The availability of data is essential in a clustered environment where data resides on shared storage

and multiple clients access it. In the HAS architecture prototype, the data is duplicated and available

on two NFS servers through a special implementation of the NFS server code and an updated

implementation of the mount program.

Our testing methodology uses a direct approach. Based on Figure 121, two components can affect

data availability: the availability of the NFS server daemons and the availability of the master nodes.

Master nodes are redundant and the failure of one master node does not affect the availability of data

or access to the provided service. The only scenario that could lead to data unavailability is if the NFS

server daemons running on master nodes were to crash. For this purpose, we have implemented

redundancy in the NFS server code. The two test cases we experimented with are shutting down the

NFS server daemon on a master node and disconnecting the master node from the network. In both

scenarios, there was no interruption to the service provided. Instead, there was a delay ranging

between 450 ms and 700 ms to receive the requested document.

Figure 121: The tested setup for data redundancy

5.10.3 Experiments with Traffic Node Availability

Traffic nodes follow the N redundancy model. If one node fails and becomes unavailable, the traffic

manager forwards traffic to other available traffic nodes. The failures tested include connectivity

problems, hardware and software problems, and TC problems.

Connectivity problems: Section 5.10.1 discusses the validation of high availability connectivity.

Unknown hardware or software problems leading to abnormal node unavailability: If the traffic node

goes to a halt state (power off), the traffic manager declares that node as unavailable (by taking it off

its list of available traffic nodes) since the traffic client daemon becomes unreachable and unable to

report node status to the traffic manager. Section 4.27.9 presents this case scenario.

TC problem: If the traffic client daemon becomes unavailable because of a software or hardware problem, the daemon does not report the node availability and status to the traffic managers. The

traffic node stops receiving traffic. Section 4.27.9 discusses a similar case scenario.

5.10.4 Experiments with Connection Synchronization

To test the connection synchronization mechanism, we open a connection to the virtual service while

the master node is active. Then we cause a fail-over to occur by powering down the master node. At

this point, the connection stalls. Once the CVIP address is failed over to the standby master node, the

connections continues.

Streaming is a useful way to test this, as streaming connections by their nature are open for a long

time. Furthermore, it provides intuitive feedback as the video pauses and then continues. It is of note

that by increasing the buffer size of the streaming client software, the pause is eliminated.

5.11 HA-OSCAR Architecture: Modeling and Availability Prediction

The HA-OSCAR project team evaluated the HA-OSCAR architecture and its system failure and recovery behavior using the Stochastic Reward Nets (SRN) modeling approach, with the goal of measuring the availability improvements introduced by the HA-OSCAR architecture. This section discusses the

Stochastic Reward Nets modeling approach, the Stochastic Petri Net Package (SPNP), the modeled

architecture, and the results.

5.11.1 Stochastic Reward Nets and Stochastic Petri Net Package

The HA-OSCAR project team built an SRN model to simulate the failure-repair behavior of the HA-

OSCAR cluster, and to calculate and predict the availability of the HA-OSCAR cluster. The goal of

the SRN model was to analyze the availability of the HA-OSCAR cluster based on a range of predefined assumptions and parameters.

SRNs are used in the availability evaluation and prediction of complicated systems when the time-dependent behavior is of interest. The team applied SRNs to aid the automated generation and solution of the underlying large and complex Markov chains [134]. The Stochastic Petri Net Package (SPNP) allows the specification of SRN models and the computation of steady-state and transient-state measures.

The HA-OSCAR team used the SPNP to build the HA-OSCAR SRN model and predict the

availability of the system [135].

The Stochastic Petri Net Package (SPNP) [136] is a modeling tool designed for SPN models [137], used for performance, dependability (reliability, availability, safety), and performability analysis of complex systems. It uses efficient and numerically stable algorithms to solve input models based on the theory of SRNs. The SRN models are described in the input

language for SPNP, called CSPL (C-based SPN Language), which is an extension of the C programming language with additional constructs that facilitate easy description of SPN models. Additionally, if the user does not want to describe the model in CSPL, a graphical user interface is available to specify all the characteristics as well as the parameters of the solution method chosen to solve the model [138].

5.11.2 Building the HA-OSCAR Model

The HA-OSCAR project team used the SRN to develop a failure-repair behavior model for the

system, and to determine the high availability of the cluster. They utilized the SPNP to build and

solve the HA-OSCAR SRN model. Figure 122 illustrates how the behavior of the HA-OSCAR

cluster is divided into three sub-models: server, network connection, and client sub-models. The HA-

OSCAR team used the SRN model to study the availability of the HA-OSCAR cluster through the

example illustrated in Figure 122.

Figure 122: The modeled HA-OSCAR architecture, showing the three sub-models

Figure 123 shows a screen shot of the SPNP modeling tool. The HA-OSCAR team also studied the

overall cluster uptime and the impact of different polling interval sizes in the fault monitoring

mechanism.

Figure 123: A screen shot of the SPNP modeling tool

Input Parameter                                       Numerical Value
Mean time to primary server failure, 1/λps            5,000 hrs.
Mean time to primary server repair, 1/µpsr            4 hrs.
Mean time to takeover primary server, 1/µst           30 sec.
Mean time to standby server failure, 1/λss            5,000 hrs.
Mean time to standby server repair, 1/µssr            4 hrs.
Mean time to primary LAN failure, 1/λpl               10,000 hrs.
Mean time to primary LAN repair, 1/µplr               1 hr.
Mean time to takeover primary LAN, 1/µlt              30 sec.
Mean time to standby LAN failure, 1/λsl               10,000 hrs.
Mean time to standby LAN repair, 1/µslr               1 hr.
Mean time to client permanent failure, 1/λp           2,000 hrs.
Mean time to client intermittent failure, 1/λi        1,000 hrs.
Mean time to system reboot, 1/µsrb                    15 min.
Mean time to client reboot, 1/µcrb                    5 min.
Mean time to client reconfiguration, 1/µrc            1 min.
Mean time to client repair, 1/µrp                     4 hrs.
Permanent failure coverage factor, cp                 0.95
Intermittent failure coverage factor, ci              0.95

Table 15: Input parameters for the HA-OSCAR model

Table 15 presents the input parameter values used in the computation of the measurements. These values are for illustrative purposes only. The model makes several assumptions. First, the HA-OSCAR cluster functions properly only if either the primary master node or the standby master node is functioning. Second, there has to be at least one LAN that is functioning properly, either the primary LAN or the standby LAN. Third, the quorum of compute nodes must be present. The "Quorum Value Q" denotes the minimum number of nodes required for the system to keep functioning. For example, in the third row of Table 16, we have an 8 (N) node system; in order to keep the system working, at least 5 (Q) nodes must be alive. The final assumption is that the system is not undergoing any rebooting or reconfiguration.

The availability of the cluster at time t is computed as the expected instantaneous reward rate E[X(t)] at time t, and its general expression is:

E[X(t)] = \sum_{k \in \tau} r_k \pi_k(t)

where r_k represents the reward rate assigned to state k of the SRN, τ is the set of tangible markings, and π_k(t) is the probability of being in marking k at time t [138][139].
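As a toy numerical illustration of this formula (invented values, not taken from the HA-OSCAR model), the reward rate can be set to 1 for markings in which the service is up and 0 otherwise, so that the expected reward rate reduces to the probability of being in an up state:

    # Toy markings with reward rate 1 for "service up" and 0 for "service down".
    rewards       = {"both_masters_up": 1.0, "one_master_up": 1.0, "service_down": 0.0}        # r_k
    probabilities = {"both_masters_up": 0.9992, "one_master_up": 0.0007, "service_down": 0.0001}  # pi_k(t)

    availability = sum(rewards[k] * probabilities[k] for k in rewards)   # E[X(t)]
    print(f"A(t) = {availability:.4f}")   # 0.9999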

5.11.3 The Results

The results of the HA-OSCAR availability prediction study are available in [140] and [141]. The results demonstrate that the component redundancy in HA-OSCAR is effective in improving the availability of the cluster system. Table 16 presents the steady-state cluster availability and the mean cluster downtime per year for the different HA-OSCAR configurations.

System Configuration (N)    Quorum Value (Q)    System Availability (A)    Mean Cluster Downtime (minutes/year)
4                           3                   0.999933475091             34.9654921704
6                           4                   0.999933335485             35.0388690840
8                           5                   0.999933335205             35.0390162520
16                          9                   0.999933335204             35.0390167776

Table 16: System availability for different configurations

We notice that the system availabilities for the various configurations are very close, within a small range of difference. After we introduce the quorum voting mechanism in the client sub-model, the system availability is not sensitive to changes in the client configuration. In Table 16, as we add more clients to improve system performance (N increases in the first column) while keeping the quorum value Q at N/2+1, the availability of the system in the third column remains almost unchanged.
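The downtime column in Table 16 follows directly from the steady-state availability; the sketch below reproduces the first and last rows, assuming a 365-day year (525,600 minutes):

    MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, the convention that matches Table 16

    def downtime_minutes_per_year(availability: float) -> float:
        """Mean downtime per year implied by a steady-state availability."""
        return (1.0 - availability) * MINUTES_PER_YEAR

    for availability in (0.999933475091, 0.999933335204):   # first and last rows of Table 16
        print(f"A = {availability:.12f} -> {downtime_minutes_per_year(availability):.4f} minutes/year")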

Figure 124 illustrates the instantaneous availabilities of the system when it has eight clients and the

quorum is five. The modeling and availability measurements using the SPNP provided the calculated instantaneous availabilities of the system and its parameters.

Figure 124: System instantaneous availabilities

Figure 125 illustrates the total availability (including planned and unplanned downtime) improvement analysis of the HA-OSCAR architecture versus the single-head-node Beowulf architecture [140].

The results show a steady-state system availability of 99.9968%, compared to 92.387% availability for a Beowulf cluster with a single head node [140]. Additional benefits include higher serviceability, such as the ability to incrementally upgrade and hot-swap cluster nodes, operating systems, services, applications, and hardware; this further reduces planned downtime, which benefits the overall aggregate performance.


Figure 125: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture

5.11.4 Discussion

The HA-OSCAR architecture proof-of-concept implementation and the experimental and analysis results suggest that the HA-OSCAR architecture offers a significant enhancement and a promising solution for providing a high-availability Beowulf-class cluster architecture [140][142][143]. The availability of the experimental system improves substantially from 92.387% to 99.9968%. The polling interval for failure detection exhibits a linear relationship with the total cluster availability.

5.12 Impact of the HAS Architecture on Open Source

In 2001, I participated in the Open Cluster Group [144] and proposed the creation of a working group whose goal was to design and prototype a highly available clustering stack for clusters that run mission-critical applications. In 2002, the Open Cluster Group announced the creation of the HA-OSCAR (High Availability Open Source Cluster Application Resource) working group [21], which consisted of Dr. Stephen S. Scott (Oak Ridge National Lab), Dr. Chokchai Leangsuksun (Louisiana Technical University), and the author of this thesis. As a result, we started the open source HA-OSCAR project to provide the combined power of high availability and high performance computing based on the initial HAS architecture.

Figure 125 (data): The HA-OSCAR vs. the Beowulf architecture; total availability impacted by service nodes. Model assumptions: scheduled downtime = 200 hrs, nodal MTTR = 24 hrs, failover time = 10 s, and during maintenance on the head node the standby node acts as primary.

Mean time to failure (hr)    1000      2000      4000      6000      8000      10000
Beowulf availability         0.905797  0.915751  0.920810  0.922509  0.923361  0.923873
HA-OSCAR availability        0.999684  0.999896  0.999951  0.999962  0.999966  0.999968


The goal of the HA-OSCAR project is to enhance the Beowulf cluster system for mission-critical applications, to achieve high availability by eliminating single points of failure, and to incorporate self-healing mechanisms, failure detection and recovery, and automatic failover and failback.

On March 23, 2004, the HA-OSCAR group announced the HA-OSCAR 1.0 release, with over 5000

downloads within the first 24 hours of the announcement. It provides an installation wizard and a

web-based administration tool that allows a user to create and configure a multi-head Beowulf cluster.

Furthermore, the HA-OSCAR 1.0 release supports high availability capabilities for Linux Beowulf

clusters. To achieve high availability, the HA-OSCAR architecture adopts component redundancy to

eliminate SPOF, especially at the head node. The HA-OSCAR architecture also incorporates a self-

healing mechanism, failure detection and recovery, automatic failover and failback [146]. In addition,

it includes a default set of monitoring services to ensure that critical services, hardware components,

and important resources are always available at the control node.

5.13 HA-OSCAR versus Beowulf Architecture

A Beowulf cluster consists of two node types: head node servers and multiple identical client nodes.

A server or head node is responsible for serving user requests and distributing them to clients via

scheduling and queuing software. Clients or compute nodes are dedicated to computation.

Figure 126: The architecture of a Beowulf cluster

Figure 126 illustrates the architecture of a Beowulf cluster. However, the single head-node architecture of a Beowulf cluster is prone to a single point of failure, as is the cluster


communication, where an outage of either can render the entire cluster unusable. There are various techniques to implement a cluster architecture with high availability. These techniques include active/active, active/standby, and active/cold-standby. In the active/active configuration, both head nodes simultaneously provide services to external requests. If one head node goes down, the other node takes over total control. A hot-standby head node, on the other hand, monitors system health and only takes over control if there is an outage at the primary head node. The cold-standby architecture is similar to the hot standby, except that the backup head node is activated from a cold start.

The key effort focused on simplicity by supporting self-cloning of the cluster master node (redundancy and automatic failover). While the aforementioned failover concepts are not new, HA-OSCAR's effortless installation and combined HA and HPC architecture are unique, and its 1.0 release is the first known field-grade HA Beowulf cluster release [34]. The HA-OSCAR experimental and analysis results, discussed in Section 5.11, suggest a significant improvement in availability from the dual-head architecture [145].

Figure 127 illustrates the HA-OSCAR architecture. The HA-OSCAR architecture deploys duplicate master nodes to offer server redundancy, following the active/standby approach, where the primary master node is active and the second master node is on standby [35]. Each node in the HA-OSCAR architecture has two network interface cards (NICs): one has a public network address, and the other connects to the private local network.

The HA-OSCAR project uses the SystemImager [147] utility for building and storing system images, as well as providing a backup for disaster recovery purposes. The HA-OSCAR 1.0 release supports

high availability capabilities for Linux Beowulf clusters. It provides a graphical installation wizard

and a web-based administration tool to allow the administrator of the HA-OSCAR cluster to create

and configure a multi-head Beowulf cluster. In addition, HA-OSCAR includes a default set of

monitoring services to ensure that critical services, hardware components, and certain resources are

always available at the master node. The current version of HA-OSCAR, 1.0, supports active/standby

for the head nodes.


Figure 127: The architecture of HA-OSCAR

5.14 The HA-OSCAR Architecture versus the HAS Architecture

When we established the HA-OSCAR project in 2002, the HA-OSCAR architecture was based on the HAS architecture. Both architectures target high availability at the master node level, aiming to provide redundancy to reduce single points of failure and to support redundant network connections. However, since the applications that run on the HA-OSCAR architecture have different characteristics from the applications running on a HAS architecture, we realized that while both architectures share the same base, they grew to have different characteristics.

5.14.1 Target Applications

The HAS architecture and the HA-OSCAR architecture target different types of applications. The HA-OSCAR architecture aims at HPC applications that focus on dividing programs into smaller pieces and then executing them simultaneously on separate processors within the cluster. On the


other hand, the HAS architecture focuses on client/server applications that run over the web and are characterized by short transactions, short response times, a thin control path, and static data delivery.

5.14.2 Intensity of Jobs

The HAS architecture supports thousands of transactions per second while the HA-OSCAR

architecture, targeted for HPC applications, supports large computing jobs that are divided into

smaller jobs executed on the cluster compute nodes. The HA-OSCAR architecture receives fewer

jobs but the jobs run for longer periods.

5.14.3 Homogeneous versus Heterogeneous Node Hardware

The HA-OSCAR architecture assumes that all cluster nodes are homogeneous, with the same hardware characteristics. Heterogeneous hardware can be used; however, it does not allow taking advantage of the faster nodes in the cluster. In contrast, the HAS architecture accommodates differences in the hardware configurations of the nodes and fully utilizes the resources of each node through its dynamic feedback mechanism.

5.14.4 Dynamic Feedback Mechanism

The HAS architecture supports a dynamic feedback mechanism that allows traffic nodes to report their load to the master nodes and that guides the executor of the traffic distribution policy in making the best possible distribution decision at that moment. Furthermore, this mechanism acts as a keep-alive message: whenever a master node does not receive this message from a traffic node within a pre-defined timeout, the master node stops forwarding requests to that traffic node.
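A minimal sketch of this keep-alive bookkeeping on the master node side is shown below (the names, the data structure, and the 3-second timeout are illustrative assumptions, not taken from the HAS prototype):

    import time

    KEEPALIVE_TIMEOUT = 3.0  # seconds; illustrative value only
    last_report = {}         # traffic node id -> (time of last load report, load index)

    def on_load_report(node_id, load_index):
        """Called whenever a traffic node reports its load; doubles as a keep-alive."""
        last_report[node_id] = (time.monotonic(), load_index)

    def eligible_nodes():
        """Nodes that reported recently enough to keep receiving forwarded traffic."""
        now = time.monotonic()
        return [n for n, (ts, _) in last_report.items()
                if now - ts <= KEEPALIVE_TIMEOUT]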

5.14.5 Traffic Distributions versus Load Balancing

Since the types of applications supported by the HA-OSCAR architecture are different from those supported by the HAS architecture, the underlying distribution mechanisms are distinct. In the HA-OSCAR architecture, distribution is a load balancing function: a single job is divided into smaller jobs that are distributed across the compute nodes. In the HAS architecture, the cluster receives thousands of requests per second, and the traffic manager needs to decide where to forward these incoming requests among the available traffic nodes.


5.14.6 Failure Discovery and Recovery Mechanisms

Both the HA-OSCAR architecture and the HAS architecture support failure detection and recovery mechanisms. However, these mechanisms target different system components with different failure detection and recovery times. Failure recovery in HA-OSCAR takes between 3 and 5 seconds [148], compared to a failure recovery time ranging between 200 ms and 700 ms in the HAS cluster, depending on the type of failure.

5.14.7 Support for the Redundancy Models

The HAS architecture supports the 1+1 active/active redundancy model at the HA tier. This is a

roadmap feature for the HA-OSCAR architecture.

5.14.8 Connection Synchronization

The HAS architecture supports connection synchronization at the HA tier. We discuss this feature in

Section 4.22. The HA-OSCAR architecture does not provide such capabilities.

5.14.9 High Availability of Traffic Nodes

The availability of traffic nodes (also called compute nodes) in the HA-OSCAR architecture is less

important than the availability of traffic nodes in the HAS architecture.

5.14.10 Application Keep-alive Mechanism

The HAS architecture supports a keep-alive mechanism for applications running on traffic nodes. In

the event that an application becomes unavailable, the LDirectord module notifies the executor of the

traffic distribution policy (the TM), and as a result, the TM stops forwarding traffic to the traffic node

with the unavailable application.

5.14.11 Other Differentiations

HA-OSCAR deploys a software utility called SystemImager to provide the installation infrastructure of HA-OSCAR. In addition, the installation wizard of HA-OSCAR uses a graphical user interface to install and set up the HA-OSCAR cluster. In the HAS architecture prototype, the installation infrastructure is built from scratch and does not offer a graphical user interface to guide the administrator in installing a HAS cluster.


5.15 HAS Architecture Impact on Industry

The Open Source Development Labs (OSDL) is a not-for-profit organization founded in 2000 by IT and telecommunication companies to accelerate the growth and adoption of Linux-based platforms and standardized platform architectures. The Carrier Grade Linux (CGL) initiative at OSDL aims to standardize the architecture of telecommunication servers and to enhance the Linux operating system for such platforms.

The CGL Working Group has identified three main categories of application areas into which they

expect the majority of applications implemented on CGL platforms to fall. These application areas

include gateways, signaling, and management servers, and have different characteristics. A gateway,

for instance, processes a large number of small requests that it receives and transmits them over a

large number of physical interfaces. Gateways perform in a timely manner, close to hard real time.

Signaling servers require soft real time response capabilities, and manage tens of thousands of

simultaneous connections. A signaling server application is context switch and memory intensive,

because of the quick switching and capacity requirements to manage large numbers of connections.

Management applications are data and communication intensive. Their response time requirements

are less stringent compared to those of signaling and gateway applications.

Figure 128: The CGL cluster architecture based on the HAS architecture


The OSDL released version 2.0 of the Carrier Grade specifications in October 2003. Version 2.0 of the specifications introduced support for clustering requirements, and its cluster architecture is based on the work presented in this dissertation. Figure 128 illustrates the CGL architecture, which is based on the HAS architecture. In June 2005, the OSDL released version 3.1 of the specification. The Carrier Grade architecture is a standard for the types of communication applications presented earlier.


Chapter 6 Contributions, Future Work, and Conclusion

This chapter presents the contributions of the work, future work, and the conclusions.

6.1 Contributions

The initial goal of this dissertation was to design and prototype the necessary technology demonstrating the feasibility of a web cluster architecture that is highly available and able to scale linearly for up to 16 processors to meet increased web traffic.

Figure 129: The contributions of the HAS architecture

We achieved our goal with the HAS architecture, which supports continuous service through its high availability capabilities and provides close to linear scalability through a combination of mechanisms that include efficient traffic distribution, the cluster virtual IP layer, and the connection synchronization mechanism. Figure 129 illustrates the other contributions, grouped into eight distinct areas: application availability, network availability, data availability, master node availability, connection synchronization, the single cluster IP interface, and traffic distribution. Since the HAS architecture prototype follows the building block approach, these contributions can be reused in different environments outside of the HAS architecture and can function completely independently outside of a cluster environment.

Figure 129 (contents):

- Lightweight and efficient traffic distribution: the Traffic Client daemon, the Traffic Manager daemon, the HAS distribution algorithm, the keep-alive mechanism, and support for heterogeneous cluster nodes.
- Application availability (LDirectord): enhancements to the existing LDirectord implementation; adaptations to integrate LDirectord with the HAS architecture and to communicate with the TC and TM.
- HA Ethernet connection: the Ethernet redundancy daemon; several fixes to the Ethernet device driver that made it more robust and less prone to failures under heavy load.
- Single entry point to the HAS cluster (cluster IP interface): concept and prototype; full transparency.
- Highly available network file system (data redundancy): support for HA NFS in the Linux kernel; a modified mount program to support mounting dual redundant NFS servers.
- Connection redundancy: connection synchronization adaptations and integration with the HAS prototype, with improved integration with heartbeat; saru adaptations and integration with the HAS prototype.
- HA tier nodes heartbeat (HBD): re-wrote parts of the existing implementation to decrease failure detection time; adaptations to integrate with the HAS architecture.
- HAS architecture: a building block approach for scalable and HA web clusters; software components that can be re-used in other environments and for other architectures; support for different redundancy models; dynamic addition of traffic nodes as traffic increases; infrastructure for cluster membership, cluster storage, fault management, recovery mechanisms, and traffic distribution; and the capability of maintaining baseline performance per node for up to 16 traffic nodes.

The HAS architecture is based on loosely coupled nodes and provides a building block approach for designing and implementing software components that can be re-used in other environments and architectures. It provides the infrastructure for cluster membership, cluster storage, fault management, recovery mechanisms, and traffic distribution. It supports various redundancy models for each tier of the architecture and enables seamless software and hardware upgrades without interruption of service. In addition, the HAS architecture is able to maintain baseline performance for up to 18 cluster nodes (16 traffic nodes), validating close to linear scaling.

The HAS architecture integrates these contributions within a framework that allows us to build

scalable and highly available web clusters. The following sub-sections examine these contributions.

6.1.1 Highly Available and Scalable Architecture

We have achieved our goal of a highly available and scalable system architecture that provides close to linear scalability for up to 16 nodes. The HAS architecture is a generic cluster-based architecture targeted at web servers. A HAS cluster consists of multiple loosely coupled nodes connected through a high-speed network. The loosely coupled cluster model is a basic clustering technique that is suitable for web servers and similar types of application servers. We do not exclude specializations; they can be handled in the future to meet the requirements of specialized applications.

6.1.2 Scalability and Capacity

The HAS architecture is characterized by the ability to add server nodes to the cluster without affecting serviceability or uptime. When we experience an increase in traffic, we can add more traffic nodes to the SSA tier and gain additional capacity. The benchmarking results (Section 5.2) demonstrate that the HAS architecture prototype is able to sustain the baseline performance per processor as we increase the number of processors in the HAS cluster, with a minimal impact of -3.1% on performance (Section 5.9) for up to 16 traffic processors. Figure 119 illustrates the close to linear scalability achieved with the HAS architecture prototype, which consists of two master nodes and 16 traffic nodes. In addition, because the SSA tier of the HAS architecture supports the N-way


redundancy model, the architecture does not force us to deploy traffic nodes in pairs. As a result, we

can deploy exactly the right number of traffic nodes to meet our traffic demands without having

traffic nodes sitting idle. In addition, we are able to scale each tier of the architecture independently of

the other tiers.

6.1.3 Support for Heterogeneous Hardware

The HAS cluster architecture supports heterogeneous hardware. Typically, clusters are deployed with nodes that are homogeneous, with the same hardware configurations. A HAS cluster can consist of

traffic nodes with varying processor speed and RAM capacity and still achieve efficient resource

utilization largely because of the efficient and lightweight traffic management scheme.

6.1.4 Lightweight and Efficient Traffic Distribution

Among the surveyed works, the SWEB project (Section 2.10.3) and the Hierarchical Redirection-

Based Web Server Architecture (Section 2.10.1) considered using a load feedback mechanism to

monitor the load of their cluster nodes. Other surveyed work proposed mechanisms and distribution

schemes that do not take into consideration the capacity of each traffic node. For the HAS

architecture, there was a need either to implement continuous load monitoring or to enhance an available traffic distribution implementation with load monitoring capability. The HAS traffic

distribution scheme monitors the load of the traffic nodes using multiple metrics, and uses this

information to distribute incoming traffic among the traffic nodes (Section 4.23.2). The traffic

distribution scheme enables the traffic manager to detect and avoid situations where nodes in the

cluster are operating inefficiently because of overloading. This scheme enables the cluster to operate

at or near full capacity in overload situations, which is in contrast to conventional cluster architectures

that are subject to congestion and collapse under overload.

In addition, the HAS traffic distribution scheme integrates a keep-alive mechanism that allows the master node to know when a traffic node is available for service and when it is not, because of either software or hardware problems.

Section 4.23 presents the HAS traffic management scheme. The scheme is scalable (Section 5.2) and

does not present a performance bottleneck for up to 16 traffic nodes. It can be extended to support

additional distribution algorithms, to monitor additional system resources (such as network

bandwidth), and to support the concept of cluster sub-zones (discussed as a future work item).


6.1.5 High Availability

The architecture tiers support two essential redundancy models: the 1+1 (active/standby and active/active) and the N-way redundancy models. As a result, the HAS cluster architecture achieves high availability through redundancy at various levels of the architecture: network, processors, application servers, and data storage. This allows us to perform actions such as reconfiguring the network settings and upgrading the hardware, software, and operating system without service downtime. Other areas of contribution include mechanisms to detect and recover from Ethernet failures, master node failures, NFS failures, and application (web server software) failures.

6.1.6 Cluster Single IP Interface

One of the challenges of clustering is the ability to present the cluster nodes to the outside world as a

single entity without creating a bottleneck that would affect the performance. The proposed approach

offers a transparent solution to web clients, and allows web clients to reach an unlimited number of

servers (Section 4.21). The proposed CVIP approach provides many advantages in the area of a single cluster interface: transparency, scalability (it supports high traffic loads), availability, fault tolerance, and dynamic connection distribution.

6.1.7 Continuous Service

The ability to synchronize connections provides us with continuous service even in the event of software or hardware failures. Section 4.22.2 discusses the master/slave approach for providing connection synchronization between the master nodes in the HA tier of the HAS cluster. The approach allows ongoing connections to continue in progress even when the active master node fails.

6.1.8 Community and Industry Contributions

This work has served as a significant contributor to industry standards and to the open source community through the HA-OSCAR project. The HA-OSCAR architecture has remarkably influenced the thinking in high performance computing. Researchers and systems designers of HPC cluster-based systems are using the HA-OSCAR architecture as a concept prototype for a highly available HPC cluster. The HA-OSCAR project is the first prototype of an architecture that provides high availability for high performance computing platforms. The first release of HA-OSCAR generated over five thousand downloads within the first 24 hours of its release, followed by thousands


more in the weeks and months after. This public rush is an indication of the community of users who are using, testing, and deploying the HA-OSCAR architecture for their specific needs. It is also important to note that there is an active community of users on the HA-OSCAR project discussion board and mailing list. The HA-OSCAR architecture is based on the HAS architecture and provides an open source implementation that is freely available for download, with a substantial user community.

Section 5.15 described our contributions to the Carrier Grade specifications that define an

architectural model for telecommunication platforms providing voice and data communication

services. The Carrier Grade architecture model is an industry standard largely based on the HAS

architecture.

6.1.9 Source Code Contributions

To build the HAS architecture prototype, several software components were prototyped, and other existing software was enhanced and modified to suit our needs. Since the beginning, the objective was to keep these software components generic, scalable in terms of performance and future feature additions, and less dependent on each other. For some components we were able to reach that goal; for others we were not, and we plan further improvements as discussed in the following subsections.

We did not target the implementations at a specific cluster deployment model, such as a web cluster; rather, we sought to maintain independence from any specific deployment case.

In order to build the HAS web cluster, redundancy features for the network file system and Ethernet

connections were implemented at the kernel level and in user space in the Linux operating system. A

part of the work behind this thesis involved writing the source code for the various system

components.

The following sub-sections describe the source code contributions and their state of development.

6.1.9.1 Redundant NFS Server

It was essential to have this functionality in place to enable highly available shared storage that is

accessible to all cluster nodes without a SPOF. As a result, in the event that the primary NFS server

running on the active master node fails, the secondary NFS server running on the standby node

continues to provide storage access without service discontinuity and transparently to end users. The


rsync utility provides the synchronization between the two NFS servers. Table 17 lists the Linux

kernel files modified to support the NFS redundancy.

Linux Kernel - Modified Source Files    Description of Changes
/usr/src/linux/fs/nfs/inode.c           Added support for NFS redundancy
/usr/src/linux/net/sunrpc/sched.c       Raise the timeout flag when there is one
/usr/src/linux/net/sunrpc/clnt.c        Raise the timeout flag when there is one

Table 17: The changes made to the Linux kernel to support NFS redundancy

The implementation of the HA NFS server is stable; however, it requires upgrading to the latest stable Linux kernel release, version 2.6.

Furthermore, the HA NFS implementation requires a new implementation of the mount program to support mounting multi-host NFS servers instead of a single file server mount. This functionality is provided: in the new mount program, the addresses of the two redundant NFS servers are passed as parameters to the program and then to the kernel. The new command line for mounting two NFS servers looks as follows:

    % mount -t nfs server1,server2:/nfs_mnt_point

6.1.9.2 Heartbeat Mechanism

We re-used the heartbeat mechanism developed initially by the Linux-HA project after making several adaptations to the source code and re-writing part of the original source code, with optimizations that resulted in a decrease of the failure detection delay from 200 ms to 150 ms. Furthermore, we expect additional improvements to the failure detection time, down to 100 ms from 150 ms, with further optimizations and re-writing of parts of the original source code. The limitation of the current implementation is that it supports only two nodes. As future work, we would like to investigate using the heartbeat mechanism with more than two nodes in the HA tier and supporting the N-way redundancy model.
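As a purely illustrative sketch of the failure detection idea (not the Linux-HA heartbeat code; the peer address, port, interval, and timeout below are assumptions), one node can periodically send a small UDP datagram to its peer, which declares a failure when no heartbeat arrives within the detection delay:

    import socket
    import time

    PEER = ("192.168.1.2", 9694)   # hypothetical peer address and port
    INTERVAL = 0.05                # send a heartbeat every 50 ms
    DETECTION_DELAY = 0.15         # declare failure after 150 ms of silence

    def send_heartbeats():
        """Emit periodic heartbeats over UDP to the peer node."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(b"alive", PEER)
            time.sleep(INTERVAL)

    def watch_peer():
        """Declare the peer failed if no heartbeat arrives within the delay."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", PEER[1]))
        sock.settimeout(DETECTION_DELAY)
        while True:
            try:
                sock.recvfrom(16)
            except socket.timeout:
                print("peer failure detected; initiating failover")
                break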

6.1.9.3 Redundant Ethernet Connections

The goal with this feature is to maintain Ethernet connectivity at the server node level. The

implemented Ethernet redundancy daemon monitors the link status of the primary Ethernet port. On

link down, the route of the first port is deleted, and incoming traffic is directed to the second Ethernet


port. When the link goes up again, the daemon waits to make sure the connection does not drop again,

and then switches back to the primary Ethernet port.
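The following is a minimal conceptual sketch of that monitor-and-switch loop (an illustration under stated assumptions, not the actual daemon: it polls the Linux carrier flag under /sys/class/net and swaps a default route with the ip command; the interface names, gateway, polling period, and stabilization delay are invented):

    import subprocess
    import time

    PRIMARY, BACKUP = "eth0", "eth1"   # hypothetical interface names
    GATEWAY = "192.168.1.1"            # hypothetical default gateway
    POLL = 0.1                         # polling period in seconds
    STABLE_FOR = 5.0                   # primary must stay up this long before switching back

    def link_up(iface):
        """Carrier flag is '1' when the link is up; treat read errors as link down."""
        try:
            with open("/sys/class/net/%s/carrier" % iface) as f:
                return f.read().strip() == "1"
        except OSError:
            return False

    def use_route(iface):
        """Point the default route at the given interface."""
        subprocess.run(["ip", "route", "replace", "default",
                        "via", GATEWAY, "dev", iface], check=False)

    def monitor():
        active, up_since = PRIMARY, None
        while True:
            if active == PRIMARY and not link_up(PRIMARY):
                use_route(BACKUP)          # primary link lost: fail over
                active, up_since = BACKUP, None
            elif active == BACKUP and link_up(PRIMARY):
                up_since = up_since or time.monotonic()
                if time.monotonic() - up_since >= STABLE_FOR:
                    use_route(PRIMARY)     # primary stable again: switch back
                    active = PRIMARY
            elif active == BACKUP:
                up_since = None            # primary link still down; reset the timer
            time.sleep(POLL)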

In addition to this contribution, smaller supporting contributions included fixes and re-writes of the Ethernet device driver; all of these supporting contributions are now integrated into the original Ethernet device driver code in the Linux kernel.

Further improvements to the current implementation include stabilizing the source code and optimizing the performance of the Ethernet daemon, which includes optimizing the failure detection time of the Ethernet driver. In addition, the source code of the Ethernet redundancy daemon is to be upgraded to run on the latest release of the Linux kernel, version 2.6.

6.1.9.4 Traffic Manager Daemon

The traffic manager runs on the master nodes. It receives load announcements from the traffic client daemons running on the traffic nodes, computes the rank of each node, and updates its internal list of available traffic nodes and their loads. The traffic manager is also responsible for distributing incoming traffic to the available traffic nodes based on their load.
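A minimal sketch of this bookkeeping and selection logic follows (the ranking rule, names, and data structures are illustrative assumptions; the actual HAS distribution algorithm is described in Section 4.23):

    node_loads = {}  # traffic node id -> most recently reported load index (lower is better)

    def on_load_announcement(node_id, load_index):
        """Update the internal list of traffic nodes when a load report arrives."""
        node_loads[node_id] = load_index

    def remove_node(node_id):
        """Drop a node that timed out or whose application was reported unavailable."""
        node_loads.pop(node_id, None)

    def select_node():
        """Pick the traffic node with the best (lowest) load index for the next request."""
        return min(node_loads, key=node_loads.get)

    # Example: three nodes with different load indices; requests go to the least loaded.
    for node, load in [("tn1", 0.42), ("tn2", 0.17), ("tn3", 0.85)]:
        on_load_announcement(node, load)
    print(select_node())  # -> tn2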

The current implementation of the traffic manager can benefit from several improvements, including further testing and stabilization of the source code, and optimizing insertions and updates of the list of traffic nodes and their load indices. As future work, we would like to investigate the possibility of merging the functionalities of the saru module (connection synchronization) with the traffic manager.

6.1.9.5 Traffic Client Daemon

The traffic client is a daemon process that runs on each traffic node. It collects processor and memory

information from the /proc file system, computes the load index, and communicates it to the traffic

managers running on master nodes.
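The sketch below illustrates the idea (an illustration under assumptions only: the load index formula, the weights, the report format, and the UDP transport are invented and are not the prototype's actual protocol). It samples /proc/loadavg and /proc/meminfo, combines them into a single index, and sends the index to a master node:

    import socket
    import time

    MASTER = ("10.0.0.1", 9200)  # hypothetical traffic manager address and port

    def read_load_index():
        """Combine CPU load and memory pressure from /proc into one index (0 = idle)."""
        with open("/proc/loadavg") as f:
            one_min_load = float(f.read().split()[0])
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                meminfo[key] = int(value.split()[0])  # values are in kB
        mem_used_ratio = 1.0 - float(meminfo["MemFree"]) / meminfo["MemTotal"]
        # Illustrative weighting of CPU load versus memory usage.
        return 0.7 * one_min_load + 0.3 * mem_used_ratio

    def report_forever(node_id, period=1.0):
        """Periodically send the load index to the traffic manager."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            message = "%s %.3f" % (node_id, read_load_index())
            sock.sendto(message.encode(), MASTER)
            time.sleep(period)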

The current implementation of the traffic client is stable because of the extensive testing it went through. As future work, we plan to investigate how to obtain a better representation of the load index, for instance by adding a network bandwidth parameter to the load index computation. In addition, we would like to investigate the possibility of merging the functionalities of the LDirectord module with the traffic client, resulting in a single system software component that reports the node load index and monitors the health of the application server running on the traffic node.


6.1.9.6 Cluster virtual IP Interface (CVIP)

The CVIP interface is a cluster virtual IP interface that presents the HAS cluster as a single entity to

the outside world, making all nodes inside the cluster transparent to end users. Section 4.21 discusses

the CVIP interface. Sections 6.2.7 and 6.2.8 discuss the future work items for CVIP.

6.1.9.7 LDirectord

The improvements and adaptations to the LDirectord module include capabilities to connect to the

traffic manager and the traffic client. The current implementation is not fully optimized; rather, the

implementation is a working prototype that requires further testing and stabilization. Furthermore, as

future work, we would like to minimize the number of sequential steps to improve the performance,

and investigate the possibility of integrating the LDirectord module with the traffic client running on

the traffic node.

6.1.9.8 Connection Synchronization

The improvements and adaptations to the saru module include improving capabilities to connect and

talk to the heartbeat mechanism. The current implementation of the saru module relies on the

redundancy relationship between master nodes. As future work, we plan to migrate to the peer-to-peer

approach discussed in Section 4.22.4.

6.1.10 Other Contributions

The work has also contributed several papers, technical reports, tutorials, and presentations, in addition to offering contributions to existing projects and to the industry.

- Referenced papers, technical reports and white papers: [30], [96], [99], [100], [103], [107], [108],

[117], [146], [145], [149], [150], [151], [152], [153], [154], [155], [156], [157], [158], [159],

[160], [161], [162], [163] and [164].

- Conference presentations and tutorials: [165], [166], [167], [168], [169], [170], [171], [172],

[173], [174], [175], [176], [177], [178], [179], [180], [181] and [182].

- In the process of evaluating existing works, we benchmarked existing solutions, and provided

direct input to the specific projects. A major advantage we had was the availability of a test lab

that allowed us to perform large scale testing for performance and scalability.


- We ported the Apache and Tomcat web servers to support IPv6 and performed benchmarking tests to compare with benchmarking tests of Apache and Tomcat running over IPv4.

- As part of the dissertation, we needed a flexible cluster installation infrastructure that would help us build and set up clusters within hours instead of days, and that would accommodate nodes with disks, diskless nodes, and network boot. This infrastructure did not exist, and we had to design and build it from scratch. This cluster installation infrastructure is now being used at the Ericsson Research lab in Montréal, Canada.

- Other contributions included the influence of the work on the industry. The Carrier Grade Linux specifications are industry standards with a defined architecture for telecom platforms and applications running on telecom servers in mission-critical environments. The Carrier Grade architecture relies on the work proposed in this thesis, with minor modifications to accommodate specific types of telecommunication applications. The author of the thesis is publicly recognized as a contributor to the Carrier Grade specification. Furthermore, since January 2005, he has been employed by the OSDL to focus on advancing the specifications and the architecture.

6.2 Future Work

The following sub-sections propose several areas that would extend and improve the current work. With the exception of the maintenance activities to improve the current HAS prototype, the areas of interest range from running additional benchmarking tests to implementing new system components and providing new functionalities. In addition, investigating linear scalability beyond 16 nodes is a high-priority future work item. The top two future work items are the redundancy configuration manager and the cluster configuration manager.

6.2.1 Support Linear Scalability Beyond 16 Nodes

With the HAS architecture, we were able to reach a near linear scalability for up to 16 traffic nodes

(Section 5.9). The next major goal is to investigate how to accomplish higher levels of scalability for

clusters consisting of 32 nodes and beyond. One of the important further investigations in this area

relates to the scaling overhead of the cluster. In the HAS cluster architecture, the scaling overhead has

been minimal but limited to 16 nodes. As the cluster scales in the number of nodes, many potential

bottlenecks affect the level of performance such as storage, communication, application servers, and


traffic distribution mechanism. The goal of this activity is to investigate the source of bottlenecks and

explore solutions.

6.2.2 Traffic Client Implementation

In a future version, we plan to allow the traffic client to receive notification that a master node is not

available directly from the heartbeat mechanism. With such information, the traffic client stops

reporting its load index to a master node that is unavailable, minimizing its communication overhead

and resulting in less network traffic.

6.2.3 Additional Benchmarking Tests

Because of limitations such as timing and access to lab hardware, we were not able to conduct more benchmarking in the lab against the HAS prototype. As future work, we would like to establish a larger benchmarking environment that consists of more test machines to generate additional traffic into the HAS cluster. Furthermore, we would like to benchmark the HAS cluster with the master nodes in the HA tier in load-sharing mode, i.e. the 1+1 active/active model (Figure 130).

Figure 130: The untested configurations of the HAS architecture

We expect to achieve high performance levels when both master nodes are receiving incoming traffic

and forwarding it to traffic nodes. Furthermore, we would like to benchmark the HAS architecture

prototype using specialized storage nodes and compare the results with those obtained when using the HA NFS implementation to provide storage. These tests will give us insight into the most efficient storage solution.

Figure 130 (contents): the untested HAS architecture configurations combine master nodes in the 1+1 model (active/standby and active/active) with traffic nodes in the N-active and N-active/M-standby models. The two depicted examples are: (a) 1+1 active/active master nodes with N active traffic nodes and HA NFS storage, and (b) 1+1 active/active master nodes with N active and M standby traffic nodes using specialized storage nodes and HA NFS storage.


6.2.4 Redundancy Configuration Manager

The current prototype of the HAS architecture supports neither dynamic changes to the redundancy configurations nor transitioning from one redundancy configuration to another. This feature would be very useful: when the nodes reach a certain pre-defined threshold, the redundancy configuration manager would, for example, transition the HA tier from the 1+1 active/standby to the 1+1 active/active model, allowing both master nodes to share and service incoming traffic. Such a transition in the current HAS architecture prototype requires stopping all services on the master nodes, updating the configuration files, and restarting all software components running on the master nodes.

The redundancy configuration manager would be the entity responsible for switching the redundancy configuration of the cluster tiers from one redundancy model to another. For instance, when the SSA tier is in the N+M redundancy model, the redundancy configuration manager would be responsible for activating a standby traffic node when a traffic node becomes unavailable. As such, the configuration manager should be aware of the active traffic nodes, the states of their components, and their corresponding standby traffic nodes.
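As a sketch of the intended behavior (entirely hypothetical, since this component does not exist in the prototype; the threshold and function names are invented), the redundancy configuration manager could watch the load of the active master node and the pool of traffic nodes and trigger transitions such as:

    ACTIVE_STANDBY = "1+1 active/standby"
    ACTIVE_ACTIVE = "1+1 active/active"
    LOAD_THRESHOLD = 0.8  # hypothetical pre-defined threshold

    def ha_tier_model(active_master_load, current_model):
        """Promote the HA tier to active/active when the active master is saturated."""
        if current_model == ACTIVE_STANDBY and active_master_load >= LOAD_THRESHOLD:
            return ACTIVE_ACTIVE   # both master nodes now share incoming traffic
        return current_model

    def replace_failed_traffic_node(active, standby, failed):
        """N+M model in the SSA tier: activate a standby node when an active one fails."""
        active.discard(failed)
        if standby:
            active.add(standby.pop())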

6.2.5 Cluster Configuration Manager

The current HAS cluster prototype does not provide a central location for managing the cluster

configuration files required by all the system software. As a result, the process of locating and editing

all configuration files needed to run the cluster is a daunting experience. There is a need for a central

entity, preferably with a user-friendly graphical user interface, that manages all the configuration files

that control the operation of the various software components. This entity, a cluster configuration

manager, would be a one-stop configuration point that enables easy and centralized configuration.

6.2.6 Monitoring Load of Master Nodes

We would like to expand the current HAS implementation to include a logging mechanism that

monitors and reports the load of master nodes such as processor usage, free memory and network

bandwidth. We can then use this information to determine the impact of incoming traffic on the

performance and scalability of master nodes. Such information can also help us identify the node sub-

systems that suffer from bottlenecks, which is particularly useful in a controlled lab environment

when running benchmarking tests and generating thousands of requests per second.


6.2.7 Merging the Functionalities of the CVIP and Traffic Management Scheme

One possibility for further investigation is to couple the functionalities of the cluster virtual IP interface with the traffic management scheme. With the current implementation, incoming traffic arrives at the cluster through the cluster virtual IP interface and is then handled by the traffic manager before it reaches its final destination on one of the traffic nodes. A future enhancement is to eliminate the traffic management scheme and incorporate traffic distribution within the CVIP interface. Eliminating the traffic manager daemon and integrating its functionalities with the CVIP would result in increased performance and a faster response time, as we eliminate one serialized step in managing an incoming request.

In short, our proposal is to combine the functionalities of the cluster virtual IP interface and the traffic distribution mechanism to eliminate a traffic forwarding step between web users and the application server running the web server.

6.2.8 Eliminating the Need for Master Nodes

In the quest for maximum performance, we need to eliminate as many pre-processing steps as

possible to speed up the process of replying to the web clients. One interesting investigation is the

possibility of eliminating the need for master nodes and distributing their functionalities among traffic

nodes. With that model, we can eliminate a step of pre-processing the request; however, we create

additional challenges such as how to perform efficient traffic distribution and provide cluster services.

One interesting approach is to elect traffic nodes to perform selected master node functions based on well-defined criteria.

6.2.9 Supporting Cluster Zones and Specialized Traffic Nodes

A cluster can host several applications that provide a range of services for the cluster users. This

requirement influences the way we manage the cluster and implies the need for node specialization,

where specific nodes that constitute a cluster zone are dedicated to task-specific services. The concept

of cluster zone is as follows: since the cluster is composed of multiple nodes, we divide the cluster

into sub-clusters, called cluster zones or simply zones. Each zone provides a specific type of service

through defined cluster nodes and as a result, it receives specific type of traffic to those nodes. We


have two main challenges in this area: the first is to provide the virtualization of the cluster zones, and the second is the ability to migrate cluster nodes dynamically between zones based on traffic trends.

Figure 131: The architecture logical view with specialized nodes

Figure 131 illustrates the concept of cluster zones. The cluster in the figure consists of three zones: one provides HTTP service, another provides FTP service, and the third provides streaming service. In (1), the FTP cluster zone is receiving traffic, while some nodes in the HTTP cluster zone are sitting idle due to low traffic. The traffic manager running on the master node disconnects (2) two nodes from the HTTP cluster zone and transitions (3) them into the FTP cluster zone to accommodate the increase in FTP traffic. There are several possible areas of investigation, such as defining cluster


zones as logical entities in a larger cluster, dynamic node(s) selection to be part of a specialized

cluster zone, transitioning the node into the new zone, and investigating queuing theories suitable for

such usage models.

6.2.10 Peer-to-Peer Connection Synchronization

The peer-to-peer approach to connection synchronization is an improvement over the master/slave approach. In the peer-to-peer approach, each master node sends synchronization information for the connections it is handling to the other master node. In the event that a master node fails, the connections are already synchronized and are able to continue even after multiple failovers.
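A minimal sketch of the peer-to-peer idea follows (an illustration under assumed names and message format, not the saru implementation): each master keeps the table of connections it owns, pushes updates to its peer, and can serve the combined set if the peer fails.

    import json
    import socket

    class ConnectionSync:
        """Symmetric (peer-to-peer) synchronization of connection state between masters."""

        def __init__(self, peer_addr):
            self.peer_addr = peer_addr   # (host, port) of the other master node
            self.local = {}              # connections this master is handling
            self.remote = {}             # connections learned from the peer
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

        def handle_connection(self, conn_id, state):
            """Record a connection we own and push its state to the peer."""
            self.local[conn_id] = state
            self.sock.sendto(json.dumps({conn_id: state}).encode(), self.peer_addr)

        def on_peer_update(self, payload):
            """Merge an update received from the peer."""
            self.remote.update(json.loads(payload))

        def take_over(self):
            """On peer failure, the surviving master serves both sets of connections."""
            merged = dict(self.remote)
            merged.update(self.local)
            return merged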

6.2.11 Eliminating Traffic Manager as a SPOF

The traffic managers running on the master nodes in the HA tier present a SPOF. In the event that the traffic manager fails on the active node, the traffic client (running on the traffic nodes) continues reporting its load index to the failed traffic manager until it receives a timeout.

6.2.12 Eliminating Saru Module as a SPOF

The saru module is the system software that runs on the master nodes. It distributes incoming traffic between the two master nodes (when in the 1+1 active/active model) following the round-robin method. The saru module presents a SPOF. If the saru module fails on the master node, then the routed process will not be able to send incoming requests to the saru module and the cluster comes to a halt. The solution to this problem is to make the routed process aware that the second master node also has an inactive copy of the saru module running on it. When the routed process fails to send requests to the saru module running on master node 1, it should send them to the saru module running on master node 2.

6.2.13 Failure of LDirectord

The goal of this future work is to improve the current implementation of the LDirectord and make it

resilient to failures.


6.3 Conclusion

This dissertation covers a range of technologies for highly available and scalable web clusters. It

addresses the challenges of designing a scalable and highly available web server architecture that is

flexible, component-based, reliable, and robust under heavy loads.

The first chapter provides a background on Internet and web servers, scalability challenges, and

presents the objectives and scope of the study.

The second chapter looks at clustering technologies, scalability challenges, and related work. We

examine clustering technologies and techniques for designing and building Internet and web servers.

We argue that traditional standalone server architecture fails to address the scalability and high

availability need for large-scale Internet and web servers. We introduce software and hardware

clustering technologies, their advantages and drawbacks, and discuss our experience prototyping a

highly available, and scalable, clustered web server platform. We present and discuss the various

ongoing research projects in the industry and academia, their focus areas, results, and contributions.

In the third chapter, the thesis summarizes the preparatory technical work with the prototyped web

cluster that uses existing components and mechanisms.

Chapter four presents and discusses the HAS architecture: its components and their characteristics, the elimination of single points of failure, the conceptual, physical, and scenario architecture views, the redundancy models, the cluster virtual interface, and the traffic distribution scheme. The HAS architecture consists of a network of server nodes connected over highly available networks. A virtual IP interface provides a single point of entry to the cluster. The software and hardware components of the architecture do not present a SPOF. The HAS cluster architecture supports multiple redundancy models for each of its tiers, allowing the most suitable redundancy model to be chosen for each deployment scenario. The HAS architecture manages incoming traffic from web clients through a lightweight, efficient, and dynamic traffic distribution scheme that takes into consideration the capacity of each traffic node. Based on the performance testing we conducted, this approach has proven to be an effective method of distributing traffic.
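The distribution scheme itself is described in chapter four rather than reproduced here; as a rough illustration of the idea of capacity-aware distribution, the sketch below weights traffic nodes by a reported load index. The node names, the meaning of the index, and the weighting rule are illustrative assumptions and not the exact scheme used in the prototype.

import random

# Hypothetical per-node state: a load index in [0, 100] reported by each traffic
# node, where a higher value is taken to mean more spare capacity.
load_index = {"traffic1": 80, "traffic2": 45, "traffic3": 10}

def pick_traffic_node() -> str:
    """Pick a traffic node with probability proportional to its reported spare
    capacity, so lightly loaded nodes receive more of the new connections."""
    nodes = list(load_index)
    weights = [load_index[node] for node in nodes]
    if sum(weights) == 0:
        return random.choice(nodes)  # every node is saturated: fall back to uniform
    return random.choices(nodes, weights=weights, k=1)[0]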

In chapter five, we validate the scalability of the architecture. Our results demonstrate that the HAS architecture is able to reach close to linear scaling for up to 16 processors and to attain high performance levels with robust behavior under heavy load. In addition, the chapter presents the results of the tests that validate the availability features of the HAS architecture.

The final chapter presents the contributions and future work.


The HAS architecture brings together aspects of high availability, concurrency, dynamic resource management, and scalability into a coherent framework. Our experience with and evaluation of the architecture demonstrate that the approach is an effective way to build robust, highly available, and scalable web clusters. We have developed an operational prototype based on the HAS architecture; it serves as a proof of concept and consists of the set of necessary system software components.

The HAS architecture relies on the integration of many system components into a well-defined and generic cluster platform. It provides a cluster membership service to recognize and manage node membership in the cluster, a cluster storage service, a fault management service to detect hardware and software faults and trigger the recovery mechanisms, and a traffic distribution service to distribute the incoming traffic across the nodes in the cluster. The HAS architecture represents a new design point for large-scale Internet and web servers, supporting scalability, high availability, and high performance.


Bibliography

[1] K. Coffman, A. Odlyzko, The Growth Rate of the Internet, Technical Report, First Monday,

Volume 3 Number 10, October 1998, http://www.firstmonday.dk/issues/issue3_10/coffman

[2] E. Brynjolfsson, B. Kahin, Understanding the Digital Economy: Data, Tool, and Research, MIT

Press, October 2000

[3] Yahoo! Fourth Quarter 2001 Financial Report, http://docs.yahoo.com/docs/pr/4q01pr.html

[4] America Online, Press Data Points, http://corp.aol.com/press/press_datapoints.html

[5] K. McNaughton, Is eBay too popular?, CNET News.com, March 1, 1999,

http://news.cnet.com/news/0-1007-200-339371.html

[6] E. Hansen, Email outage takes toll on Excite@Home, CNET News.com, June 28, 2000,

http://news.cnet.com/news/0-1005-200-2167721.html

[7] Bloomberg News, E*Trade hit by class-action suit, CNET News.com, February 9, 1999,

http://news.cnet.com/news/0-1007-200-338547.html

[8] W. LeFebvre, Facing a World Crisis, Invited talk at the 15th USENIX LISA System

Administration Conference, San Diego, California, USA, December 2-7, 2001

[9] British Broadcasting Corporation, Net surge for news sites, September 2001,

http://news.bbc.co.uk/hi/english/sci/tech/newsid_1538000/1538149.stm

[10] R. Lemos, Web worm targets White House, CNET News.com, July 2001,

http://news.com.com/2100-1001-270272.html

[11] The Hyper Text Transfer Protocol Standardization at the W3C, http://www.w3.org/Protocols

[12] The Common Gateway Interface, http://www.w3.org/CGI

[13] J. Nielsen, The Need for Speed, Technical Report, March 1997,

http://www.useit.com/alertbox/9703a.html

[14] R. B. Miller, Response Time in Man-Computer Conversational Transactions, Proceedings of

the FIPS Fall Joint Computer Conference, Vol. 33, 1968, pp. 267-277

[15] J. Nielsen, Usability Engineering, Morgan Kaufmann, San Francisco, 1994


[16] Inktomi Corporation, Web surpasses one billion documents, Press Release, January 2000

http://www.inktomi.com/new/press/2000/billion.html

[17] A. T. Saracevic, Quantifying the Internet, San Francisco Examiner, November 5, 2000,

http://www.sfgate.com

[18] Nielsen/NetRatings, Nielsen/NetRatings finds strong global growth in monthly Internet

sessions and time spent online between April 2001 and April 2002, June 2002,

http://www.nielsen-netratings.com/pr/pr_020610_global.pdf

[19] The WebBench Tool, http://www.etestinglabs.com/benchmarks/webbench/webbench.asp

[20] The Linux-HA Project: http://www.linux-ha.org

[21] The HA-OSCAR project, http://xcr.cenit.latech.edu/ha-oscar

[22] The Linux Operating System, http://www.kernel.org

[23] The Linux Virtual Server Project, http://www.linuxvirtualserver.org

[24] The Apache Project, http://httpd.apache.org

[25] The Tomcat Project, http://Jakarta.apache.org/tomcat

[26] The Carrier Grade Linux Initiative, http://www.osdl.org/lab_activities/carrier_grade_linux

[27] The Open Source Development Lab, http://www.osdl.org

[28] G. Pfister, In Search of Clusters, Second Edition, Prentice Hall PTR, 1998

[29] The Open Group, The UNIX® Operating System: A Robust, Standardized Foundation for

Cluster Architectures, White Paper, June 2001, http://www.unix.org/whitepapers/cluster.htm

[30] I. Haddad, E. Paquin, MOSIX: A Load Balancing Solution for Linux Clusters, Linux Journal,

May 2001

[31] The MOSIX Project, http://www.mosix.org/

[32] The ROCKS Clustering Package, http://www.rocksclusters.org

[33] Open Source Cluster Application Resources, http://oscar.openclustergroup.org


[34] M. J. Brim, T. G. Mattson, and S. L. Scott, OSCAR: Open Source Cluster Application

Resources, Ottawa Linux Symposium 2001, Ottawa, Canada, July 2001

[35] J. Hsieh, T. Leng, and Y.C. Fang, OSCAR: A Turnkey Solution for Cluster Computing, Dell

Power Solutions, Issue 1, 2001, pp. 138-140

[36] Ganglia Distributed Monitoring Tool, http://ganglia.sourceforge.net

[37] TurboLinux Cluster Server, http://www.turbolinux.com/products/middleware/tlcs8.html

[38] OpenSSI Clustering Solution, http://openssi.org/cgi-bin/view?page=openssi.html

[39] OpenGFS clustered file system, http://opengfs.sourceforge.net

[40] Oracle cluster file system, http://oss.oracle.com/projects/ocfs/

[41] IEEE Task Force on Cluster Computing High Availability, http://www.clustercomputing.org

[42] Advanced Research on Internet E-Servers, http://www.linux.ericsson.ca

[43] Trillium Digital Systems, Distributed Fault-Tolerant and High-Availability Systems, White Paper, http://www.trillium.com

[44] Intel Corporation, http://www.intel.com

[45] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC vs. LSNAT: Scalable

cluster-based Web servers, IEEE Cluster Computing, November 2000, pp. 175-185

[46] O. Damani, P. Chung, Y. Huang, C. Kintala, and Y. M. Wang, ONE-IP: Techniques for

Hosting a Service on a Cluster of Machines, IEEE Computer Networks, Volume 29, Numbers 8-

13, September 1997, pp. 1019-1027

[47] G. Hunt, G. S. Goldszmidt, R. P. King, and R. Mukherjee, Network Dispatcher: A

Connection Router for Scalable Internet Services, IEEE Computer Networks, Volume 30,

Numbers 1-7, April 1999, pp. 347-357

[48] RFC 2391, Load Sharing using IP Network Address Translation (LSNAT),

http://www.faqs.org/rfcs/rfc2391.html


[49] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two Approaches for Cluster-Based Scalable Web Servers, IEEE International Conference on Communications, June 2000, pp. 1164-1168

[50] Cisco Local Director, http://www.cisco.com/warp/public/cc/pd/cxsr/400/index.shtml

[51] DNS RFCs, http://www.dns.net/dnsrd/rfc

[52] The Public DNS Service, http://soa.granitecanyon.com

[53] The Microsoft Cluster Server,

http://www.microsoft.com/ntserver/support/faqs/Clusterfaq_deploy.asp

[54] The Microsoft cluster server FAQ,

http://www.microsoft.com/ntserver/support/faqs/Clustering_faq.asp

[55] S. Horman, Linux Virtual Server Tutorial, http://www.ultramonkey.org/papers/lvs_tutorial

[56] M. Williams, EBay, Amazon, Buy.com hit by Internet attacks, Network World, February 9,

2000, http://www.nwfusion.com/news/2000/0209attack.html

[57] G. Sandoval and T. Wolverton, Leading Web Sites Under Attack, News.com, February 9,

2000, http://news.cnet.com/news/0-1007-200-1545348.html

[58] Trillium Digital Systems, Distributed Fault-Tolerant/High-Availability Systems, White Paper, http://www.trillium.com

[59] D. LaLiberte, and A. Braverman, A Protocol for Scalable Group and Public Annotations,

Computer Networks and ISDN Systems, Volume 27, Number 6, January 1995, pp. 911-918

[60] IMEX Research High Availability Report, http://www.imexresearch.com

[61] The Mobile Internet, http://www.ericsson.com/mobileinternet

[62] H. Soliman, IPv6 in 3G Technical Report, Ericsson Research, 2001,

http://www.ericsson.com/technology

[63] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed

Packet Rewriting, Proceedings of IEEE International Performance Conference, Phoenix, Arizona,

USA, February 2000, pp. 24-29


[64] S. N. Budiarto, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, The

1999 International Symposium on Database Applications in Non-Traditional Environments,

Kyoto, Japan, November 1999, pp. 28-30

[65] C. Roe, and S. Gonik, Server-Side Design Principles for Scalable Internet Systems, IEEE

Software, Volume 19, Number 2, March/April 2002, pp. 34-41

[66] D. Norman, The Design of Everyday Things, Double-Day, New York, 1998

[67] D. Dias, W. Kish, R. Mukherjee, and R. Tewari, A Scalable and Highly Available Web

Server, Proceedings of the Forty-First IEEE Computer Society International Conference:

Technologies for the Information Superhighway, Santa Clara, California, USA, February 25-28,

1996, pp. 85-92

[68] E. Casalicchio, and S. Tucci, Static and Dynamic Scheduling Algorithms for Scalable Web

Server Farm, IEEE Network 2001, pp. 368-376

[69] A. S. Z. Belloum, E. C. Kaletas, A. W. Van Halderen, H. Afsarmanesh, and A. J. H.

Peddemors, A Scalable Web Server Architecture, IEEE World Wide Web Journal, Volume 5

Number 1, 2002, pp. 5-23

[70] H. Bryhni, E. Klovning, and O. Kure, A Comparison of Load Balancing Techniques for

Scalable Web Servers, IEEE Network, July/August 2000, pp. 58-64

[71] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for Scalable Internet Server, IEEE International Conference on Cluster Computing, Chemnitz, 2000, pp. 289-296

[72] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed

Packet Rewriting, Proceedings of the 2000 IEEE International Performance, Computing, and

Communications Conference, February 2000, pp. 24 - 29

[73] B. Ramamurthy, LSMAC vs. LSNAT: Scalable Cluster-based Web Servers, Seminar presented

at Rice University, http://www-ece.rice.edu/ece/colloq/00-01/Oct23br-00.html, October 23, 2000

[74] A. N. Murad, and H. Liu, Scalable Web Server Architectures, Technical Report BL0314500-

961216TM, Bell Labs, Lucent Technologies, December 1996


[75] E. D. Katz, M. Butler, and M. McGrath, A Scalable HTTP Server: The NCSA Prototype, Proceedings of the 1st International WWW Conference, Geneva, Switzerland, May 25-27, 1994, pp. 155-164

[76] The NCSA HTTP server, http://hoohoo.ncsa.uiuc.edu

[77] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for

Scalable Internet Server, IEEE Network 2000, pp. 289-296

[78] D. Anderson, T. Yang, V. Holmedahl, and O. Ibarra, SWEB: Towards a Scalable World Wide Web Server on Multicomputers, Proceedings of the 10th International Parallel Processing Symposium, Honolulu, Hawaii, USA, April 15-19, 1996, pp. 850-856

[79] E. Casalicchio, and M. Colajanni, Scalable Web Clusters with Static and Dynamic Contents,

Proceedings of the IEEE Conference on Cluster Computing, Chemnitz, Germany, November 28 –

December 1, 2000, pp. 170-177

[80] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two

Approaches for Cluster-based Scalable Web Servers, Proceedings of the 2000 IEEE International

Conference on Communications, New Orleans, USA, June 18-22, 2000, pp. 1164-1168

[81] X. Zhang, M. Barrientos, B. Chen, and M. Seltzer, HACC: An Architecture for Cluster-Based

Web Servers, Proceedings of the 3rd USENIX Windows NT Symposium, Seattle, Washington,

USA, July 12-15, 1999, pp. 155-164

[82] The HACC Web Server Project, http://www.eecs.harvard.edu/~cxzhang/projects/hacc

[83] The IBM RS/6000, http://www.rs6000.ibm.com

[84] G. D. H. Hunt, G. S. Goldszmidt, R. P. King, and R. Mukherjee, Network Dispatcher: A Connection Router for Scalable Internet Services, Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, April 1998, pp. 347-357

[85] The Alexandria Digital Library, http://alexandria.sdc.ucsb.edu

[86] WebStone Web Server Benchmarking Tool, http://www.mindcraft.com/webstone

[87] F5 Networks Big IP, http://www.f5.com/f5products/bigip


[88] A. Tucker, and A. Gupta, Process Control and Scheduling Issues for Multiprogrammed

Shared-Memory Multiprocessors, Proceedings of the 12th Symposium on Operating Systems

Principles, ACM, Litchfield Park, Arizona, USA, December 1989, pp. 159-166

[89] M. Squillante, and E. Lazowska, Using Processor-Cache Affinity Information in Shared-

Memory Multiprocessors Scheduling, IEEE Transactions on Parallel and Distributed Systems,

Volume 4 Number 2, February 1993, pp. 131-134

[90] Microsoft Developer Network Platform SDK, Performance Data Helper, Microsoft, July

1998

[91] W. B. Ligon III, and R. Ross, Server-Side Scheduling in Cluster Parallel I/O Systems, The

Calculateurs Parallèles Journal, October 2001

[92] W.B. Ligon III, and R. Ross, PVFS: Parallel Virtual File System, Beowulf Cluster

Computing with Linux, MIT Press, November 2001, pp. 391-430

[93] B. Nishio, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, IEEE

Network 2000, pp. 443-450

[94] A. Mourad, and H. Liu, Scalable Web Server Architectures, Proceedings of IEEE

International Symposium on Computers and Communications, Alexandria, Egypt, July 1997, pp.

12-16

[95] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking, Models and Tools for

Distributed Web-Server System, Proceedings of the Performance 2002, Rome, Italy, July 24-26,

2002, pp. 208-235

[96] I. Haddad, An Overview of the HTTP, http://www.cs.concordia.ca/~i_haddad/phd.html

[97] P. Wainwright, Professional Apache, Wrox Press Inc, 1999

[98] B. Laurie, P. Laurie, and R. Denn, Apache: The Definitive Guide, O'Reilly & Associates,

1999

[99] I. Haddad, Apache User Authentication, Linux Journal, October 2000

[100] I. Haddad, Apache 2.0 Internals, Linux Journal, August 2001

[101] Netcraft Web Server Survey, http://Netcraft.com/survey


[102] The GNU General Public License, http://www.gnu.org/copyleft/gpl.html

[103] I. Haddad, W. Hassan, and L. Tao, XWPT: An X-based Web Servers Performance Tool, the

18th International Conference on Applied Informatics, Innsbruck, Austria, February 2000, pp. 50-

55

[104] The Software RAID How-to, http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html

[105] The Parallel Virtual File System, http://www.parl.clemson.edu/pvfs

[106] A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp, Noncontiguous I/O through

PVFS, Proceedings of the 2002 IEEE International Conference on Cluster Computing, September

23-26, 2002, Chicago, Illinois, USA, pp. 405-414

[107] I. Haddad, and M. Pourzandi, Open Source Web Servers Performance on Carrier-Class

Linux Clusters, Linux Journal, April 2001, pp. 84-90

[108] I. Haddad, PVFS: A Parallel Virtual File System for Linux Clusters, Linux Journal,

December 2000

[109] P. Barford, and M. Crovella, Generating Representative WebWorkloads for Network and

Server Performance Evaluation, Proceedings of the ACM Sigmetrics Conference, Madison,

Wisconsin, USA, June 1998, pp. 151-160

[110] The S-Client, http://www.cs.rice.edu/CS/Systems/Web-measurement/paper/node9.html

[111] WebStone Web Server Benchmarking Tool, http://www.mindcraft.com/webstone

[112] SPEC Web Server Benchmarking Tool, http://www.spec.org/osg/web99

[113] TCP Westwood, http://www.cs.ucla.edu/NRL/hpi/tcpw/

[114] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking Models and Tools for

Distributed Web-Server Systems, Proceedings of Performance 2002, Rome, Italy, July 24-26,

2002, pp. 208-235

[115] E. Marcus, and H. Sten, BluePrints for High Availability: Designing Resilient Distributed

Systems, Wiley, 2000

[116] S. Horman, Connection Synchronization, http://www.ultramonkey.org/papers/conn_sync


[117] I. Haddad, C. Leangsuksun, R. Libby, and S. Scott, HA-OSCAR: Towards Non-stop Services

in High End and Grid computing Environments, Poster Presentation at the Fifth Los Alamos

Computer Science Institute Symposium, New Mexico, USA, October 12-14, 2004

[118] The Open Cluster Group, How to Install an OSCAR Cluster, Technical Report, November 3,

2005, http://oscar.openclustergroup.org/public/docs/oscar4.2/oscar4.2-install.pdf

[119] A. S. Tanenbaum, and M. van Steen, Distributed Systems: Principles and Paradigms, Prentice Hall, July 2001, pp. 371-375

[120] The rsync utility, http://samba.anu.edu.au/rsync

[121] The rsync algorithm, http://dp.samba.org/rsync/tech_report/node2.html

[122] The DRBD Tool, http://www.drbd.org

[123] The Router Advertisement Daemon Project, http://v6web.litech.org/radvd/

[124] RFC 2461, Neighbor Discovery for IP Version 6, http://www.faqs.org/rfcs/rfc2461.html

[125] The Linux-HA project, http://www.linux-ha.org

[126] L. Marowsky-Brée, A New Cluster Resource Manager for Heartbeat, UKUUG LISA/Winter

Conference High Availability and Reliability, Bournemouth, UK, February 2004

[127] A. L. Robertson, The Evolution of the Linux-HA Project, UKUUG LISA/Winter Conference

High-Availability and Reliability, Bournemouth, UK, February 25-26, 2004

[128] A. L. Robertson, Linux-HA Heartbeat Design, Proceedings of the 4th International Linux

Showcase and Conference, Atlanta, October 10-14, 2000

[129] S. Horman, Connection Synchronisation (TCP Fail-Over), Technical Paper, November 2003

[130] The BogoMIPS Representation, http://www.tldp.org/HOWTO/BogoMips/x78.html

[131] D. Gordon, and I. Haddad, Apache talking IPv6, Linux Journal, January 2003

[132] I. Haddad, The Future of IP Is Now, LinuxWorld Magazine, February 2004

[133] I. Haddad, IPv6 on Linux: Ongoing Development Effort and Tutorial, Linux User and

Developer, June 2003


[134] D. Gamerman, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference,

Chapman & Hall/CRC Press, Boca Raton, Fl., USA, 1997

[135] G. Ciardo, and P. Darondeau, Applications and Theory of Petri Nets 2005, 26th International

Conference, ICATPN 2005, Miami, USA, June 20-25, 2005

[136] G Ciardo, J. Muppala, and K. Trivedi, SPNP: Stochastic Petri Net Package, Proceedings of

the International Workshop on Petri Nets and Performance Models, IEEE Computer Society

Press, Los Alamitos, Ca., USA, December 1989, pp. 142-150

[137] H. Choi, Markov Regenerative Stochastic Petri Nets, Computer Performance Evaluation,

Vienna 1994, pp. 337-357

[138] C. Hirel, R. Sahner, X. Zang, and K. S. Trivedi, Reliability and Performability Modeling

using SHARPE 2000, Computer Performance Evaluation/TOOLS 2000, Schaumburg, US, March

2000, pp. 345-349

[139] C. Hirel, B. Tuffin, and K. S. Trivedi, SPNP: Stochastic Petri Nets Version 6.0, Computer Performance Evaluation/TOOLS 2000, Schaumburg, US, March 2000, pp. 354-357

[140] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Availability Prediction and Modeling

of High Availability OSCAR Cluster, IEEE International Conference on Cluster Computing, Hong

Kong, China, December 2-4, 2003, pp. 227-230

[141] C. Leangsuksun, Highly Reliable Linux HPC Clusters: Self-awareness Approach,

Proceedings of the International Symposium on Parallel Architectures, Algorithms, and Networks

(I-SPAN), Hong Kong, China, May 2004

[142] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Dependability Prediction of High

Availability OSCAR Cluster Server, The 2003 International Conference on Parallel and

Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA, 2003, pp. 23-26

[143] C. Leangsuksun, L. Shen, T. Lui, and S. L. Scott, Achieving High Availability and

Performance Computing with an HA-OSCAR Cluster, Future Generation Computer System,

Volume 21, Number 1, January 2005, pp. 597-606

[144] The Open Cluster Group, http://www.openclustergroup.org


[145] C. Leangsuksun, L. Shen, H. Song, S. Scott, and I. Haddad, The Modeling and Dependability

Analysis of High Availability OSCAR Cluster System, The 17th Annual International Symposium

on High Performance Computing Systems and Applications, Sherbrooke, Quebec, Canada, May

11-14, 2003

[146] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. Scott, Highly Reliable Linux

HPC Clusters: Self-awareness Approach, Proceedings of the 2nd International Symposium on

Parallel and Distributed Processing and Applications, Hong Kong, China, December 13-15, 2004,

pp. 217-222

[147] The SystemImager utility, http://www.systemimager.org

[148] I. Haddad, and C. Leangsuksun, Building Highly Available HPC Clusters with HA-OSCAR,

Tutorial Presentation, the 6th LCI International Conference on Clusters: The HPC Revolution

2005, Chapel Hill, NC, USA 2005, April 2005

[149] I. Haddad, and G. Butler, Experimental Studies of Scalability in Clustered Web Systems,

Proceedings of the International Parallel and Distributed Processing Symposium 2004, Santa Fe,

New Mexico, USA, April 2004

[150] I. Haddad, Keeping up with Carrier Grade, Linux Journal, August 2004

[151] I. Haddad, Carrier Grade Server Requirements, Linux User and Developer, August 2004

[152] I. Haddad, Moving Towards Open Platforms, LinuxWorld Magazine, May 2004

[153] I. Haddad, Linux Gains Momentum in Telecom, LinuxWorld Magazine, May 2004

[154] I. Haddad, OSDL Carrier Grade Linux, O'Reilly Network, April 2004

[155] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003,

Klagenfurt, Austria, August 2003

[156] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Feasibility Study and Early

Experimental Results Toward Cluster Survivability, Proceedings of Cluster Computing and Grid

2005, Cardiff, UK, May 9-12, 2005

[157] D. Gordon, and I. Haddad, Building an IPv6 DNS Server Node, Linux Journal, October 2003


[158] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Experimental Results in

Survivability of Secure Clusters, Proceedings of the 6th International Conference on Linux

Clusters, Chapel Hill, NC, USA 2005

[159] I. Haddad, IPv6: The Essentials You Must Know, Linux User and Developer, May 2003

[160] I. Haddad, Using Freenet6 Service to Connect to the IPv6 Internet, Linux User and

Developer, July 2003

[161] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. L. Scott, High-Availability and

Performance Clusters: Staging Strategy, Self-Healing Mechanisms, and Availability Analysis,

Proceedings of the IEEE Cluster Conference 2004, San Diego, USA, September 20-23, 2004

[162] I. Haddad, Streaming Video on Linux over IPv6, Linux User and Developer, August 2003

[163] I. Haddad, Voice over IPv6 on Linux, Linux User and Developer, September 2003

[164] I. Haddad, NAT-PT: IPv4/IPv6 and IPv6/IPv4 Address Translation, Linux User and

Developer, October 2003

[165] I. Haddad, Design and Implementation of HA Linux Clusters, IEEE Cluster 2001, Newport

Beach, USA, October 8-11, 2001

[166] I. Haddad, Designing Large Scale Benchmarking Environments, ACM Sigmetrics 2002,

Marina Del Rey, USA, June 2002

[167] I. Haddad, Supporting IPv6 on Linux Clusters, IEEE Cluster 2002, Chicago, USA, September

2002

[168] I. Haddad, IPv6: Characteristics and Ongoing Research, Internetworking 2003, San Jose,

USA, June 2003

[169] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003,

Klagenfurt, Austria, August 2003

[170] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004,

Toronto, Canada, April 2004

[171] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004,

Setúbal, Portugal, August 2004


[172] I. Haddad, and C. Leangsuksun, Building HA/HPC Clusters with HA-OSCAR, Tutorial

Presentation at the IEEE Cluster Conference, San Diego, USA, September 2004

[173] I. Haddad, C. Leangsuksun, and S. Scott, Towards Highly Available, Scalable, and Secure

HPC Clusters with HA-OSCAR, the 6th International Conference on Linux Clusters, Chapel Hill,

NC, USA, April 2005

[174] I. Haddad, and S. Scott, HA Linux Clusters: Towards Platforms Providing Continuous

Service, Linux Symposium, Ottawa, Canada, July 2005

[175] I. Haddad, and C. Leangsuksun, HA-OSCAR: Highly Available Linux Cluster at your

Fingertips, IEEE Cluster 2005, Boston, USA, September 2005

[176] I. Haddad, HA Linux Clusters, Open Cluster Group 2001, Illinois, USA, March 2001

[177] I. Haddad, Combining HA and HPC, Open Cluster Group 2002, Montréal, Canada, June 2002

[178] I. Haddad, Towards Carrier Grade Linux Platforms, USENIX 2004, Boston, USA, June 2004

[179] I. Haddad, Towards Unified Clustering Infrastructure, Linux World Expo and Conference,

San Francisco, USA, August 2005

[180] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004,

Toronto, Canada, April 2004

[181] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004,

Setúbal, Portugal, August 2004

[182] C. Leangsuksun, A Failure Predictive and Policy-Based High Availability Strategy for Linux

High Performance Computing Cluster, The 5th LCI International Conference on Linux Clusters:

The HPC Revolution 2004, Austin, USA, May 18-20, 2004

[183] TechTarget, Dictionary for Internet and Computer Technologies, http://whatis.techtarget.com


Glossary

The definitions of the terms appearing in this glossary are referenced from [183].

3G 3G is short for third-generation wireless, and refers to near-future developments in

personal and business wireless technology, especially mobile communications.

AAA Authentication, authorization, and accounting (AAA) is a term for a framework for

intelligently controlling access to computer resources, enforcing policies, auditing usage, and

providing the information necessary to bill for services. These combined processes are considered

important for effective network management and security. Authentication, authorization, and

accounting services are often provided by a dedicated AAA server, a program that performs these

functions. A current standard by which network access servers interface with the AAA server is the

Remote Authentication Dial-In User Service (RADIUS).

Active Node A node that is providing (or capable of providing) a service.

Active/active A redundancy configuration where all servers in the cluster run their own applications but are also ready to take over for a failed server if needed.

Active/standby A redundancy configuration where one server is running the application while

another server in the cluster is idle but ready to take over if needed.

ASP Application service provider (ASP) is a company that offers individuals or

enterprises access over the Internet to applications and related services that would otherwise have to

be located in their own personal or enterprise computers.

Availability Availability is the amount of time that a system or service is provided, relative to the total amount of time (the time the service is provided plus the time it is not). Availability is commonly expressed as a percentage.


C++ C++ is an object-oriented programming language.

C C is a structured, procedural programming language that has been widely used for both

operating systems and applications and that has had a wide following in the academic community.

CGI The common gateway interface (CGI) is a standard way for a Web server to pass a

Web user's request to an application program and to receive data back to forward to the user.

Client/server Client/server describes the relationship between two computer programs in which one program, the client, makes a service request from another program, the server, which fulfills the request.

Cluster A cluster is a collection of cluster nodes that may change dynamically as nodes join or leave the cluster.

COTS Commercial off-the-shelf describes ready-made products that can easily be obtained.

Cluster Two or more computer nodes in a system used as a single computing entity to

provide a service or run an application for the purpose of high availability, scalability, and

distribution of tasks.

CMS Cluster Management System (CMS) is a management layer that allows the whole

cluster to be managed as a single entity.

Daemon A program that runs continuously in the background, until activated by a

particular event. A daemon can constantly query for requests or await direct action from a user or

other process.

DRAM Dynamic random access memory (DRAM) is the most common random access

memory (RAM) for personal computers and workstations.


DRBD Distributed Replicated Block Device

DNS The domain name system (DNS) is the way that Internet domain names are located and translated into IP addresses. A domain name is a meaningful and easy-to-remember "handle" for an Internet address.

DIMM A DIMM (dual in-line memory module) is a double SIMM (single in-line memory

module). Like a SIMM, it is a module containing one or several random access memory (RAM) chips

on a small circuit board with pins that connect it to the computer motherboard.

EIDE Enhanced (sometimes "Expanded") IDE is a standard electronic interface between a

computer and its mass storage drives. EIDE's enhancements to Integrated Drive Electronics (IDE)

make it possible to address a hard disk larger than 528 Mbytes. EIDE also provides faster access to

the hard drive, support for Direct Memory Access (DMA), and support for additional drives.

Failover The ability to automatically switch a service or capability to a redundant node,

system, or network upon the failure or abnormal termination of the currently active node, system, or

network.

Failure The inability of a system or system component to perform a required function within

specified limits. A failure may be produced when a fault is encountered. Examples of failures include

invalid data being provided, slow response time, and the inability for a service to take a request.

Causes of failure can be hardware, firmware, software, network, or anything else that interrupts the

service.

Five nines is measured as 99.999% service availability. It is equivalent to approximately 5 minutes a year of total planned and unplanned downtime of the service provided by the system.
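As a quick check of this figure (assuming a 365-day year), the allowed yearly downtime follows directly from the definition:

\[ (1 - 0.99999) \times 365 \times 24 \times 60 \ \text{minutes} = 0.00001 \times 525{,}600 \ \text{minutes} \approx 5.26 \ \text{minutes per year} \]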

FTP File Transfer Protocol (FTP) is a standard Internet protocol that defines one way of

exchanging files between computers on the Internet.


Gateways Gateways are bridges between two different technologies or administration domains.

A media gateway performs the critical function of converting voice messages from a native

telecommunications time-division-multiplexed network, to an Internet protocol packet-switched

network.

High Availability The state of a system having a very high ratio of service uptime compared to service downtime. Highly available systems are typically rated in terms of the number of nines, such as five nines or six nines.

HLR The Home Location Register (HLR) is the main database of permanent subscriber

information for a mobile network.

HTML Hypertext Markup Language (HTML) is the set of markup symbols or codes inserted

in a file intended for display on a World Wide Web browser page. The markup tells the web browser

how to display a web page's words and images for the user.

HTTP Hypertext Transfer Protocol (HTTP) is the set of rules for exchanging files (text,

graphic images, sound, video, and other multimedia files) on the World Wide Web.

HTTP-NG Hypertext Transfer Protocol – Next Generation

Internet The Internet is a worldwide system of computer networks - a network of

networks in which users at any one computer can, if they have permission, get information from any

other computer (and sometimes talk directly to users at other computers). It was conceived by the

Advanced Research Projects Agency (ARPA) of the U.S. government in 1969 and was first known as

the ARPANET. The Internet is a public, cooperative, and self-sustaining facility accessible to

hundreds of millions of people worldwide. Physically, the Internet uses a portion of the total

resources of the currently existing public telecommunication networks.

IP The Internet Protocol (IP) is the method or protocol by which data is sent from one

computer to another on the Internet.


iptables is a Linux command used to set up, maintain, and inspect the tables of IP packet filter

rules in the Linux kernel. There are several different tables, which may be defined, and each table

contains a number of built-in chains, and may contain user-defined chains. Each chain is a list of rules

which can match a set of packets: each rule specifies what to do with a packet which matches. This is

called a `target', which may be a jump to a user-defined chain in the same table.

IPv6 Internet Protocol Version 6 (IPv6) is the latest version of the Internet Protocol. IPv6

is a set of specifications from the Internet Engineering Task Force (IETF) that was designed as an

evolutionary set of improvements to the current IP Version 4.

ISDN Integrated Services Digital Network (ISDN) is a set of standards for digital

transmission over ordinary telephone copper wire as well as over other media.

I/O I/O describes any operation, program, or device that transfers data to or from a

computer.

ISP Internet service provider (ISP) is a company that provides individuals and other

companies access to the Internet and other related services such as web site building and virtual

hosting.

ISV Independent Software Vendors

LAN A local area network (LAN) is a group of computers and associated devices that

share a common communications line or wireless link and typically share the resources of a single

processor or server within a small geographic area.

Management Server Management servers handle traditional network management operations, as

well as service and customer management. These servers provide services such as Home Location

Register and Visitor Location Register (for wireless networks) or customer information, such as

personal preferences including features the customer is authorized to use.


MPP Massively parallel processing (MPP) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. Typically, MPP processors communicate using some messaging interface.

MP3 MP3 (MPEG-1 Audio Layer-3) is a standard technology and format for compressing a sound sequence into a very small file (about one-twelfth the size of the original file) while preserving the original level of sound quality when it is played.

MSS Maximum Segment Size (MSS)

MTTF Mean Time To Failure (MTTF) is the average interval of time during which the system can provide service without failure.

MTTR Mean Time To Repair (MTTR) is the average interval of time it takes to resume service after a failure has been experienced.

NAS Network-attached storage (NAS) is hard disk storage that is set up with its own

network address rather than being attached to the department computer that is serving applications to

a network's workstation users.

NAT NAT (Network Address Translation) is the translation of an Internet Protocol address

(IP address) used within one network to a different IP address known within another network. One

network is designated the inside network and the other is the outside.

Network A connection of [nodes] which facilitates [communication] among them. Usually, the

connected nodes in a network use a well defined [network protocol] to communicate with each other.

Network Protocols Rules for determining the format and transmission of data. Examples of

network protocols include TCP/IP, and UDP.


NIC A network interface card (NIC) is a computer circuit board or card that is installed in

a computer so that it can be connected to a network.

Node A single computer unit, in a [network], that runs with one instance of a real or virtual

operating system.

NTP Network Time Protocol (NTP) is a protocol that is used to synchronize computer

clock times in a network of computers.

OSI The Open System Interconnection (OSI) model defines a networking framework for implementing protocols in seven layers. Control is passed from one layer to the next, starting at the application layer in one station, proceeding to the bottom layer, over the channel to the next station, and back up the hierarchy.

Performance The efficiency of a [system] while performing tasks. Performance characteristics

include total throughput of an operation and its impact to a [system]. The combination of these

characteristics determines the total number of activities that can be accomplished over a given amount

of time.

Perl Perl is a script programming language that is similar in syntax to the C language and

that includes a number of popular Unix facilities such as SED, awk, and tr.

PCI PCI (Peripheral Component Interconnect) is an interconnection system between a microprocessor and attached devices in which expansion slots are spaced closely for high-speed operation. Using PCI, a computer can support new PCI cards while continuing to support Industry Standard Architecture (ISA) expansion cards, an older standard.

PDA Personal digital assistant (PDA) is a term for any small mobile hand-held device that

provides computing and information storage and retrieval capabilities for personal or business use,

often for keeping schedule calendars and address book information handy.


Proxy Server A computer network service that allows clients to make indirect network connections

to other network services. A client connects to the proxy server, and then requests a connection, file,

or other resource available on a different server. The proxy provides the resource either by connecting

to the specified server or by serving it from a cache. In some cases, the proxy may alter the client's

request or the server's response for various purposes.

RAID Redundant array of independent disks (RAID) is a way of storing the same data in

different places (thus, redundantly) on multiple hard disks.

RAMDISK A RamDisk is a portion of memory that is allocated to be used as a hard disk partition.

RAS Reliability, availability, and serviceability

Recovery To return a failing component, node, or system to a working state. A failing component can be a hardware or a software component of a node or network. Recovery can also be initiated to work around a fault that has been detected, ultimately restoring the service.

RTT Round-Trip Time (RTT) is the time required for a network communication to travel from the source to the destination and back. RTT is used by routing algorithms to aid in calculating optimal routes.

SAN Storage Area Network (SAN) is a high-speed special-purpose network (or sub-

network) that interconnects different kinds of data storage devices with associated data servers on

behalf of a larger network of users.

SCP A Service Control Point (SCP) server is an entity in the intelligent network that implements the service control function, that is, operations that affect the recording, processing, transmission, or interpretation of data.

SCSI The Small Computer System Interface (SCSI) is a set of ANSI standard electronic interfaces that allow personal computers to communicate with peripheral hardware such as disk


drives, tape drives, CD-ROM drives, printers, and scanners faster and more flexibly than previous

interfaces.

Service A set of functions provided by a computer system. Examples of Telco services

include media gateway, signal, or soft switch types of applications. Some general examples of

services include web based or database transaction types of applications.

Session A series of consecutive page requests to the web server from the same user.

Signaling Servers Signaling servers handle call control, session control, and radio resource control. A signaling server handles the routing and maintains the status of calls over the network. It takes the requests of user agents who want to connect to other user agents and routes them to the appropriate signaling server.

SLA Service Level Agreement (SLA) is a contract between a network service provider and

a customer that specifies, usually in measurable terms, what services the network service provider

will furnish.

SPOF Single point of failure (SPOF) - Any component or communication path within a

computer system that would result in an interruption of the service if it failed.

SMP Symmetric Multiprocessors (SMP) is the processing of programs by multiple

processors that share a common operating system and memory. The processors share memory and the

I/O bus or data path. A single copy of the operating system is in charge of all the processors.

SSI Single System Image (SSI) is a form of distributed computing in which by using a

common interface multiple networks, distributed databases or servers appear to the user as one

system. In SSI systems, all nodes share the operating system environment in the system.

Six nines is measured as 99.9999% [service] [availability]. It is equivalent to approximately 30 seconds a year of total planned and unplanned downtime of the [service] provided by the [system].


Standby Not currently providing service but prepared to take over the active state.

System A computer system that consists of one computer [node] or many nodes connected

via a computer network mechanism.

Switch-over The term switch-over is used to designate circumstances where the cluster moves the

active state of a particular component/node from one component/node to another, after the failure of

the active component/node. Switch-over operations are usually the consequence of administrative

operations or escalation of recovery procedures.

Tcl Tcl is an interpreted script language developed by Dr. John Ousterhout at the

University of California, Berkeley, and now developed and maintained by Sun Laboratories.

TCP TCP (Transmission Control Protocol) is a set of rules (protocol) used with the

Internet Protocol (IP) to send data in the form of message units between computers over the Internet.

While IP takes care of handling the actual delivery of the data, TCP takes care of keeping track of the

individual units of data (called packets) that a message is divided into for efficient routing through the

Internet.

TFTP Trivial File Transfer Protocol (TFTP) is an Internet software utility for transferring

files that is simpler to use than the File Transfer Protocol (FTP) but less capable. It is used where user

authentication and directory visibility are not required.

TTL Time-to-live (TTL) is a value in an Internet Protocol (IP) packet that tells a network

router whether the packet has been in the network too long and should be discarded.

QoS Quality of Service (QoS) is the idea that transmission rates, error rates, and other

characteristics can be measured, improved, and, to some extent, guaranteed in advance.

URI To paraphrase the World Wide Web Consortium, Internet space is inhabited by many

points of content. A URI (Uniform Resource Identifier; pronounced YEW-AHR-EYE) is the way you


identify any of those points of content, whether it be a page of text, a video or sound clip, a still or

animated image, or a program. The most common form of URI is the web page address, which is a

particular form or subset of URI called a Uniform Resource Locator (URL).

USB USB (Universal Serial Bus) is a plug-and-play interface between a computer and

add-on devices (such as audio players, joysticks, keyboards, telephones, scanners, and printers). With

USB, a new device can be added to your computer without having to add an adapter card or even

having to turn the computer off.

User An external entity that acquires service from a computer system. It can be a human

being, an external device, or another computer system.

Web Service Web services are loosely coupled software components delivered over Internet

standard technologies. A web service can also be defined as a self-contained, modular application that

can be described, published, located, and invoked over the web.
