An Evaluation of Caching Strategies for Clustered Web Servers

Thomas Larkin

A dissertation submitted to the University of Dublin, in partial fulfilment of the requirements for the degree of Master of Science in Computer Science

September 2001



Declaration

I declare that the work described in this dissertation is, except where otherwise stated, entirely my own work and has not been submitted as an exercise for a degree at this or any other university.

___________________ Thomas Larkin 14th September 2001

Permission to lend and/or copy

I agree that Trinity College Library may lend or copy this dissertation upon request.

___________________ Thomas Larkin

14th September 2001


Abstract

The exponential growth of the Internet and the increasing demands put upon Web servers have created the need for scalable clustered Web servers. In this paper we attempt to analyse the important issues facing the implementor of a clustered Web server: the problems of TCP handoff, dynamically generated Web documents, and how a Distributed Shared Memory paradigm might be used to optimise a clustered Web server. We analyse several clustered Web server architectures, and use traces from Web servers in a simulation of several of these architectures to further examine their caching strategies. Finally, some recommendations are put forth for further research into areas that have been identified as critical to clustered Web servers.


Acknowledgements

I would like to thank my supervisor Christian Jensen for his support and guidance

throughout the course of this project. I would also like to thank all my friends, family

and classmates for their friendship, support and help during the year.


Contents

1 INTRODUCTION ............................................................................................7

2 STATE OF THE ART OVERVIEW ...............................................................7

2.1 WEB SERVER CACHING STRATEGIES .........................................7
2.1.1 Least Recently Used (LRU) ...........................................9
2.1.2 LRU-Size ...........................................................10
2.1.3 LRU-Min ............................................................10
2.1.4 Least Frequently Used (LFU) ........................................10
2.1.5 Static Caching .....................................................11
2.1.6 Greedy Dual-Size (GDS) .............................................11
2.2 CACHING IN A CLUSTERED WEB SERVER ....................................11
2.3 CLUSTERED WEB SERVERS ................................................13
2.3.1 Layer 4/2 Clustering ...............................................14
2.3.2 Layer 4/3 Clustering ...............................................15
2.3.3 Layer 7 Clustering .................................................16
2.4 TCP HANDOFF PROTOCOL .................................................17
2.4.1 TCP Handoff Protocol Problems ......................................18
2.5 CLUSTERED WEB SERVER ARCHITECTURES ...................................21
2.5.1 Weighted Round-Robin (WRR) .........................................21
2.5.2 Harvard Array of Clustered Computers (HACC) ........................23
2.5.3 Client-Aware Dispatching Algorithm (CAP) ...........................26
2.5.4 Locality-Aware Request Distribution (LARD) .........................28
2.5.5 Parallel Pull-Based LRU (PPBL) .....................................32
2.6 DISTRIBUTED SHARED MEMORY BASED CLUSTERED WEB SERVERS ................36

3 SIMULATION DESIGN AND REQUIREMENTS .....................................37

4 SIMULATION IMPLEMENTATION ..............................................38
4.1 RR SIMULATION ........................................................40
4.2 LARD SIMULATION ......................................................41
4.3 PPBL SIMULATION ......................................................43

5 WEB TRACES, EXAMINATION OF INPUT AND STATISTICS ........................45

6 EVALUATION OF RESULTS ..................................................47

7 CONCLUSIONS ............................................................49

8 BIBLIOGRAPHY ...........................................................52


List of Figures

Figure 1 Traffic flow in a L4/2 cluster ..................................15
Figure 2 Traffic flow in a L4/3 cluster ..................................16
Figure 3 TCP connection handoff ..........................................18
Figure 4 A distributed distributor design ................................20
Figure 5 Locality-aware request distribution .............................29
Figure 6 Basic simulation design .........................................40
Figure 7 Round robin architecture ........................................41
Figure 8 LARD architecture ...............................................42
Figure 9 Parallel pull-based LRU architecture ............................43
Figure 10 EPA trace topology .............................................46
Figure 11 SDSC trace topology ............................................46
Figure 12 SDSC % cache misses ............................................47
Figure 13 EPA % cache misses .............................................48
Figure 14 Load balancing comparisons .....................................49


1 Introduction

In this paper, caching strategies are evaluated in the context of a clustered Web server. Of particular interest is how a clustered Web server can take advantage of the benefits of Distributed Shared Memory (DSM).

Cache replacement algorithms will be explored first of all, and then the

examination will expand to several clustered Web server architectures.

The architecture of a clustered Web server will be laid out, and the pros and cons of

the different approaches will be weighed up. The clustered Web server

architectures that will be examined in detail are the Harvard Array of Clustered

Computers (HACC), Locality-Aware Request Distribution (LARD), the Client-Aware Policy (CAP), Parallel Pull-Based LRU (PPBL) and Weighted Round-Robin (WRR).

A simulation of LARD, PPBL and RR will be reviewed, and the results of this simulation will allow us to gain a better understanding of the key areas that need to be focused upon in designing a clustered Web server.

2 State of the Art Overview

2.1 Web Server Caching Strategies

Caching has proved to be an easy and inexpensive way to make the WWW work

faster. Web objects such as HTML documents, images, and multimedia clips can

be cached in order to improve the response time observed by the Web “surfer”.

Caches can be used at three different levels of a typical journey of a Web document

from Web server to client: at the Web server itself (in the form of a main-memory

cache), somewhere along the server-to-client path (in a proxy cache) or at the client

(a built-in browser cache).

The main-memory cache is of utmost importance to the overall performance of a

Web server. It is naturally this area of the Web caching domain that interests us,

and it is this that we will now proceed to investigate further. The performance of


the Web server can be increased if requested documents are cached, as this will

reduce response time and also decrease server load.

Many existing Web servers (e.g. NCSA and Apache) rely on the underlying file

system buffer of the operating system to cache recently accessed documents. This

means that a request for a file will involve the operating system buffer copying the

contents of the file to the Web server’s buffer. This means that we are dependent on the file system’s caching, which may not be the best option for Web caching. A

better solution is to use a dedicated document cache with a more efficient caching

policy (with regard to the particular needs of the Web) than that used by the file

system.

Data caching in databases and file systems has been the subject of a huge amount

of research. However the strategies employed for Web server caching must

necessarily be different to those employed in database and file system caching, as

the traditional cache policies that perform well in database and file systems do not

perform well for the web.

This is due to several key differences between Web caching and traditional paging. First, Web caching is variable-sized caching. Web documents vary

dramatically in size depending on the information they carry (text, image, video,

etc). Web servers always read and cache entire files, whereas file systems and

databases deal with fixed blocks of data. Second, caching is not obligatory in Web

servers. A file system or database always places requested data blocks in the cache,

but a Web server cache manager will choose not to add a document to its cache if

this will be detrimental to the performance of the cache. Thirdly, there are no

correlated re-reads or sequential scans in Web workloads. A Web user will not re-read the same document within a short space of time, at least from the perspective of the Web server: all such re-reads will be handled by the Web browser.

These points illustrate the gap between traditional caching and the requirements of Web caching. We can see that a single request for a Web

document is not a good reason for the cache manager to include it in the cache, as


one access does not mean that this document will be accessed again in the near

future. A good cache manager needs to take these factors into account. It should

also be noted that Web servers typically deal with fewer data items than file

systems and databases do. The number of documents stored in a Web server rarely

exceeds 100,000, whereas file systems and databases deal in millions of blocks.

This means that the Web cache manager can afford to keep statistics on its

documents, to better evaluate which documents should be in the cache in order to

improve the overall performance of the Web server.

It is also useful to note the difference between a primary Web server and a proxy

server. A proxy server can be potentially drawing from the entire set of Web

documents available on the Web, whereas a primary Web server is dealing with

much more restricted set of documents. A primary Web server will also have much more skewed access patterns: there will be a more limited set of popular

documents in a primary Web server. It should also be noted that the primary Web

server stores its document cache in main memory, and so the caching algorithm

employed cannot be too complex, or else the overhead of the algorithm and the amount of main memory space it consumes will outweigh the usefulness of a

main memory Web document cache.

The performance metric for Web caches can be either a byte hit rate or a document

hit rate. The byte hit rate is the ratio of the number of bytes of data fetched from the

cache to the total number of bytes requested. The document hit rate is the fraction

of requests satisfied from the cache. In file system and database caches, these two

metrics are one and the same, as all data items have the same size. The byte hit

rate is important for proxy caches, as it reflects network bandwidth savings. For a

primary web server, the document hit rate is of more importance, as it directly

affects the response time observed by clients. We will now examine some of the

more popular Web caching policies.
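As a concrete illustration of the two metrics just defined, the short sketch below computes both over a request log; the log layout, a (size in bytes, served-from-cache flag) pair per request, is an assumption made purely for illustration.

    def hit_rates(requests):
        # `requests` is an iterable of (size_in_bytes, served_from_cache)
        # pairs -- a hypothetical log layout assumed for illustration.
        docs = doc_hits = total_bytes = hit_bytes = 0
        for size, from_cache in requests:
            docs += 1
            total_bytes += size
            if from_cache:
                doc_hits += 1
                hit_bytes += size
        # Document hit rate: fraction of requests satisfied from the cache.
        # Byte hit rate: fraction of requested bytes served from the cache.
        return doc_hits / docs, hit_bytes / total_bytes

    doc_rate, byte_rate = hit_rates([(512, True), (1 << 20, False), (2048, True)])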

2.1.1 Least Recently Used (LRU)

This policy is inherited directly from file system caching. This policy acts upon the

premise that recently accessed objects are likely to be accessed again. Thus when a


new request is received, the least recently used document in the cache is removed.

While this policy works admirably for fixed-size paging, it has obvious failings when applied to the Web caching domain. This policy ignores the very pertinent issue of document size.

2.1.2 LRU-Size

This policy [ASAWF95] uses a combination of parameters to select which

document to evict from a web cache. Here the largest documents are evicted from

the cache first. When documents are of the same size, the least recently used of

those documents is removed. This approach does have the drawback of considering

one parameter first, and then the second. This means that a document that narrowly

“loses” on the first parameter will always be evicted, even if its second parameter is

much more in its favour. Using a logarithm of the first sorting key means that the

second sorting key is used more often. A variation on this policy which does this is

called LRU-(log2 Size) [ASAWF95].
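A minimal sketch of the two eviction orderings just described is given below; the (size, last access time) entry layout is an assumption for illustration, not code from [ASAWF95].

    import math

    # Each cache entry is a (size_bytes, last_access_time) pair. LRU-Size
    # evicts the largest document, breaking ties by least recent use.
    def lru_size_victim(entries):
        return max(entries, key=lambda e: (e[0], -e[1]))

    # LRU-(log2 Size) buckets sizes by floor(log2(size)), so ties are far
    # more common and the recency key decides more evictions.
    def lru_log2_size_victim(entries):
        return max(entries, key=lambda e: (int(math.log2(e[0])), -e[1]))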

2.1.3 LRU-Min

This policy [ASAWF95] takes the size of the new document into account when

evicting documents in the cache. This policy first checks to see if there are any

documents of size greater than or equal to the incoming document. If there are, one

of these is evicted in LRU order. Otherwise, all documents half the size or greater are considered, and one or more of them is evicted. Then all documents a quarter of the size or greater are considered, and so on, until there is enough space for the incoming document.

This policy yields better cache performance, but is more CPU intensive than the

simpler LRU-Size.
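The eviction loop just described might be sketched as follows, assuming the cache maps each document name to its size and last access time; this is an illustration, not the implementation of [ASAWF95].

    def lru_min_evict(cache, incoming_size, free_space):
        # `cache` maps document name -> (size, last_access_time); this
        # layout is assumed for illustration. Returns the evicted names.
        evicted = []
        threshold = incoming_size
        while free_space < incoming_size and cache:
            # Pass over documents of at least `threshold` bytes, oldest first.
            candidates = sorted(
                (d for d in cache if cache[d][0] >= threshold),
                key=lambda d: cache[d][1])
            for doc in candidates:
                if free_space >= incoming_size:
                    break
                size, _ = cache.pop(doc)
                free_space += size
                evicted.append(doc)
            if threshold == 0:
                break              # every document has been considered
            threshold //= 2        # next pass: half the size, then a quarter...
        return evicted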

2.1.4 Least Frequently Used (LFU)

This policy uses the frequency of web documents requests to select which

document to evict from the Web cache. The document that is accessed least

frequently is removed from the cache. Similar to LRU, this policy ignores the issue

of document sizes, which can lead to inefficient management of the web cache.


2.1.5 Static Caching

In this policy [TRS97], the cache of Web documents is updated only after a certain

time. The update is done using the Web server’s log to decide which documents

should be cached for the next period of time. No new documents can enter the

cache during that time period. This policy performs better than most policies given

a reasonably predictable flow of requests into the Web server, and has a very low

CPU overhead. However, its static nature means that it is not always a suitable policy for Web servers.
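A minimal sketch of such a periodic rebuild is given below, assuming a log of document names and a table of document sizes; it illustrates the idea rather than reproducing the exact policy of [TRS97].

    from collections import Counter

    def rebuild_static_cache(log, sizes, capacity):
        # Choose next period's cache contents from last period's access
        # log: most-requested documents first, until the cache is full.
        # `log` is an iterable of document names; `sizes` maps each
        # document to its size in bytes (both assumed for illustration).
        cache, used = set(), 0
        for doc, _count in Counter(log).most_common():
            if used + sizes[doc] <= capacity:
                cache.add(doc)
                used += sizes[doc]
        return cache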

2.1.6 Greedy Dual-Size (GDS)

This policy is based upon the LRU model. It has been proven to be online-optimal

[CI97]. This means that the GDS policy will perform at least as well as any other

online replacement algorithm. The results from [CI97] have shown it to outperform

existing replacement algorithms in many performance aspects, including hit ratios.

While this policy was introduced as a Web proxy cache replacement policy, it can

be easily adapted to Web server caches. As the policy fundamentally uses a

cost/size ratio to determine the worth of a document, the maximum document hit

ratio can be achieved with this policy by simply setting the cost of every document

to 1.

The Greedy Dual-Size policy works as follows. Each cached page has an associated value H. When a page is brought into the cache, H is set to the value of cost/size for that page, where cost is the cost of bringing the page into the cache. When space is needed, the page with the lowest value of H (min H) is evicted, and min H is subtracted from the H values of all remaining pages in the cache; when a page is accessed again, its H is restored to its full cost/size value. As this is a main-memory Web server cache and not a Web proxy cache, the cost of bringing in every document is set to 1, and therefore H is set to 1/size.
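To make this concrete, a minimal sketch of the GDS bookkeeping is given below. It uses the common implementation trick of a running offset L in place of subtracting min H from every cached entry; the class layout is an illustrative assumption, not the implementation from [CI97].

    import heapq

    class GreedyDualSize:
        # Sketch of GDS for a main-memory Web cache: cost = 1, so
        # H = 1/size. A running offset L stands in for subtracting
        # min(H) from every entry on eviction.
        def __init__(self, capacity):
            self.capacity = capacity
            self.used = 0
            self.L = 0.0
            self.entries = {}     # doc -> current H value
            self.sizes = {}       # doc -> size in bytes
            self.heap = []        # (H, doc) pairs, may hold stale records

        def access(self, doc, size):
            if doc in self.entries:                  # hit: refresh H
                self.entries[doc] = self.L + 1.0 / size
                heapq.heappush(self.heap, (self.entries[doc], doc))
                return True
            while self.used + size > self.capacity and self.entries:
                h, victim = heapq.heappop(self.heap)
                if self.entries.get(victim) != h:
                    continue                         # skip stale record
                self.L = h                           # "subtract" min H
                self.used -= self.sizes.pop(victim)
                del self.entries[victim]
            if self.used + size <= self.capacity:    # miss: admit document
                self.entries[doc] = self.L + 1.0 / size
                self.sizes[doc] = size
                self.used += size
                heapq.heappush(self.heap, (self.entries[doc], doc))
            return False

Keeping the offset L avoids touching every cached entry on each eviction, which matters for a cache holding tens of thousands of documents.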

2.2 Caching in a Clustered Web Server

In the context of a clustered Web server, we have to evaluate which of these

algorithms will be of most use. Policies which use predetermined fixed values

become much harder to implement in a clustered Web server, especially if the Web

server is to be scalable. These values have to be determined by running


simulations, and they strongly influence the performance of the Web cache. When we consider that there will be some sort of distribution

policy in a clustered Web server, it can be observed that policies that depend on

static values or on complex calculations are not suitable.

Each node in a clustered Web server will have its own Web cache in main memory,

and each of these caches will be managed by its own cache replacement algorithm.

At this level, the most important factors influencing the choice of cache

replacement algorithm are simplicity, performance (document hit ratios) and the ability to adapt the cache in an online fashion. In a clustered Web server each node

does not necessarily know the complete request stream being handled by the

cluster; it only sees what is sent to it. But in a cluster it is crucial that the nodes

are able to adapt to changes in their request streams.

Consider the scenario of a locality based distribution algorithm in a clustered Web

server. Requests of type A are sent to node a, type B to node b and type C to node

c. Each node’s cache will adjust its contents to achieve the highest document hit ratio possible. Now consider that node a becomes overloaded, either

through an increase in the number of requests of type A being handled by the server

or by node a experiencing a fault of some sort. Now nodes b and c will start to

receive requests of type A, as the distribution policy attempts to compensate for the

overloaded node a. If their cache replacement algorithms cannot quickly and

dynamically adapt to the change in the request stream they have been receiving, the

overall performance of the cluster will suffer greatly.

This area of main memory caching replacement algorithms should not be ignored

in the context of clustered web servers. The caching replacement algorithm chosen

to manage the back end nodes of the cluster will have a significant effect on the

document hit ratio. It was noted in [APS98] that using LRU instead of GDS as a cache replacement algorithm led to a 30% reduction in throughput. An

interesting experiment would be to analyse the effects of including a combination

of online optimal algorithms and more static cache replacement policies, such as

Static Caching. However this hybrid approach would require a specialized


distribution policy, and as such is beyond the scope of this paper. It is likely that

such an approach would perform better than the approaches tested in this paper, but

a problem with such approaches would be a lack of easy scalability.

2.3 Clustered Web Servers

The exponential growth of the Internet has created the need for more and faster

Web servers capable of serving over 100 million Internet users. In the past the only

solution for scaling server capacity has been to replace the old server with a bigger,

faster server. This system has the disadvantage of being very costly in that the

investment in the old server is completely wasted once an upgrade is needed. It is

in this situation that a clustered Web server is commercially a more viable

approach, and eminently more scalable. Rather than replacing the old server, we

simply add more nodes to our clustered Web server as demand for the Web service

increases.

There are many possible approaches to clustering a Web server. The first

requirement that all clustered Web servers must meet is that the cluster must be

transparent to client browsers, i.e. Web browsers are unaware of the server cluster;

the cluster must behave in the same manner as a single-server Web server from the

point of view of the web browser. Equally, transparency is important for the Web

server technology. Early commercial cluster-based Web servers such as Zeus and Inktomi [F97] did not make the clustering technology transparent to the Web server software. They formed an indissoluble whole rather than the layered architecture used

in fully transparent clustering. This means that the system needs specialized

software throughout, and this increases the cost and complexity of the system.

While these technologies outperform traditional single-server Web servers, they are

generally dependent on proprietary software.

Most of the research examined in this paper is concerned with clustering techniques

that are not only transparent to the Web client, but also to the Web server, as most

Web servers do not have any in-built clustering capabilities. We will now examine

several high-level views of basic cluster architectures and their pros and cons. The

terms used for defining the three separate types of clustering are taken from


[SGR00]. These types are layer four switching with layer two packet forwarding

(L4/2), layer four switching with layer three packet forwarding (L4/3) and layer

seven (L7) switching with either layer two packet forwarding (L7/2) or layer three

packet forwarding (L7/3). These terms refer to the techniques by which the cluster nodes are tied together. In a L4/2 cluster, for example, every node is

identical above (OSI) layer two.

Every type of cluster deploys a dispatcher, to which all connections to the Web

server are initially sent. The servers appear to be a single-server because of this

dispatcher. The dispatcher can either be a switch, where it processes incoming data

only, or a gateway, where it processes both incoming and outgoing data. Each

server will be employing stand-alone Web server software and need not be aware

of the states of other server nodes, as Web requests carry no state information.

2.3.1 Layer 4/2 Clustering

In L4/2 clustering, the network-layer address is shared by the dispatcher and all of

the servers in the pool through the use of primary and secondary IP addresses. All

the servers in the cluster pool have the cluster address as a secondary address. This

cluster address is the primary address of the dispatcher. All incoming requests for

the cluster address are addressed to the dispatcher at layer two.

When the dispatcher receives a TCP/IP connection initiation request, the dispatcher

selects a target from the pool of servers to serve the request. This is done by some

type of load sharing algorithm, usually a derivative of round-robin (Weighted

round-robin, WRR). The dispatcher then updates its tables so that future packets

arriving from that connection will be rewritten with the layer two address of the

chosen server. The original TCP/IP connection initiation request is then placed

back onto the network with the rewritten layer two address.
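The forwarding logic just described can be pictured as a small connection table. The toy model below illustrates the mechanism only: plain strings stand in for layer two addresses, a simple cycle stands in for the usual weighted round-robin, and the packet field names are invented.

    import itertools

    class L42Dispatcher:
        # On a connection initiation the dispatcher picks a server and
        # records the flow; later packets of that flow get their layer
        # two address rewritten to the chosen server's address.
        def __init__(self, server_macs):
            self.pool = itertools.cycle(server_macs)  # stand-in for WRR
            self.flows = {}   # (client_ip, client_port) -> server MAC

        def handle(self, packet):
            # Assumes every flow is first seen at connection initiation.
            flow = (packet["src_ip"], packet["src_port"])
            if packet["syn"]:                         # connection initiation
                self.flows[flow] = next(self.pool)
            packet["dst_mac"] = self.flows[flow]      # rewrite layer 2 only
            return packet                             # back onto the network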


The advantage of L4/2 clustering is that, because the network address of the server is identical to the one the client originally sent its request to, the server can reply directly to the client. Therefore the dispatcher only processes the incoming

data stream, which, in Web traffic between client and server, is a small proportion

of the total communication between client and server. The dispatcher also does not

have to recompute IP checksums in software because only layer two parameters

are modified. The drawback is that all the nodes must have a direct physical

connection to the dispatcher, due to layer two frame addressing. This is usually not

a problem, as servers in a cluster tend to be on a high-speed LAN.

L4/2 clustering is primarily limited in scalability by network bandwidth and by the dispatcher’s rate of processing requests, as these are the only parts of the transaction that are actually executed on the dispatcher, which is the potential bottleneck in this

architecture. There are plenty of examples of this clustering technique in the

marketplace, examples being ONE-IP [D97] and eNetwork Dispatcher by IBM

[H99].

2.3.2 Layer 4/3 Clustering

L4/3 clustering slightly predates L4/2 methods. In this architecture, every server in

the cluster has its own unique IP address. All requests sent to the cluster are

addressed to the dispatcher’s IP address. As above, the dispatcher decides upon a

Figure 1 - Traffic Flow in a L4/2 Cluster (requests pass through the dispatcher; servers reply directly to the client)


server to process requests for a new connection through a load balancing algorithm.

The dispatcher then rewrites the server’s IP address into the packet, and sends it

back out onto the network. However the dispatcher will also have to recompute any

integrity checks that will be affected, such as CRC and packet checksums.

When the dispatcher receives a packet that is not a connection initiation, it simply

inserts the new IP address into the packet, recomputes affected integrity checks and

resends out the packet. However packets sent from the servers in the pool to the

client must also travel through the dispatcher in this architecture. The dispatcher

will rewrite the source address to its own address, and recompute the integrity codes before forwarding the packet to the client.

It can be observed that L4/3 clustering will perform worse than L4/2 clustering,

given the added workload imposed upon the dispatcher in L4/3 clustering. Thus the

dispatcher becomes much more of a bottleneck, due to the increased demands on its

services by the cluster. Examples of L4/3 clustering are Magicrouter [APB96] and

LocalDirector from Cisco Systems.

2.3.3 Layer 7 Clustering

The area of L7 clustering is one in which a great deal of research is currently

ongoing. These clustering techniques use information contained in the OSI layer

seven (application layer) to augment L4/2 or L4/3 dispatching. This is known as

Figure 2 - Traffic Flow in a L4/3 Cluster (both requests and replies pass through the dispatcher)


content-based dispatching since it works based on the content of the client request.

It is in this area of clustering that both Locality-Aware Request Distribution

(LARD) [APS98] and Parallel Pull-Based LRU (PPBL) [C01] fall. Both of these

clustering architectures make use of a TCP handoff, an area that must be examined

further before examining LARD and PPBL in more detail.

From the introduction to the area of clustering techniques above, we can see that

L7 clustering offers the most potential improvement to the performance of a

clustered Web server. Content-based dispatching can be used to increase cache hits

on the servers, and thus improve the overall performance of the cluster. L7

clustering will still be based upon either L4/3 or L4/2 techniques, but the added

complexity of content-based dispatching is compensated for by the individual servers

in the cluster pool getting more requests that they can serve straight out of their

main memory caches.

2.4 TCP Handoff Protocol

When a dispatcher begins to examine incoming requests for content, in order to

make locality-aware assignments based on that information, we encounter some

problems that WRR avoids by simply assigning requests to the least loaded nodes,

irrespective of the content of that request. This is because in order to inspect the

contents of a request, a TCP connection must be established with the client prior to

assigning that request to a back-end server. This is true for any service (like HTTP)

that depends upon a connection-oriented transport protocol like TCP.

In order to circumvent this problem, a technique such as TCP handoff must be

used. If a TCP handoff is not employed, then we revert to a situation such as that

observed in L4/3 clustering techniques. At best we will have TCP splicing

[CRS99], where the data forwarding at the dispatcher node is done in the operating

system kernel. While this is an improvement on doing the relaying on a user-level

application, it still means that all back-end servers must send their responses to

requests through the dispatcher. The TCP handoff mechanism allows the back-end

servers to respond directly to the clients without passing through the dispatcher.

We will now examine a TCP handoff protocol in more detail.


In [APS98] a TCP Connection Handoff protocol is introduced. The protocol is

transparent to the clients and also to the server applications running on the back-

end server. Once a connection is handed off to a back-end, incoming traffic on that

connection (mostly ACK packets) is forwarded by a forwarding module located at

the bottom of the dispatcher’s protocol stack.

A typical connection scenario is:

1) A client connects to the dispatcher with a request for a web document.

2) The dispatcher at the front-end accepts the connection, examines the

request, and then hands off that connection to a suitable back-end server.

3) The back-end server takes over the established connection.

4) The back-end server replies directly to the client.

The dispatcher must still forward all packets coming along that connection to the back-end

server, but with this handoff having occurred the dispatcher no longer needs to forward the

back-end’s responses, as the back-end can now reply directly to the client.

2.4.1 TCP Handoff Protocol Problems

The problem with the TCP Handoff Protocol as outlined above is that for every

connection to the cluster, the dispatcher must effectively serialise the state of the

Figure 3 - TCP Connection Handoff (the dispatcher hands the established connection to a back-end, forwards subsequent ACKs, and the back-end replies directly to the client)


existing TCP connection, and then instantiate a TCP connection with that state on

the appropriate back-end server. This amount of work, coupled with forwarding all

current connection packets to the back-end serving that connection and making

decisions on which back-end to assign new connections to, means that the dispatcher in this scenario could quickly become a bottleneck in the clustered

Web server as the number of back-ends in the cluster increases.

Experiments were carried out in [ASDZ00] to investigate the scalability of the TCP

handoff protocol described above. Their results show that this architecture does not

scale well beyond four cluster nodes. This is obviously not an acceptable limit for a

scalable clustered Web server. This problem cannot be easily overcome by simply

adding more front-end dispatchers. Adding more dispatchers will introduce load

balancing problems on the front-end dispatchers; moreover, most content-aware

distribution strategies (like LARD) need centralized control for their distribution

policies.

[ASDZ00] goes on to suggest a scalable solution to the problems associated with

handing off TCP connections. Examining the dispatcher more thoroughly, we can

see that it is in fact performing two services: 1) dispatching the requests to the back-end

servers, and 2) interfacing with the client connection and handing off these

connections to the back-end nodes. 1) can be clearly termed the dispatcher, as this

is the task that implements the request distribution strategy. 2) can be termed the

distributor; it is essentially handling the TCP handoffs.

As noted in [ASDZ00], it is this distributor that is restricting the scalability of the

cluster, by creating a bottleneck at the front-end node. If this distributor can be

distributed over the back-end servers, then the bottleneck at the front-end node is removed. Their experimental results show that the processing overhead for handling a

typical TCP handoff is nearly 300µsec, whereas the dispatcher only requires

0.8µsec. Thus distributing the distributor over the back-end nodes will resolve the

bottleneck at the front-end of the cluster. Their experimental results show that a

centralised dispatcher implementing the LARD policy can service up to 50,000

connections/second on a 300MHz Pentium II machine. This is over an order of


magnitude greater than the connections/second performance of clusters

implementing a TCP splicing or handoff architecture as defined in the previous

section.

As we can see from the architecture suggested in [ASDZ00], the allocation of the

various parts of the clustered Web server has changed. We now have a switch

accepting requests from clients. This is a layer 4 switch, and can thus be a scalable

hardware-based switch product. Note that this switch behaves like the front end of a WRR clustered Web server; all it does is distribute requests to back-end servers based on layer 4 information. Therefore at this stage no TCP connections need to

be established.

As each connection is a completely separate entity, every connection can be

distributed as appropriate for load balancing measures. Once the back-end server receives a connection request, and the destination of the connection has been determined by the centralised dispatcher, it behaves according to the original TCP handoff protocol. This has the added advantage that roughly 1/k of all requests (where k is the total number of nodes in the cluster) will not need a TCP handoff operation, as they will already be at the back-end assigned to them.

In this cluster architecture, a typical connection scenario would look like this:

1) The client uses TCP/IP to connect to the chosen distributor (chosen by the

switch)

Figure 4 - A Distributed Distributor Design (a layer 4 switch in front of per-server distributors, with a single centralised dispatcher)


2) The distributor component accepts the TCP connection, and parses the

client’s request.

3) The distributor contacts the dispatcher to learn which back-end will serve

the request.

4) The distributor hands off the TCP connection to the appropriate back-end

server.

5) The server takes over the connection, and replies directly to the client.

6) The switch is notified of the handoff, and any subsequent TCP

acknowledgements are forwarded directly to the appropriate back-end

server.

This proposed architecture solves the bottleneck problem we observed above, when

the dispatcher and distributor are both located on the front-end of the cluster. This

architecture is a reasonable one for any content-based distribution algorithm, where

the TCP handoff problem needs to be addressed. It also has the advantage of being

eminently scalable, as each of its components is replicable, apart from the

dispatcher component. However further on in this paper we will examine PPBL,

which proposes a distributed dispatcher, and the implications of this will have to be

evaluated in the context of the architecture proposed above.

2.5 Clustered Web Server Architectures

2.5.1 Weighted Round-Robin (WRR)

This is the distribution policy commonly used by state-of-the-art commercial clustered Web servers. The front-end dispatcher in this architecture is simply a

Layer 4 switch that distributes requests to the back-end nodes based purely on load

balancing requirements. The algorithm used is an improved variant of Round-Robin,

where the weighting is determined by such factors as current CPU load and disk

reading activity, current connections to each node and other factors that vary depending on its weighting policy. The front-end dispatcher achieves good load

balancing and low idle time rates on the back-end servers. However, as the front-

end dispatcher does not open a connection to the client, it cannot make request-

based decisions. This means that the TCP handoff problem is avoided, but at a cost.


Before assigning requests to back-end servers, the front-end dispatcher must have a

dynamically evaluated weight associated with each back-end server that is

proportional to the server load state. This weight is re-evaluated periodically. The

three main factors that are taken into account are the loads on CPU, disk and

network resources. Typical measures to estimate the load are the number of active

processes on the server, mean disk response time and hit latency time, that is to say the average time each request takes to complete on that server. These are the

factors that a WRR cluster takes into account while calculating load balancing.
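A minimal sketch of such periodic weighting is given below; the combining formula and its coefficients are invented for illustration, and a weighted random draw stands in for the deterministic weighted round-robin cycle.

    import random

    def server_weight(active_procs, disk_ms, latency_ms, a=1.0, b=0.5, c=0.5):
        # Fold the three load measures into one number; higher load
        # means lower weight. The coefficients are invented tuning knobs.
        load = a * active_procs + b * disk_ms + c * latency_ms
        return 1.0 / (1.0 + load)

    def pick_server(stats):
        # `stats` maps server -> (active_procs, disk_ms, latency_ms),
        # refreshed periodically by polling the back-ends.
        servers = list(stats)
        weights = [server_weight(*stats[s]) for s in servers]
        return random.choices(servers, weights=weights)[0]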

WRR clusters have a very high cache miss ratio, as locality is ignored by the front-

end dispatcher. This means that the throughput of the cluster is limited by disk

accesses, caused by cache misses. The effective size of the cluster cache is that of a

single node in the cluster, independent of the number of nodes in the cluster. This is

because each cache is completely independent of every other cache in the cluster,

and no attempts are made by the cluster to improve the hit rate in these back-end

caches. This has the interesting result of the size of each node’s cache being a key

parameter in how the cluster performs.

The WRR cluster architecture is easily scalable, being bound only by the speed of

the switch at the front-end of the cluster. Another interesting problem arises when services other than static Web document requests are required. If

the cluster is offering dynamically generated Web documents, or services such as

streaming video, the usefulness of the WRR clustered Web server becomes limited.

Without a priori knowledge of the type of request issued, intelligent decisions

about which node to assign this request to cannot be made. Thus specialized nodes

that are dedicated to a certain type of request are of limited use in this environment,

as they will be used in an inefficient way (i.e. the switch assigns a request to a back-end node, which then communicates with the specialised service provider, and relays the output of that service provider to its client). As the uses of the Web server

become more diversified, and more services are being offered to clients, it can be

foreseen that the WRR approach will become less useful. However it must be noted that

the WRR clustering approach is a robust, eminently scalable approach, limited


only by the capabilities of the front-end dispatcher, which will always provide evenly

balanced response times to all client requests to the cluster.

2.5.2 Harvard Array of Clustered Computers (HACC)

The HACC represents one approach to building a clustered Web server [ZBCS99]. We will

explore the architecture of this system, and examine its advantages and drawbacks.

The HACC cluster is designed to cater for locality enhancement, aiming to improve

the relative size of the clustered Web cache (by localising requests to back-end

nodes that have already served requests of this nature), load balancing and ease of

scalability.

The HACC system employs a “Smart Router” at the front end of the cluster. This

Smart Router replaces the Layer 4 switch that would be used in a WRR approach to

a clustered Web server. This Smart Router in their implementation can be seen to

equate to a front end consisting of a centralised dispatcher and distributor. The

Smart Router can be divided into two layers, the High Smart Router (HSR) and the

Low Smart Router (LSR). The LSR equates to the low-level kernel resident part of

the system, while the HSR represents the high-level application layer decision

making part of the system. This is intended to separate mechanism (LSR) from

policy (HSR).

The LSR is responsible for TCP/IP connection set up and termination, for

forwarding requests to back-end servers and for forwarding documents back to the

clients from these back-end servers. The LSR module can be clearly equated with

the distributor module defined in the section above. This module is primarily

concerned with performance, as it is on the critical path of every request handled

by the cluster, and if its performance is not satisfactory, then this module will

quickly become a bottleneck for the system.

In their design, the LSR is implemented as a Windows NT device driver that

attaches to the top of the TCP transport driver. The LSR listens on the well-known

Web server port for connection requests. When a connection request is received,

the URL is extracted from the connection request, and passed up to the HSR. The


LSR then enqueues all data from the incoming connection and waits for the HSR to

indicate which node to forward this data to. The LSR will continue to ferry data

back and forth between the client and the target back-end server until the

connection is closed.

The duties of the HSR mean it can be identified as a dispatcher. It is responsible for

monitoring the state of the document store and the state of the back-end servers in

the system. It uses this information to make decisions about how to distribute

requests over the HACC cluster. In [ZBCS99] they have implemented two

dispatcher algorithms, one designed to handle requests for static files and one for

the document store used by Lotus Domino, which is intended to represent an

instance of a Web Application Server (WAS), which generates documents on the fly and requires a great deal of compute power on the back-end servers.

The HSR will assign requests to back-end servers in a locality-aware fashion,

attempting to improve the overall performance of the cluster by doing so. In the

case of static files, this is simply a matter of selecting the server that has been

assigned the task of serving requests for that document. In the case of WAS type

requests, the issue is more complex. For Lotus Domino requests, assumptions can

be made about the request. Lotus Domino requests contain both a Notes Object and

a requested action in the URL. The HSR thus attempts to forward requests to back-

end servers based on these two parameters.

Load balancing is also carried out by the HSR module. It collects load statistics

from each back-end server at periodic intervals. In their prototype the performance

metrics utilised are CPU utilisation and bytes transferred to/from disk per second.

When the HSR receives requests for data as yet unassigned to a specific back-end,

it assigns them to the least loaded back-end. The HSR also attempts to offload a

proportion of the documents of an overloaded node onto the least loaded node

when such a situation occurs; overload is detected when the load of the most loaded back-end server exceeds that of the least loaded server by a certain amount.


HACC claims to be a more scalable cluster architecture than LARD, which we will

examine later in this paper. It claims to be able to handle WAS type requests, and

so is more suited to a situation where dynamic requests are generated in the system.

This is true to a certain extent. There are several disadvantages to this system that

we will now explore.

As we explored earlier, in a scalable clustered Web server, the front-end

dispatcher/distributor module can easily be overloaded with work, and become a

critical bottleneck to the system. This is certainly the case in the HACC design. As

the results of [ASDZ00] show, this architecture will quickly reach its throughput limit, as the Smart Router becomes a bottleneck for the cluster.

This was seen to occur with only four back-end servers, and adding more back-end

servers does not increase the throughput of the cluster. However, as is suggested in

[ASDZ00], the distributor (LSR) could be distributed over the back-end servers,

thus freeing up this bottleneck.

The issue of WAS requests and static requests is explored in this architecture. The

authors identify that it is a non-trivial task to determine how to combine a set of

performance statistics into a single metric that reflects a back-end server’s real load

in the context of WAS requests. This problem becomes more complicated when a

cluster is dealing with a variety of requests for static and dynamic documents. In

their design, the HSR appears to handle one type of documents, or the other. The

problem is that requests for dynamically generated documents cannot be easily

partitioned for locality of reference as static documents can.

Dynamic requests can only sometimes offer the dispatcher information that can be

used to make content aware distribution decisions. The problem is that it is a very

broad area, and every type of WAS will have different request types, and these will

not necessarily be interpretable by a dispatcher in any useful, content-aware way.

The HACC system suggests dedicating back-end servers to these types of WAS,

which are CPU intensive. For example in one of their scenarios the trace file

processed contains a mixture of static file and ASP requests. They suggest


separating ASP requests from static requests, and sending them to different back-

end servers.

Their results show that their architecture is most suited to dealing with requests of a

dynamic nature, largely because the overhead of the Smart Router is not as much of

an issue when the requests incur heavy CPU computation on the back-end servers. Their

architecture does not appear to be satisfactorily scalable, especially when a mixture

of dynamic and static requests are considered. The solution they suggest for this

scenario is at best an ad-hoc solution, and while it would work, would require

much experimentation on the cluster to achieve the appropriate balance of back-end

servers that can handle either static or dynamic requests.

The HACC cluster architecture has thus introduced some interesting problems into

the area of clustered Web servers. We can see that the issue of dynamic requests,

and how to handle them is a non-trivial one. We have seen that dynamic requests

bring an interesting problem to content-aware distribution policies. If the WAS serving dynamic requests can provide information that the distribution policy can

interpret from the URL of such requests, then it can make some attempt to

distribute the requests in a locality-aware way. However this information is not

always available. Possible solutions to the problem appear to be either to have

dedicated back-end servers to deal with these types of requests or to allow all back-

end servers to handle these requests.

2.5.3 Client-Aware Dispatching Algorithm (CAP)

The Client-Aware Policy (CAP) is a dispatching policy motivated by a desire to

improve load sharing in Web clusters that provide multiple services such as static,

dynamic and secure information [CC01]. This is an interesting approach to dealing

with the problem of diverse services being offered by a clustered Web server, and

is worth looking into in some detail.

The authors of [CC01] argue that as the heterogeneity and complexity of services

and applications provided by Web sites continuously increase, so too must the

clustered Web server adapt to deal with these increasing demands. They argue that


content-aware dispatching policies that wish to handle requests in a heterogeneous

and dynamic system require expensive mechanisms for monitoring and evaluating

the load on each back-end server, as they gather results, combine them into

meaningful measurements and make real-time decisions based on these

measurements. They propose a dispatching policy based only on client requests.

The client-aware policy depends upon classifying incoming requests. In [CC01]

they classify Web services into four categories.

Web publishing: Sites that provide static information and dynamic services that do

not intensively use server resources. The content of dynamic requests is not known

at the instant of a request, but the results are generated from database queries

whose arguments are known beforehand.

Web transaction: Sites that provide dynamic content generated from possibly

complex database queries. This is a disk bound service as it makes extensive use of

disk resources.

Web commerce: Sites that provide static, dynamic and secure information. Security

may be necessary, and so use of the SSL protocol is expected. Web commerce

services are disk and/or CPU intensive.

Web multimedia: Sites providing streaming audio and video services. These

services are not considered in [CC01], as such services are mostly offered by

specialised servers and network connections.

The basic idea behind CAP is that although the dispatcher cannot estimate the

precise amount of time a request will take, especially when requests are of a disk-

bound or CPU intensive nature, it can distinguish the class of the request, and its

impact on Web server resources. CAP groups the above categories into four

classes: static and lightly dynamic Web services (N), disk bound services (DB),

CPU bound (CB) and disk and CPU bound (DCB). The dispatcher recognises

incoming requests as belonging to one of the classes above (as this policy is a

content-aware policy), and maintains a circular list of assignments for each class.

The dispatcher thus works in a Round Robin fashion, attempting to keep the loads

of the back-end servers balanced by assigning requests of the various classes to the

back-end nodes in a circular fashion.
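The per-class circular assignment can be sketched as follows; the URL classification rules shown are hypothetical, since real classification is necessarily site-specific.

    from itertools import cycle

    CLASS_RULES = {                    # hypothetical URL-based rules
        ".html": "N", ".gif": "N",     # static / lightly dynamic
        ".cgi": "DB",                  # database-backed queries
        ".pl": "CB",                   # placeholder for CPU-bound scripts
    }

    class CapDispatcher:
        # One circular server list per request class, so each class of
        # work is spread evenly over the back-end nodes.
        def __init__(self, servers):
            self.rings = {c: cycle(servers) for c in ("N", "DB", "CB", "DCB")}

        def classify(self, url):
            for suffix, cls in CLASS_RULES.items():
                if url.endswith(suffix):
                    return cls
            return "DCB"               # unknown: assume disk and CPU bound

        def assign(self, url):
            return next(self.rings[self.classify(url)])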


This is an interesting approach to the problem of dynamic and varied Web services

being provided by a clustered Web server. CAP attempts to balance the load over

the back-end nodes in a more intelligent, proactive (rather than the reactive

approach taken in WRR for example) manner. CAP has the advantage of not

requiring hard tuning of parameters required by most dynamic policies. As we

argued above, this is a valid difficulty facing implementations of clustered Web

servers. A cluster where a central dispatcher makes load balancing decisions based

on parameters collected from the back-end servers at regular intervals, and attempts

to exploit locality in its cluster by partitioning the working set, will encounter real

problems when Web services that demand high CPU and disk activity are

introduced. The problem is such policies will require a high amount of tuning of

parameters to be able to work correctly. The CAP system does not require this,

once the requests have been classified correctly.

A drawback of this system is that requests that could normally be dealt with quickly will not be, in a system employing CAP. This is because CAP is strongly

concerned with load balancing, and while it is a content-aware policy, it chooses not to use this information to its full potential. Clustered Web servers employing CAP

will have to deal with the TCP handoff problem, and the additional overhead that

adds to the system. In fact, CAP can be equated more with WRR than with other content-aware distribution strategies. While CAP incurs the penalties involved in a Layer 7

dispatcher, it only concerns itself with load balancing. While WRR does this in a

reactive fashion, CAP uses classifications of requests to attempt to balance the

loads of the back-end nodes in a proactive manner. While CAP introduces some interesting issues into the area of dispatching policies, it is doubtful whether this is the “silver bullet” for a clustered Web server providing multiple Web services.

2.5.4 Locality-Aware Request Distribution (LARD)

The locality-aware request distribution (LARD) strategy [APS98] examined below

is a form of content-based request distribution, focusing on improved cache hits

amongst the back-end servers in the clustered Web server. The motivation behind

this strategy was to create a content-aware distribution strategy


that took into account the service/content requested and the current load on the

back-end servers when deciding which back-end to allocate a given request to. This

strategy is similar to the HACC dispatching strategy discussed above; however, its load-balancing and dispatching algorithms are less dependent upon commercial products for measuring load. The

challenges identified by the authors were creating a strategy that simultaneously

achieves load balancing and high cache hit rates on the back-ends, and the creation of a protocol that allows the front-end dispatcher to hand off established client connections to a back-end server. The second of these challenges (TCP handoff)

has been explored above, and the distributed architecture suggested in [ASDZ00]

was designed with LARD in mind.

The authors of LARD acknowledge the need for clustered Web servers to be

capable of presenting an aggregate cache that is greater than any single back-end’s main memory cache. This is because as the demands on what a Web server can do increase, so too do the demands on the caching of that Web server. Round-robin distribution effectively limits the cluster’s working set to what can fit into a

single main memory cache. With LARD the effective cache size approaches the

sum of the back-end’s cache sizes. Thus adding extra back-end servers to a cluster

of this nature can effectively increase the working set of that cluster as well as

accommodating increased traffic (by having additional CPU power).

The basic LARD algorithm works as follows. The dispatcher maintains a one-to-

one mapping of targets to back-end servers. When a request comes in for a new

target, it is assigned to the least loaded node in the cluster.

Figure 5 - Locality-Aware Request Distribution

Load in LARD is

measured by the number of current connections to a back-end server that have not

yet been completed. This is quite a simple method of measuring load in the back-

end servers, but has the advantage of not requiring many CPU calculations to

collect and compute. Servers with idle CPU time will tend to have active

connections approaching zero, while overloaded servers will have an increasing number of active connections. By monitoring the number of active connections to a

back-end server the dispatcher can estimate the relative load on a back-end without

having to communicate explicitly with the back-end servers. This is in contrast

with WRR and the HACC system, where the front-end dispatcher regularly polls

the back-end nodes for data in order to compute the relative loads on those back-

ends.

LARD partitions the working set as requests are entered into the cluster,

dynamically distributing this working set over the back-end nodes. LARD aims

to only reassign targets when there are back-ends going idle while other back-ends

become overloaded. Re-assignment will naturally result in cache misses on the new

node, and so re-assignment should not be done to balance out temporary load

imbalances. Tlow is defined as the load (number of connections) below which a

back-end is likely to have idle resources. Thigh is defined as the load above which a

back-end is likely to cause substantial delays in serving requests. Once a back-end

reaches 2Thigh a target will automatically be allocated to a lesser loaded back-end,

even if no back-end has a load of less than Tlow.
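The assignment rule just described can be sketched in Java as follows; the Node interface and the threshold values are illustrative assumptions, with load() standing for a node's count of active, uncompleted connections.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A sketch of the basic LARD assignment rule; thresholds are assumed values.
    public class BasicLard {

        interface Node { int load(); }        // load = active, uncompleted connections

        static final int T_LOW = 25, T_HIGH = 65;   // illustrative Tlow / Thigh

        private final Map<String, Node> assignment = new HashMap<>();
        private final List<Node> nodes;

        BasicLard(List<Node> nodes) { this.nodes = nodes; }

        Node dispatch(String target) {
            Node n = assignment.get(target);
            if (n == null                                       // new target
                    || n.load() >= 2 * T_HIGH                   // badly overloaded node
                    || (n.load() > T_HIGH && leastLoaded().load() < T_LOW)) {
                n = leastLoaded();                              // (re-)assign the target
                assignment.put(target, n);
            }
            return n;
        }

        private Node leastLoaded() {
            Node best = nodes.get(0);
            for (Node n : nodes) if (n.load() < best.load()) best = n;
            return best;
        }
    }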

The front-end also limits the number of connections allowed concurrently in the

cluster, so as to prevent the cluster from behaving like WRR as the load on all

nodes rises towards 2Thigh. To avoid this, LARD limits the total sum of connections

to the value of S, where S = (n - 1) * Thigh + Tlow - 1, and n is the number of back-end server nodes in the cluster. Setting S to this value means that at most n - 2 nodes can

have a load >= Thigh, while no node has a value less than Tlow. S also ensures that

enough connections are admitted to ensure that all n back-ends can have a load

greater than Tlow, and still have room for limited imbalance between the back-ends

(and therefore avoiding continual re-assignment of targets on “overloaded” back-

ends).
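For example, with hypothetical thresholds of Thigh = 65 and Tlow = 25 in a cluster of n = 8 back-ends, the front-end would admit at most S = 7 * 65 + 25 - 1 = 479 concurrent connections.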


Basic LARD can be improved by using replication. Replication is desirable to

allow the cluster to assign more than one back-end to serve popular requests. LARD with replication differs from basic LARD as follows. The front-

end dispatcher maintains a mapping of a target to a set of back-end servers.

Requests are passed to the least loaded back-end in that set. When a load imbalance

occurs, the dispatcher checks to see if the requested document’s server set has

changed recently (within t seconds). If it has, the dispatcher selects a lightly loaded

back-end and adds it to the server set of that target. If the server set has not changed

recently (within t seconds) and has multiple servers, then the dispatcher removes a

back-end server from the server set of that target. This ensures that a target that was popular in the past, but is no longer, does not continue to be served by several back-ends.
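A Java sketch of this server-set bookkeeping, following the description above, might look as follows; the window t, the Node interface, the way the imbalance signal is produced, and the choice of which server to drop from a stale set are all illustrative assumptions.

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // A sketch of the server-set bookkeeping of LARD with replication.
    public class LardReplication {

        interface Node { int load(); }

        static final long T_MILLIS = 20_000;       // assumed value of the window t

        static class ServerSet {
            final Set<Node> servers = new HashSet<>();
            long lastChange;                       // when the set last grew or shrank
        }

        private final Map<String, ServerSet> sets = new HashMap<>();
        private final List<Node> nodes;

        LardReplication(List<Node> nodes) { this.nodes = nodes; }

        // 'imbalanced' signals that the front-end has detected a load imbalance.
        Node dispatch(String target, boolean imbalanced) {
            long now = System.currentTimeMillis();
            ServerSet s = sets.computeIfAbsent(target, k -> new ServerSet());
            if (s.servers.isEmpty()) {                       // first request for target
                s.servers.add(leastLoaded(nodes));
                s.lastChange = now;
            } else if (imbalanced) {
                if (now - s.lastChange < T_MILLIS) {
                    s.servers.add(leastLoaded(nodes));       // popular target: replicate
                    s.lastChange = now;
                } else if (s.servers.size() > 1) {
                    s.servers.remove(mostLoaded(s.servers)); // stale set: shrink it
                    s.lastChange = now;
                }
            }
            return leastLoaded(s.servers);                   // serve from its server set
        }

        private static Node leastLoaded(Collection<Node> c) {
            return Collections.min(c, Comparator.comparingInt(Node::load));
        }

        private static Node mostLoaded(Collection<Node> c) {
            return Collections.max(c, Comparator.comparingInt(Node::load));
        }
    }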

[APS98] provides results that show considerable performance improvement on

state-of-the-art WRR. With a working set that does not fit into a single back-end’s main memory cache, the throughput achieved by LARD exceeds that of WRR by a factor of two to four. It must be noted, however, that all these results are derived

from tests with static documents. The performance of LARD with dynamic

requests was not tested in their test bed. In tests against CAP [CC01] using traces

containing 20% dynamic documents, the LARD policy falls 16 to 21% short of the

performance of CAP. This is not surprising, considering that the LARD policy as it

stands makes no real attempt to deal with dynamic documents, which will increase

the load of a back-end server dramatically, without this becoming apparent to the

load balancing attempts of the front-end dispatcher. This is because load is measured by the number of active connections to each back-end, which does not take into account that some of these requests might be of a dynamic nature.

As is suggested in [ZBCS99], LARD might be able to handle dynamic requests in a

more satisfactory way if weighting were given to dynamic requests, so that a

back-end node serving dynamic requests will show this in its weighted load. This

methodology would however mean that extensive testing and fine tuning would

have to be done to determine how much weighting should be given to each type of dynamic request, as is argued in [APS98]. Perhaps a LARD-driven cluster serving


both dynamic and static requests would need to have its back-end nodes partitioned

into sets that serve static requests and sets that serve dynamic requests. However

more research is needed into the nature of dynamic requests, and into the extent to which dynamic requests can be cached, before the issue of clustered Web servers offering both traditional static Web documents and the more diverse array of services available from Web Application Services (WAS) can be addressed more fully.

2.5.5 Parallel Pull-Based LRU (PPBL)

The PPBL request distribution algorithm suggested in [C01] is an interesting

variation on the clustered Web server architectures suggested above. Whereas all

the distribution algorithms explored so far are push-based algorithms, PPBL (as its

name suggests) is a pull-based distribution algorithm. The logic behind this

decision is an attempt to distribute the dispatcher over all nodes, and by doing so to

relieve the front-end of this task. The reason that this approach is justified in [C01]

is that the architecture of their cluster is based upon a Distributed Shared Memory

(DSM) system using a Scalable Coherent Interface (SCI) memory mapped

network. This DSM should be utilised by PPBL in order to improve the decision making and overall performance of the distributed dispatcher.

The PPBL dispatching algorithm introduces several new features not yet seen in

the Web server clusters investigated above. While the architecture proposed in

[C01] is designed for a clustered Web proxy cache, its architecture is equally valid

for a clustered Web Server. This is mainly due to the fact that the approach

suggested is a simplistic one, and thus did not need to be optimised for a Web

proxy cache environment. While the designers of the system envisaged the system

compressing and uncompressing files before sending them to the clients, this

scenario is not implemented in their prototype, and therefore their results are

equally valid for a clustered Web server as they are for a Web proxy cache. As we

have outlined above, the performance of the cluster’s total cache is of vital significance to the performance of the clustered Web server as a whole, and so the

architecture they have put forward is just as valid for use in a clustered Web server.


The architecture of the system is quite a simple one. Since the job of dispatching is distributed over the back-end servers, the task of the front-end is once again much reduced. If we employ the scalable architecture suggested in [ASDZ00], then the front-end is also not responsible for the distributor tasks. Therefore in this architecture, it is anticipated that the front-end will not become a bottleneck for the system.

The system consists of a front-end and back-end nodes as usual. The front-end

manages an incoming queue of requests, and the back-end nodes each have a

section of the working set of documents that they own in a URL table and an LRU cache of these documents. The URL table consists of the names of the documents that the back-end owns, and a three-slot LRU list providing the identifiers of the last nodes that served each document (called its serving LRU).

Incoming requests to the cluster go through a front-end as normal. The task of the

front-end is to insert the requests into an incoming queue. The back-ends search

this incoming queue for requests that have not yet been claimed. Each back-end performs a search of its URL table to check whether it owns the request or not. This search is done in LRU

order. If a back-end node does own the request, then it updates the incoming

request queue entry with its node identifier. At this point the back-end node has

signalled to the front-end that it is handling this request, so the request is then

added to the work queue of that back-end node.

If a request is not found to be in the current contents of any back-end nodes, then

the front-end chooses the least loaded node, and assigns the request to that back-

end node. That node then updates its URL table accordingly. A Status table is also

maintained, containing the current loads of all the back-end nodes. This is regularly

updated by the back-end nodes. In the PPBL system, a node’s load is estimated by the amount of data it still has to serve. The least loaded node is the one with the lowest

amount of data to serve.
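One round of this pull-based lookup, together with the fall-back assignment by the front-end, might be sketched in Java as follows. The names and the structure are illustrative assumptions, and the DSM locking discussed below is ignored here.

    import java.util.LinkedHashSet;
    import java.util.List;

    // A sketch of the PPBL ownership lookup and fall-back assignment.
    public class PpblSketch {

        static class QueueEntry {
            final String url;
            final int size;                        // bytes, used as the load measure
            volatile int owner = -1;               // node id once claimed or assigned
            QueueEntry(String url, int size) { this.url = url; this.size = size; }
        }

        static class BackEnd {
            final int id;
            final LinkedHashSet<String> urlTable = new LinkedHashSet<>(); // owned URLs
            long bytesPending;                     // total size still to be served
            BackEnd(int id) { this.id = id; }

            // Scan the shared incoming queue and claim every entry this node owns.
            void lookup(List<QueueEntry> incoming) {
                for (QueueEntry e : incoming)
                    if (e.owner < 0 && urlTable.contains(e.url)) {
                        e.owner = id;              // signal the claim to the front-end
                        bytesPending += e.size;
                    }
            }
        }

        // Front-end: entries that no back-end claimed go to the least loaded node,
        // which then records the URL in its own URL table.
        static void assignUnclaimed(List<QueueEntry> incoming, List<BackEnd> nodes) {
            for (QueueEntry e : incoming)
                if (e.owner < 0) {
                    BackEnd least = nodes.get(0);
                    for (BackEnd b : nodes)
                        if (b.bytesPending < least.bytesPending) least = b;
                    least.urlTable.add(e.url);
                    e.owner = least.id;
                    least.bytesPending += e.size;
                }
        }
    }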

If a back-end has found that it owns a request, but is overloaded (according to some

static threshold), then the node to which this request will be passed is determined


from the URL table entry’s serving LRU. If this is empty, then the least loaded

node is chosen and added to the serving LRU. If the back-end is the least loaded

node in the cluster, then it keeps the request. Each node’s main memory cache is

run by the LRU cache replacement algorithm. This is an aspect of this architecture

that could be easily enhanced by using Greedy Dual Size or another of the more

optimised versions of LRU, and this is acknowledged by the authors of this paper.

The implementation of this architecture involves a great deal of synchronisation

between the dispatcher modules on the back-end nodes and the front-end. This is

done through a producer/consumer model using the distributed locks provided by

the DSM and its components. The incoming queue simply resides in the DSM, and

is thus accessed quickly and easily by all nodes in the cluster. Each client

connection is handled by a dedicated thread in the front-end.

Every incoming queue entry has one start and one end lock. The request is first put

in the queue. Then the start lock is released to make the back-ends start the lookup.

Then the dedicated thread waits for the end lock for lookup completion. This is the

rendezvous with the back-end nodes. If none of the back-ends claim the request,

then it is added to the least loaded node as mentioned above. Once the request has

either been claimed or assigned, then the request can be removed from the

incoming queue.

The back-end process can also be multithreaded to allow multiple concurrent lookups. The

lookup thread first waits for a request to become available (the consumer waits for

the producer). Then the thread performs the lookup, and if it is found, the thread

inserts its node identifier in the incoming queue request, to acknowledge this. The

end of the lookup is signified by releasing the end lock.

This is the basic implementation of the architecture. The authors suggest further

improvements to optimise the performance of the algorithm. A completion queue

and index table are added to the front-end. The completion queue has the same

number of entries as the incoming queue. When a back-end has checked request i in the incoming queue, it performs an atomic fetch-and-increment on completion queue entry i. If this value is equal to the number of back-end nodes minus one, then this node was the last node to look up that request, and it alone has the duty of releasing the end lock. This reduces the number of lock releases required per incoming queue request.
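In Java this optimisation might be sketched as follows, with AtomicInteger standing in for the DSM's atomic fetch-and-increment primitive; the class and method names are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;

    // A sketch of the completion-queue optimisation described above.
    public class CompletionQueue {

        private final AtomicInteger[] completed;   // one counter per incoming-queue slot
        private final int backEnds;                // number of back-end nodes

        CompletionQueue(int slots, int backEnds) {
            this.backEnds = backEnds;
            completed = new AtomicInteger[slots];
            for (int i = 0; i < slots; i++) completed[i] = new AtomicInteger();
        }

        // Called by a back-end after it has looked up request i. Returns true iff
        // this caller was the last node to check the slot and must therefore
        // release the end lock; all other callers skip the lock release entirely.
        boolean lookupDone(int i) {
            return completed[i].getAndIncrement() == backEnds - 1;
        }
    }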

The index table is used to solve the lookup starting problem in the saturated case. If

there are more requests to check in the incoming queue, then there is no need for

each thread to wait for the start lock to be released. Therefore each node, including the front-end, indicates the next request it is going to process. If the index of a back-end node is greater than the front-end’s index, the back-end must block and wait for a new request to be inserted into the incoming queue.

The PPBL system can be defined as a content-aware approach to clustered Web

servers. In contrast to the centralised dispatchers we have encountered in every

other system so far examined, we here have a distributed dispatcher. The TCP

handoff problem is not explored in detail in this paper, apart from commenting that

the SCI architecture could be used to improve the performance of this protocol. The

TCP handoff remains an issue that PPBL has to deal with, as the contents of

requests are examined before a decision is made as to which server node will deal

with that request. Therefore we can assume that an approach similar to that

suggested in [ASDZ00] will be taken, with distributed distributors as well as

dispatchers being implemented in the system.

The issue of load balancing has also been examined by this approach. It can be seen

to be more of a reactive approach to load balancing, rather than the proactive

attempts made by LARD, for example. However, it is interesting to note that the

load balancing evaluation is carried out on the back-end node in question, which is

something that does not occur in any of the other systems. The advantage is that, while still being a reactive system, overloading should be detected more quickly. This is because while other systems use a timer to

collect load information from the back-end nodes, here it is the back-end node itself

that determines when it is overloaded.


[C01] presents some early results of the PPBL system. The authors report a steady

increase in the throughput per second as the number of nodes in the system

increases. However this paper does not measure the document hit rate achieved in

this system, a measurement that is vital to an efficient Web server.

2.6 Distributed Shared Memory Based Clustered Web

Servers

The past decade has seen a two-order-of-magnitude increase in processor speed, and only a two-fold improvement in disk access time. The gap between disk and

processor speeds has been steadily increasing, and will continue to do so. There

have been significant improvements in the network speeds of Networks of

Workstations (NOWs). This means that the network is now considerably faster

than disk for transferring data to and from main memory, a performance

enhancement due primarily to the fact that remote-memory accesses avoid the

expensive seek and rotational latencies associated with disk access [F96].

In a Web server, the disk access times are reduced by the use of an efficient cache.

In a clustered Web server, the Web cache becomes a distributed and perhaps

partitioned object, depending on the design of the clustered Web server. The

number of requests for Web documents that involve a disk access could be reduced by having a more globally-aware cache management system. In most current clustered Web servers, once a request reaches a node, that node relies solely on its own cache to keep disk accesses to the absolute minimum. However, a cluster utilising a global

shared memory model could well improve on the performance of its nodes by

taking advantage of the benefits that the shared memory model provides by

fetching a document stored in a remote node’s cache rather than fetching it from

disk.

The shared memory model that we will be examining is Kaffemik [AWCJC00].

Kaffemik is a distributed Java virtual machine (JVM) that runs on a cluster of

workstations interconnected by a Scalable Coherent Interface (SCI) [IEEE93].


Kaffemik is based upon a small area network using a Single Address Space (SAS)

system to support a shared object space. This provides us with a clustered

architecture that will support remote accesses to other nodes. [CKP98]

demonstrates that a page could be accessed over 50 times faster from remote

memory than from a local disk. As Kaffemik is implemented on SCIOS/SCIFS

[KHCP99], this is a significant and interesting statistic.

3 Simulation design and requirements

We have examined some clustered Web server architectures. Most of them are not

designed for a cluster employing a shared memory model. PPBL is the exception

here, as it is designed specifically for such an architecture, and indeed takes

advantage of specific services provided by the API of the system it is implemented

on to improve performance. These improvements are mostly involved with

synchronisation, and not with utilising remote nodes to provide the cluster with a more intelligent, globally-aware caching algorithm.

A simulation was required to examine the behaviour of clustered Web server architectures, and more specifically the behaviour of the caches in the cluster. The architectures that we decided to simulate were PPBL, LARD and a

simple RR approach. PPBL was selected because it is an architecture that is

designed for a shared memory environment, LARD because it outperforms most

other clustered Web server architectures [APS98], and a RR system to act as a

benchmark for the other two architectures. We are not interested in achieving a

realistic throughput of requests per second, but rather in examining whether the

Distributed Shared Memory (DSM) architecture is a useful one for implementing a

clustered Web server.

In all of the architectures we have examined, there has been no attempt at

managing the “global” cache in any way. That is to say that when a node decides to

process a request for a document, the fetching of that document is done either from

the node’s cache or from a local disk. Each cache is local and ignorant of the

contents of the other caches in the cluster. This ignorance is justified by LARD and


PPBL by taking advantage of the content of requests before allocating them to nodes.

Therefore a node’s cache is solely interested in a sub-set of the total working set of

that Web server. Seen in this way, it does not seem necessary to increase the

complexity of the cache management algorithm by attempting to have a more

cooperative approach to the management of the caches on the cluster.

The crux of the problem is whether a DSM based cluster architecture could benefit

from a more globally-aware caching strategy, given the great optimisations that can

be gained from using remote memory access instead of disk access. To decide upon

this, a simulation to examine the performance of the nodes’ caches and the amount of disk access in a clustered Web server will be implemented. This simulation will

provide us with statistics from which we can determine whether a globally-aware

caching strategy would be of benefit to a clustered Web server using a DSM.

Another possible benefit of a DSM-based clustered Web server is that not all nodes

need to have the entire working set of the Web server in their local disk space. As

long as one node on the cluster has a copy of a document on disk, then any node

can request and read this document into its own cache without too much delay.

As the working set of a Web server can be quite large, this could provide a useful

service to the clustered Web server as a whole.

4 Simulation implementation

The simulation has been implemented in the Java programming language. All the

constructs used in creating the simulations of LARD, PPBL and RR have either been built from scratch or utilise existing constructs from the Java API. The

primary objective of this simulation is to observe the behaviour of the caches in the

three architectures, and whether any of these designs would be suitable for

implementing a clustered Web server using a DSM architecture.

This simulation model is attempting to investigate cache performances on the

nodes in the cluster. Therefore the overall throughput per second achieved in the

simulation is not of interest to us. This statistic would be of interest in a prototype


environment, where Web server applications would be running on the nodes of the

cluster. In this simulation, the nodes only check to see if the document requested is

in the cache. We are also not concerned with the TCP Handoff problem in our

simulation, as there are no real requests (and therefore no attempted connections) to

our simulation. The simulation uses a Web trace to provide a realistic list of

requests that would be put to a Web server, and the performance of the caches in

the clustered Web server is then observed.

The basic process that runs is as follows. A trace (as discussed below in Section 5)

is processed and transformed into a file of Request objects. This file is input as a

stream of objects that is received by the FrontEnd of the cluster. The FrontEnd

implements a distribution algorithm as a thread, called DistribAlgo. The

DistribAlgo decides which Node receives the Request. The Request is

passed to the NodeManager in charge of that request, which will in turn pass it on

to the Node. Each Node in turn implements a CachingAlgo, which also runs as

a thread on that Node. The CachingAlgo is the cache replacement algorithm.

Every Node also contains a Cache, which contains a list of the current documents

stored, and also a StatsObject for each document that has been requested at

that node. The StatsObject contains information such as how many times that

document has been requested, and how many times it was in the cache when

requested (i.e. a cache hit).
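As an indication of the bookkeeping involved, a minimal StatsObject might look as follows in Java; the field and method names are illustrative rather than the exact ones used in the simulation.

    import java.io.Serializable;

    // A sketch of the per-document statistics kept on each Node.
    public class StatsObject implements Serializable {
        final String name;          // document URL
        long timesRequested;        // how often this node saw the document
        long cacheHits;             // how often it was already in the cache

        StatsObject(String name) { this.name = name; }

        void record(boolean hit) {
            timesRequested++;
            if (hit) cacheHits++;
        }

        // The miss ratio reported in the evaluation below.
        double missRatio() {
            return timesRequested == 0 ? 0.0
                 : 1.0 - (double) cacheHits / timesRequested;
        }
    }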


The basic architecture of the cluster is represented by a FrontEnd through which

all Requests are processed. The number of Nodes in the cluster is configurable

as is the size of the Cache on each of the Nodes. In our simulation, the number of nodes contained in the cluster varies from two up to sixteen. The cache sizes have been kept constant throughout the simulations at 1M. As we will see, this produces a small aggregate cache size for the whole cluster compared to the total size of the working set (see Section 5 for more details), especially when the cluster contains a low number of nodes. The small cache sizes mean that the caches are put under a great deal of demand, as each small cache attempts to handle the large variety of requests coming into it.

4.1 RR Simulation

The first simulation that was implemented was the RR distribution architecture,

using the GDS cache replacement algorithm on each node. This architecture was

relatively simple to implement. The FrontEnd implemented the

RRDistribAlgo. This distribution algorithm simply assigned Request objects

as they arrived in the FrontEnd in a round robin fashion.

Figure 6 - Basic Simulation Design


The NodeManager passes the Request to the Node, where the GDSAlgo decides which document to evict: if the requested document is already in the cache, its GDSobject is removed from the PriorityQueue; otherwise the lowest-valued GDSobject is popped from the PriorityQueue to make room. The GDSobject of the requested document is then re-inserted into the PriorityQueue at the appropriate point, depending on its value as calculated according to the GDS formula explored in Section 2.1.6.
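A minimal sketch of such a GDS-managed cache is given below. It follows the standard Greedy Dual-Size bookkeeping, valuing each document at L + cost/size (with a uniform cost of 1 here) and raising the inflation term L to the value of each evicted document; the names are illustrative and the structure is simplified from the simulation's GDSAlgo.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    // A sketch of a GDS-managed per-node cache.
    public class GdsCache {

        static class Entry {
            final String url;
            final long size;
            double value;                              // H(p) = L + cost/size
            Entry(String url, long size, double value) {
                this.url = url; this.size = size; this.value = value;
            }
        }

        private final long capacity;                   // bytes
        private long used;
        private double inflation;                      // the running L term
        private final Map<String, Entry> index = new HashMap<>();
        private final PriorityQueue<Entry> queue =
                new PriorityQueue<>((a, b) -> Double.compare(a.value, b.value));

        GdsCache(long capacity) { this.capacity = capacity; }

        // Returns true on a cache hit; on a miss the document is brought in,
        // evicting the lowest-valued entries until it fits.
        boolean access(String url, long size) {
            Entry e = index.get(url);
            if (e != null) {
                queue.remove(e);                       // re-insert at its new value
                e.value = inflation + 1.0 / e.size;
                queue.add(e);
                return true;
            }
            while (used + size > capacity && !queue.isEmpty()) {
                Entry victim = queue.poll();           // smallest H(p) leaves first
                inflation = victim.value;              // L <- H(victim)
                index.remove(victim.url);
                used -= victim.size;
            }
            e = new Entry(url, size, inflation + 1.0 / size);
            index.put(url, e);
            queue.add(e);
            used += size;
            return false;
        }
    }

Because the inflation term only ever increases, recently touched documents always carry higher values than long-untouched ones, which is what gives GDS its recency behaviour on top of its size-awareness.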

4.2 LARD Simulation

The LARD simulation demanded a more complex implementation than RR. This is

because of the load balancing measurements needed to run LARD successfully. As

before, the FrontEnd accepts Request objects. However as each Request is a

separate connection, the number of connections is limited to S, as mentioned above

(see Section 2.5.4). The LARD algorithm checks to see which Node has been

assigned this Request. If this Request has not been requested before, then it is

assigned to the least loaded node. This is determined from the

LeastLoadedQueue. The LeastLoadedQueue is a sorted list containing the

load balancing information about all the nodes in the cluster (the current connections to each node - in other words, all the Request objects that have yet to be processed

on that particular Node). Once the Request has been assigned a Node it is

passed onto that node’s NodeManager. Each NodeManager contains a

ConnectionQueue, an AddThread and a RemoveThread.

Figure 7 - Round Robin Architecture

The ConnectionQueue is the list of requests that this node has yet to process. When

the AddThread inserts a new Request into the ConnectionQueue, the

LeastLoadedQueue is updated to reflect this.

The RemoveThread takes out the first Request object from the

ConnectionQueue as they are processed by the node’s cache replacement

algorithm (this is GDS in the case of LARD). As each Request is processed by

the Cache, the relevant statistical information is inserted into that document’s

StatsObject.

The difficulty in implementing this system is configuring the correct values for

Thigh and Tlow. If these figures are too low, then LARD becomes WRR, and if they

are too high, then the system allows nodes to be overloaded while other nodes in

the cluster remain idle. These figures have to be adjusted to the LARD

implementation by trial and error, and depend upon the performance of the system

they are a part of. It is also necessary to keep a count of the total current

connections in the system. This information needs to be accurate and up-to-date, as

the system ceases to accept connections when the upper limit of S is reached. In our

simulation, this information was retrieved by polling the LeastLoadedQueue.

Figure 8 - LARD Architecture

In a more realistic implementation of the LARD architecture, each node will know how many connections it has open, and this information should be easily accessible

by the LARD algorithm.

4.3 PPBL Simulation

This simulation was the most difficult to implement. There is necessarily a lot of synchronisation involved in this architecture due to the distributed Dispatcher module. While this synchronisation utilises the DSM’s synchronisation protocols in

[C01], in this simulation none of these methods are available. As synchronisation

in Java is done through the basic technique of acquiring and releasing of monitors,

this added a lot of complexity to the simulation model. The PPBL simulation

finally produced was not an optimal implementation. However, the model did

avoid deadlock, and so while slow, it still produced valid results, in that the

distributed Dispatcher did dispatch requests to the appropriate nodes, as set forth in

[C01].

Figure 9 - Parallel Pull-Based LRU Architecture


In the PPBL simulation, new Request objects are sent by the FrontEnd to the

AddFactory. Here each request is inserted into an AddRequest thread, and

this thread is put into the IncomingTable. This table contains a list of active threads waiting to be assigned to the appropriate node. The IncomingTable is

periodically scanned by each NodeManager’s CheckWork thread. This thread is

woken up every time a new AddRequest thread is inserted into the

IncomingTable. If CheckWork recognises an AddRequest as a request for

which it is responsible, it claims the request by inserting its node identifier into the

IncomingTable entry. This will allow the AddRequest thread to die, as it is no longer needed. If the AddRequest contains a request that is not yet assigned, then

once all CheckWork threads have examined it, it will assign this Request to the

least loaded node (in other words, the node at the top of the stack) in the LeastLoadedQueue. The next time the appropriate CheckWork thread examines this entry to see if it is completed or not, it will see that this Request has been assigned to its Node, and it will pass that Request along.

The IncomingTable also has a RemoveEmpty thread running. This thread

activates after a certain number of entries have been inserted into the

IncomingTable. This thread will order the AddFactory and CheckWork

threads to lock. When all those threads have locked, the RemoveEmpty thread

will scan through the IncomingTable and remove entries that have been either

claimed by CheckWork threads or have been assigned to a Node and this has

been recognised by the appropriate CheckWork thread. The AddRequest thread in charge of the entry will have already died, as its duties have been completed. Once

the RemoveEmpty thread has finished with the IncomingTable, it wakes all

the threads waiting on the stop lock of the IncomingTable, and goes back to

sleep.

Once the CheckWork thread has claimed a Request from the

IncomingTable, it passes it onto the NodeManager. The NodeManager


then inserts it into the WorkQueue, and updates the load on this node accordingly.

The load information in the PPBL simulation is based upon the total size of

Request objects still to be served by this Node. This is the total size of all

requests that are in the WorkQueue. Each NodeManager also has a DoWork

thread running. This thread is woken when new work is inserted into the

WorkQueue. The DoWork thread passes the Request onto the LRUalgo

thread, which is the cache replacement algorithm used in PPBL. The

StatsObject for that document is then updated.
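The byte-based load measure might be sketched as follows; in the actual simulation this bookkeeping is spread across the NodeManager and its threads, so the names here are illustrative.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // A sketch of the PPBL load measure: a node's load is the total size of
    // the Request objects still waiting in its WorkQueue.
    public class WorkQueueLoad {

        static class Request { final long size; Request(long size) { this.size = size; } }

        private final Queue<Request> workQueue = new ConcurrentLinkedQueue<>();
        private long bytesPending;                 // the node's reported load

        synchronized void add(Request r) {
            workQueue.add(r);
            bytesPending += r.size;                // load rises as work is queued
        }

        synchronized Request takeNext() {
            Request r = workQueue.poll();
            if (r != null) bytesPending -= r.size; // load falls as work completes
            return r;
        }

        synchronized long load() { return bytesPending; }
    }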

This PPBL simulation is somewhat simplified from the implementation put forth in

[C01]. However, these simplifications have not affected the dispatching algorithm of PPBL at all; they only simplify the synchronisation optimisations suggested in [C01]. As was suggested in [C01], we envisage that in order to

build an efficient and scalable application, a good knowledge of the mechanisms

involved in the middleware of the DSM system is necessary.

5 Web traces, examination of input and statistics

Trace-driven simulation is used to evaluate the performance achieved with LARD,

PPBL and RR. The workload for our simulations was drawn from empirical traces,

obtained from access logs at two Web servers [WL01]. The first trace is called the

EPA trace in the rest of this paper. This is the access log of a day at a busy EPA

WWW Server. The second trace is called the SDSC trace. This is the access log of

a day at a busy San Diego Supercomputing Centre Web server. The distribution of

requests in the traces can be seen in Figures 10 and 11. In each graph the x-axis

(“Times Requested”) represents the number of times that a certain document was

requested by the trace, and the y-axis (“Total Documents”) represents the total

number of documents that were requested x times. In the EPA trace, a total of 242

documents were accessed more than 20 times, and in the SDSC trace, 208

documents were accessed more than 50 times.


Figure 10 - EPA Trace Topology

Figure 11 - SDSC Trace Topology

The EPA trace had a total of 44159 requests, which accounted for a set of 4860

documents. The total size of all 4860 documents (the working set) requested in the

trace was 248M. The SDSC trace had a total of 63080 requests, accounting for a set

of 1864 documents. The total size of the 1864 documents was 102M.

By using two traces with different characteristics we will get a good idea of how

the simulated clustered Web servers will act in general. The traces provide the

time, name and size in bytes of every request. This information is transformed into

a stream of Request objects containing the relevant name, time of the request and

the size of the request, which the simulation will process. CGI scripts were removed

from the trace before processing, as these dynamic requests would skew the results

of our simulation. This is because they require CPU time from the Web server, and

this cannot be simulated satisfactorily in our simulation. Any unique URL requests


are also removed from the trace (URLs with a “?” in them for example, which are

supplying parameters for the Web server to act upon).
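A sketch of this preprocessing step is given below, assuming the traces are in the Common Log Format; the field positions are assumptions and may differ for a given log.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // A sketch of the trace preprocessing step, assuming Common Log Format
    // lines such as: host - - [date] "GET /url HTTP/1.0" status bytes.
    public class TraceFilter {

        static class Request {
            final String url, time;
            final long size;
            Request(String url, String time, long size) {
                this.url = url; this.time = time; this.size = size;
            }
        }

        static List<Request> load(String path) throws IOException {
            List<Request> requests = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\\s+");
                    if (f.length < 10) continue;              // malformed line
                    String url = f[6];                        // assumed field position
                    // Drop dynamic requests: CGI scripts and parameterised URLs.
                    if (url.contains("/cgi-bin/") || url.contains("?")) continue;
                    long size;
                    try { size = Long.parseLong(f[9]); }      // "-" marks failed requests
                    catch (NumberFormatException e) { continue; }
                    requests.add(new Request(url, f[3], size));
                }
            }
            return requests;
        }
    }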

6 Evaluation of results

The simulations calculate the percentage of cache misses. In a Web server, these

cache misses will result in a disk access. As expected, the RR cache performance

deteriorates as the number of nodes increases. This is consistent with results found

in other simulations [APS98]. LARD also performs as predicted in [APS98],

although the performance of LARD in the EPA trace does not improve as much as

expected with the increasing number of nodes. Overall, however, the LARD

simulation can be seen to behave as expected.

Figure 12 - SDSC % Cache Misses

The PPBL simulation performs almost as well as LARD, and as the number of

nodes increases the percentage of cache misses of PPBL approaches that of LARD.

It should also be noted that PPBL was implemented with the sub-optimal LRU

cache replacement strategy, as put forth in [C01]. It is expected that implementing

Parallel Pull-Based Greedy Dual Size (PPB-GDS) will further improve the cache

hit rate.

These preliminary results show that both LARD and PPBL can achieve a highly

satisfactory cache hit ratio. Load balancing in the three simulations can be seen in Figure 14. This graph shows the number of requests processed per node for the EPA trace given a cluster of 8 nodes. The load balance for RR is perfect, as this is the


only requirement of this Request dispatching algorithm. LARD has reasonable load

balancing, as load balancing is a requirement of the system. It can be seen however

that the load balancing of PPBL is quite skewed, with node 1 in the cluster having

received only 3140 requests while node 8 received 6649. This can be explained by

the approach taken to load balancing in PPBL. While it is interesting that the load

balancing decision is made by the node when it receives its request, and thus it can

push that request onto another node if it feels it is overloaded, we do not have the

centralised load balancing of either RR or LARD. LARD also has the added

advantage of being able to limit the number of connections in the cluster at one time, a restriction that might also benefit the performance of PPBL. However, in a realistic implementation of LARD, it should be examined whether the limit placed upon the total connections allowed in the cluster at one time restricts the performance of the cluster, and therefore enforces load balancing at the cost of throughput. The results of [APS98] would indicate that this is not the case.

Figure 13 - EPA % Cache Misses


Figure 14 - Load Balancing Comparisons (load balance for the EPA trace)

7 Conclusions

By simulating a clustered Web server we have learnt a great deal about the issues

that govern this area. While the simulation does not indicate a great deal of

difference between the PPBL and LARD strategies in terms of measured results,

their approaches to the architecture of their clustered Web servers do show the

differences in the two systems. While PPBL is designed with a DSM model in mind, it does introduce a great deal of complexity into the design and implementation of the clustered Web server. Let us first look at the areas of clustered Web servers that need to be focused upon when designing such a system.

An area that needs more research is dynamically created Web documents and Web Application Servers. In particular, more work is needed on how a clustered Web server can optimise the performance of the services it offers. One approach is to

partition the nodes of the cluster into specialist service providers. For this scenario,

a content-aware dispatching algorithm is crucial. It is in this area that the WRR

approach to clustered Web servers falls far short of the content-aware approaches.

At the moment, a WRR cluster cannot deal with a variety of services offered in a

satisfactory way, unless an approach like CAP is taken, but that approach

necessarily involves content-aware decisions. Another approach to optimising the

performance of dynamic and specialised services is to examine how these services

can be cached in a useful fashion. This approach will require more research into the


possibilities of caching dynamic services and perhaps the partitioning of these

services to enhance their locality.

One area that is not fully exploited in [C01] is the potential that a DSM offers to

a clustered Web server in terms of caching. If remote nodes can store data in their

main memory and have that data accessible 50 times quicker than a local disk

access, then perhaps the concept of caching in a DSM cluster needs to be re-

addressed. It might be feasible to have a global cache manager, which is in charge of

the contents of the caches of all nodes. This might allow the clustered Web server

designer to present an aggregate cluster cache that performs better than the current

system of isolated caches, which have improved performance through the use of

locality-aware decision-making in the dispatcher. One possible architecture could use an RR approach with a cache that attempts to aggregate all the caches into

one global cache, and therefore improves the cache performance on a per node

basis while still gaining the benefits of WRR, which are excellent load balancing and the absence of the TCP handoff problem. Another possible improvement might be to utilise

remote-node main memory as much as possible in a DSM-based clustered Web

server. LARD or PPBL would have to be adjusted to achieve this, but this does not

seem to be too much of a change. The issue of looking for a document remotely

will only really occur when a request is swapped to another node because of

excessive demand for that document causing it to be replicated (a “hot” document)

or because a node in the cluster is either overloaded or too idle. In either of these

cases it would be feasible to try and fetch the document from the old serving node’s

cache rather than to access it from disk.

PPBL has put forth the idea of a distributed dispatcher. It has been seen in [C01]

and in simulating PPBL here that increasing the number of nodes in the cluster also

increases the number of participants in the decision-making of the distributed

dispatcher. This is attempted to be solved in [C01] through extensive use of the

middleware that the clustered Web server is overlaying. However in [ASDZ00] it is

argued that a centralised dispatcher is capable of up to 50,000 connections a

second. That is far beyond any throughput achieved in the clustered Web servers

examined in this paper. It would appear that the added complexity and performance


doubts over a distributed dispatcher would argue against this approach, especially

if the Web server will employ a distributed distributor approach to the TCP

Handoff problem.

In conclusion, we have evaluated caching strategies for clustered Web servers. We have explored in detail several clustered Web server architectures, such as

WRR, HACC and CAP. A simulation was run to examine in more detail the behaviour of each individual node’s cache, in the context of LARD and PPBL, with RR as a benchmark for the other two algorithms. We have pinpointed several key issues in

the area of clustered Web servers within the context of a DSM system. These issues

need to be researched in more detail before any implementation of a clustered Web

server using the Kaffemik system should be attempted.


8 Bibliography

[ASAWF95] M. Abrams, C.R. Standridge, G. Abdulla, S. Williams, and E.A. Fox.

Caching Proxies: Limitations and Potentials. Proceedings of the Fourth

International Conference on the WWW, December 1995.

[APB96] E. Anderson, D. Patterson, E. Brewer. The Magicrouter, an Application

of Fast Packet Interposing. Submitted to the Second Symposium on Operating

System Design and Implementation, May 1996.

[APS98] M. Aron et al. Locality-Aware Request Distribution in Cluster-based

Network Services. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[ASDZ00] M. Aron, D. Sanders, P. Druschel, W. Zwaenepoel. Scalable Content-

Aware Request Distribution in Cluster-based Network Servers. Proceedings of the

2000 Annual USENIX Technical Conference, June 2000.

[AWCJC00] J. Andersson, S. Weber, E. Cecchet, C. Jensen, V. Cahill. Kaffemik - A Distributed JVM on a Single Address Space Architecture. Trinity College

Dublin Technical Report, 2000.

[C01] E. Cecchet. Parallel Pull-Based LRU: A Request Distribution Algorithm for

Clustered Web Caches using a DSM for Memory Mapped Networks.

[CC01] E. Casalicchio, M. Colajanni. A Client-Aware Dispatching Algorithm for

Web Clusters Providing Multiple Services. Proceedings of the Tenth International

Conference on the WWW, May 2001.

[CI97] P. Cao, S. Irani. Cost-Aware WWW proxy caching algorithms.

Proceedings of the USENIX Symposium on Internet Technologies and Systems,

December 1997.

[CKP98] E. Cecchet, P. Koch, X. Rousset de Pina. Global Management of

Coherent Shared Memory on an SCI cluster. Proceedings of the First International

Conference on SCI-based Technology and Research, September 1998.

[CRS99] A. Cohen, S. Rangarajan, H. Slye. On the Performance of TCP Splicing

for URL-Aware Redirection. Proceedings of the Second USENIX Symposium on

Internet Technologies and Systems, October 1999.


[D97] O. Damani et al. Techniques for Hosting a Service on a Cluster of Machines.

Proceedings of the Sixth International Conference on the WWW, April 1997.

[F96] M. Feeley. Global Memory Management for Workstation Networks. PhD

Thesis, University of Washington, 1996.

[F97] A. Fox et al. Cluster-Based Scalable Network Services. Proceedings of the

16th ACM Symposium on Operating System Principles, October 1997.

[H99] G. Hunt et al. Network Dispatcher: a Connection Router for Scalable

Internet Services. Computer Networks and ISDN Systems, September 1999.

[IEEE93] IEEE. IEEE Std 1596-1992: IEEE Standard for Scalable Coherent

Interface. IEEE, August 1993.

[KHCP99] P. Koch, J. Hansen, E. Cecchet, X. Rousset de Pina. SCIOS: An SCI-

based Software Distributed Shared Memory. Proceedings of the First Workshop on

Software Distributed Shared Memory, 1999.

[SGR00] T. Schroeder, S. Goddard, B. Ramamurthy. Scalable Web Server

Clustering Technologies. IEEE Network, May/June 2000.

[TRS97] I. Tatarinov, A. Rousskov, V. Soloviev. Static Caching in Web Servers.

Proceedings of the Sixth International Conference on Computer Communications

and Networks, IC3N 1997.

[WL01] Traces obtained from http://www.web-caching.com/traces-logs.html.

[ZBCS99] X. Zhang, M. Barrientos, J. Chen, M. Seltzer. HACC: An Architecture

for Cluster-Based Web Servers. Proceedings of the Third USENIX Windows NT

Symposium. July 1999.