
CHAPTER 2 LITERATURE REVIEW - Shodhgangashodhganga.inflibnet.ac.in/bitstream/10603/35508/7/07_chapter2.pdf · Acharjee (2006) has listed the main advantages of web caching to the



CHAPTER 2

LITERATURE REVIEW

2.1 INTRODUCTION

The size and popularity of WWW systems have grown dramatically in the last couple of decades, and the demand on these systems to respond quickly is also increasing. The increase in web users and web applications has led to increased latency, network congestion and server overloading. Web caching has been widely adopted to alleviate these problems, and various prefetching techniques have been developed to augment the caching efforts. Cooperative caching, an efficient technique to enhance the user experience on the WWW, has been extensively researched in the recent past. This chapter reviews the seminal works carried out by various researchers in web caching in the recent past, with a special focus on information retrieval systems. More precisely, this chapter reviews the extant literature on web caching systems and prefetching techniques. The chapter also discusses recent research on the effectiveness of cooperative caching techniques in reducing user-perceived latencies in web applications.

2.2 WEB CACHING

The main function of a caching system is to store popular web objects that are most likely to be visited in the near future in the client machine or the proxy server (Ali et al 2011). The performance of web-based systems can be improved by employing various web caching techniques. Acharjee (2006) has listed the main advantages of web caching to end users, network managers, and content creators: web caching decreases user-perceived latency, reduces network bandwidth usage, and reduces the load on origin servers.

Cache replacement is the core of web caching systems, and hence the design of efficient cache replacement algorithms is vitally important to achieving a highly sophisticated caching mechanism (Chen 2007). Considering the importance of cache replacement algorithms, they are more popularly called web caching algorithms (Koskela et al 2003).

2.2.1 Web Caching Algorithms

A. Least-Recently-Used (LRU) Algorithm

The Least-Recently-Used (LRU) algorithm is the simplest and most commonly used cache management approach. LRU removes the least recently accessed objects so that sufficient space is made available for new objects. LRU is easy to implement and is mostly suitable for uniform-size objects, as in a traditional memory cache. However, LRU does not consider the size of an object or its download latency, so it is not suitable for direct use in web caching (Koskela et al 2003).
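The eviction behavior described above can be sketched as follows; this is a minimal illustration rather than a production web cache, and the class and variable names are our own:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently accessed object.

    Illustrative sketch only; real web caches also track object size,
    freshness and validation, which plain LRU ignores.
    """

    def __init__(self, capacity):
        self.capacity = capacity          # maximum number of objects
        self.store = OrderedDict()        # iteration order = recency order

    def get(self, url):
        if url not in self.store:
            return None                   # cache miss
        self.store.move_to_end(url)       # mark as most recently used
        return self.store[url]

    def put(self, url, obj):
        if url in self.store:
            self.store.move_to_end(url)
        self.store[url] = obj
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("/a", "page A")
cache.put("/b", "page B")
cache.get("/a")            # "/a" becomes most recently used
cache.put("/c", "page C")  # evicts "/b", the least recently used
```

Note that the eviction decision here depends only on recency, which is exactly why plain LRU mishandles variable-size web objects.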

B. Least-Frequently-Used (LFU)

In the Least-Frequently-Used (LFU) algorithm, the objects with the least number of accesses are replaced. More precisely, LFU keeps the more popular web objects and evicts the rarely used ones. However, the drawback of LFU is that objects with large reference counts are never replaced, even if they are not requested again.
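A minimal sketch of LFU eviction, assuming untied access counts (ties are broken arbitrarily); the counts are never aged, which reproduces the drawback just noted:

```python
from collections import Counter

class LFUCache:
    """Minimal LFU cache: evicts the object with the fewest accesses.

    Illustrative sketch; because counts are never aged, a once-popular
    object can occupy the cache even if it is never requested again.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}
        self.hits = Counter()             # access count per URL

    def get(self, url):
        if url not in self.store:
            return None                   # cache miss
        self.hits[url] += 1
        return self.store[url]

    def put(self, url, obj):
        if url not in self.store and len(self.store) >= self.capacity:
            victim = min(self.store, key=lambda u: self.hits[u])
            del self.store[victim]        # evict least frequently used
        self.store[url] = obj
        self.hits[url] += 1

cache = LFUCache(capacity=2)
cache.put("/a", "page A")
cache.get("/a")
cache.get("/a")            # "/a" now has three accesses
cache.put("/b", "page B")  # "/b" has one access
cache.put("/c", "page C")  # evicts "/b", the least frequently used
```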


C. SIZE Policy

The SIZE policy is one of the common web caching approaches; it replaces the largest object(s) in the cache when space is needed for a new object. It yields a high cache hit ratio. The drawback of this approach is that the cache often fills with small objects that are mostly never accessed again, so the scheme has a low byte hit ratio.

D. Greedy-Dual-Size (GDS) Policy

Cao & Irani (1997) have suggested the Greedy-Dual-Size (GDS) policy as an extension to the SIZE policy. The algorithm integrates several factors and assigns a key value, or priority, to each web object stored in the cache. When the cache becomes full and a new object needs to be stored, the object with the lowest key value is removed. They have shown that the GDS algorithm achieves better performance than other traditional caching algorithms. When a user requests an object p, the GDS algorithm assigns the key value K(p) of object p as shown in Equation 2.1.

K(p) = L + C(p) / S(p)                                               (2.1)

where C(p) is the cost of fetching object p from the server into the cache, S(p) is the size of object p, and L is an aging factor. L starts at 0 and is updated to the key value of the last replaced object. The key value K(p) of object p is recomputed using the new L value whenever object p is accessed again. Thus, larger key values are assigned to objects that have been visited recently. The major drawback of the GDS algorithm is that it ignores the usage frequency of web objects.


E. Greedy-Dual-Size-Frequency (GDSF)

Cherkasova (1998) has developed the Greedy-Dual-Size-Frequency (GDSF) algorithm as an enhancement to the GDS algorithm. The design integrates a frequency factor into the key value K(p), as shown in Equation 2.2.

K(p) = L + F(p) * C(p) / S(p)                                        (2.2)

where F(p) is the frequency of visits to object p. When p is first requested by the user, F(p) is initialized to 1. If p is already in the cache, its frequency is increased by one. The drawback of the GDSF algorithm is that it does not take into account the predicted accesses in the future.
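Equations 2.1 and 2.2 can be illustrated with the following sketch. The cache structure, the trace values and the choice C(p) = 1 (a common hit-ratio-oriented variant) are assumptions for illustration, not part of the original algorithms' specification:

```python
def gds_key(L, cost, size):
    # Equation 2.1: K(p) = L + C(p) / S(p)
    return L + cost / size

def gdsf_key(L, freq, cost, size):
    # Equation 2.2: K(p) = L + F(p) * C(p) / S(p)
    return L + freq * cost / size

class GDSFCache:
    """Sketch of GDSF replacement over a fixed number of cache slots.

    C(p) and S(p) are assumed to be supplied per request; evicted
    objects' frequencies are discarded here for simplicity.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.key = {}    # url -> current key value K(p)
        self.freq = {}   # url -> access frequency F(p)
        self.L = 0.0     # aging factor, starts at 0

    def access(self, url, cost=1.0, size=1.0):
        self.freq[url] = self.freq.get(url, 0) + 1
        if url not in self.key and len(self.key) >= self.capacity:
            victim = min(self.key, key=self.key.get)  # lowest key value
            self.L = self.key.pop(victim)             # L <- evicted key
            del self.freq[victim]
        self.key[url] = gdsf_key(self.L, self.freq[url], cost, size)

c = GDSFCache(capacity=1)
c.access("/a")             # K(/a) = 0 + 1*1/1 = 1.0
c.access("/b", size=2.0)   # evicts "/a"; L = 1.0; K(/b) = 1.0 + 1*1/2 = 1.5
```

Dropping the F(p) factor (i.e. calling gds_key instead) recovers plain GDS, which is exactly the frequency-blindness criticized above.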

2.2.2 Studies Based on Web Caching

Ali et al (2012) have proposed a new approach that incorporates machine learning techniques, namely support vector machine (SVM) and decision tree (C4.5) classifiers, into conventional web proxy caching techniques such as Least-Recently-Used (LRU), Greedy-Dual-Size (GDS) and Greedy-Dual-Size-Frequency (GDSF). As a result, three intelligent caching approaches were obtained, known as SVM-LRU, SVM-GDSF and C4.5-GDS. These intelligent web proxy caching approaches were employed for making cache replacement decisions. More specifically, the conventional web proxy caching approaches were extended using machine learning to enable the algorithms to adapt intelligently over time.

Ali et al (2012) have effectively combined factors such as recency, frequency, size, access latency and type of object using an intelligent classifier to predict whether objects will be requested again in the future. This information was then incorporated into the traditional web proxy caching algorithms to produce novel intelligent web proxy caching approaches with good performance in terms of hit ratio (HR) and byte hit ratio (BHR).

Ali et al (2012) have also benchmarked the SVM and C4.5 classifiers against both the Back-Propagation Neural Network (BPNN) and the Adaptive Neuro-Fuzzy Inference System (ANFIS). They evaluated the intelligent approaches using a trace-driven simulator built to meet the requirements of proxy caching. The simulation results were compared with other relevant web proxy caching policies. Experimental results revealed that SVM-LRU, SVM-GDSF and C4.5-GDS significantly improve the performance of LRU, GDSF and GDS respectively. Furthermore, C4.5-GDS achieved the best HR among all algorithms across the proxy datasets, SVM-LRU achieved the best BHR, and SVM-GDSF achieved the best balance between HR and BHR.

One limitation of the approaches of Ali et al (2012) is that the classifiers have to be trained continuously to ensure effective web caching. Another is the computational overhead of preparing the target outputs in the training phase, since future requests must be examined. A good alternative could be the use of clustering algorithms to enhance the performance of web caching policies, since clustering algorithms do not need any preparation of target outputs.

Yates et al (2007) have studied the trade-offs in designing efficient caching systems for web search engines. They explored the impact of different caching approaches, such as static vs. dynamic caching and caching query results vs. caching posting lists. The data for the study consisted of a crawl of documents from the UK domain and logs of one year of queries. Yates et al (2007) demonstrated that caching posting lists can achieve higher hit rates than caching query answers, and suggested an algorithm for static caching of posting lists. One of their major contributions is the design of an optimal way to split the static cache between answers and posting lists. They also measured the impact of changes in the query log on the effectiveness of static caching. Yates et al (2007) compared the performance of their algorithm with that of static caching algorithms as well as dynamic algorithms such as LRU and LFU, and confirmed that the new algorithm outperforms the others in terms of hit rate.

According to Wong (2006), replacement decisions in web caching can be affected by various factors: recency, which measures the time since the last reference to the object; frequency of requests to an object; size of the web object; cost of fetching the object; and access latency of the object. Based on these factors, web cache replacement policies are classified into five categories, namely recency-based, frequency-based, size-based, function-based and randomized policies.

Web caches have been considered effective tools for accessing web pages with less latency and are perceived as an important mechanism for reducing bandwidth usage. However, with the explosive growth in web-related technologies and user interactions, traditional web caching systems have required several modifications. Several disadvantages of web caching systems have been reported in the literature: caching offers slower performance if the resource is not found in the cache; caches often store a stale copy of a resource and supply it to the user even when an updated copy is needed; and cache misses sometimes occur because of data losses.


2.3 PREFETCHING

Web caching has been successful in reducing network and I/O bandwidth consumption, but it still suffers from low hit rates, stale data and inefficient resource management (Sharma & Goel 2010). Moreover, caching cannot prove beneficial for web pages that were not visited in the past. Hence, web prefetching techniques were suggested to overcome these limitations of web caching mechanisms by fetching content before it is actually requested by the user. Web prefetching predicts the web objects that are expected to be requested by users in the near future, even though these objects have not yet been requested. The predicted objects are usually fetched from the origin server and stored in a cache. In this way, web prefetching helps increase cache hits and reduce user-perceived latency (Ali et al 2011). Thus prefetching, in conjunction with caching, can cater to the needs of WWW users in more than one aspect. Research interest in prefetching has increased in recent years. The following section discusses the literature related to prefetching, with a special focus on the present work.

2.3.1 Types of Web Prefetching

Based on location, prefetching techniques can be implemented on the client side, the server side, or the proxy side (Zhijie et al 2009). The difference between the various prefetching techniques lies in the navigation patterns considered. Client-based prefetching concentrates on a single user across many web servers. Server-based prefetching concentrates on all users accessing a single website. Proxy-based prefetching concentrates on the navigation patterns of a group of users across many web servers (Ali et al 2011).


2.3.2 Approaches to Web Prefetching

Existing prefetching algorithms can be broadly classified into two main categories according to the data used for prediction: content-based prefetching and history-based prefetching.

A. Content-Based Prefetching

The content-based prefetching approach predicts future user requests by analyzing web page contents to find Hyper Text Markup Language (HTML) links that are likely to be followed by clients. Some content-based prefetching approaches have used Artificial Neural Networks (ANN) to predict future requests based on the keywords in the anchor text of a URL. The keywords extracted from web documents are given as inputs to the ANN, which predicts whether the URL needs to be prefetched or not. The major drawback of content-based prefetching techniques is the high processing load of parsing every web page served, and hence they are not recommended for implementation on the server side (Domenech et al 2010).

B. History-Based Prefetching

History-based prefetching predicts future user requests based on the page access behavior observed in the past. The algorithms of this category can be classified into four approaches: those based on the Dependency Graph (DG), on Markov models, on cost functions, and on data mining (Zhijie et al 2009).


2.3.3 Studies on Web Prefetching Techniques

Padmanabhan & Mogul (1996) have proposed a dependency graph algorithm for predicting the next page. The nodes of the dependency graph represent web pages, while an arc indicates that the target node is accessed after the source node within a sliding window. The weight assigned to an arc represents the probability that the target node will be the next one accessed. The DG-based prefetching approach predicts and prefetches the nodes whose arcs connect to the currently accessed node and have weights higher than a threshold. Although web prefetching based on DG can help reduce latency, it increases network traffic. Another drawback of the DG approach is its low prediction accuracy, because it examines only pairwise dependencies between two web pages.
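A minimal sketch of the dependency graph idea just described; the window size, threshold value and page names are illustrative assumptions, not those of the original paper:

```python
from collections import defaultdict

class DependencyGraph:
    """Sketch of DG-based next-page prediction (after Padmanabhan &
    Mogul 1996). An arc A -> B is reinforced whenever B is accessed
    within `window` requests after A; the arc weight is the arc count
    divided by the number of accesses to A.
    """

    def __init__(self, window=2):
        self.window = window
        self.arcs = defaultdict(lambda: defaultdict(int))  # arc counts
        self.count = defaultdict(int)                      # node counts
        self.trace = []                                    # access history

    def record(self, page):
        for prev in self.trace[-self.window:]:             # sliding window
            if prev != page:
                self.arcs[prev][page] += 1
        self.count[page] += 1
        self.trace.append(page)

    def predict(self, page, threshold=0.5):
        """Pages reachable from `page` with arc weight above threshold;
        these are the candidates for prefetching."""
        if self.count[page] == 0:
            return []
        return [nxt for nxt, n in self.arcs[page].items()
                if n / self.count[page] > threshold]

dg = DependencyGraph(window=2)
for page in ["/a", "/b", "/a", "/b", "/a", "/c"]:
    dg.record(page)
```

Because only pairwise arcs are kept, longer navigation sequences are invisible to the model, which is the low-accuracy drawback noted above.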

Lan et al (2002) have developed an Apriori-based mining method to deduce a rule table for predicting and prefetching the most likely documents into a proxy buffer. However, too many rules are produced and maintained in the rule table, which increases complexity.

Xu et al (2004) have proposed a keyword-based semantic prefetching approach that predicts future requests based on the semantic preferences of previously retrieved objects, rather than on the temporal relationships between web objects. More precisely, semantic prefetching techniques capture clients' surfing interests from their past access patterns and predict future preferences from a list of possible objects when a new web site is visited. Xu et al (2004) employed an ANN to predict future requests based on the keywords in the anchor text of a URL; the keywords extracted from web documents were given as inputs to the ANN, which predicts whether the URL needs to be prefetched or not.


Domenech et al (2010) have stressed that the structure of current web pages has to be taken into account when predicting web objects in order to reduce user-perceived latency in prefetching techniques. They observe that an HTML object typically has a noticeable number of embedded objects. They therefore propose a Double Dependency Graph (DDG) algorithm that considers the characteristics of current web sites by distinguishing between container objects and embedded objects to create a new prediction model. Their model achieves better latency reduction while decreasing the need for resources such as extra bandwidth and extra server load.

2.3.4 Clustering Based Prefetching

Pallis et al (2008) have introduced a clustering-based prefetching scheme which efficiently integrates caching and prefetching approaches to improve the performance of the web infrastructure. The main advantage of adopting prefetching policies over a proxy cache server is that the web content can be managed effectively by exploiting the temporal as well as the spatial locality of objects (Pallis et al 2008). In this scheme, user requests were represented using a Web Navigational Graph (WNG). More specifically, a graph-based clustering algorithm was used to identify clusters of web pages based on users' access patterns. The scheme can be easily integrated into a web proxy server to improve its performance.

A clustering-based prefetching scheme, according to Pallis et al (2008), can be efficiently employed to identify clusters of correlated web pages from users' access patterns, and the clustered pages may belong to different web sites. Upon a user request, the relevant clusters were selected and fetched by the proxy cache. In addition, a cache replacement policy was used by the proxy servers to manage their content. Pallis et al (2008) have introduced two algorithms, viz. clustWeb, for clustering inter-site web pages, and clustPref, which provides a clustering-based short-term prefetching scheme.

The algorithms were simulated and evaluated using web cache traces provided by a proxy cache server (Squid). To evaluate the scheme, Pallis et al (2008) used two performance metrics, namely Hit Rate (HR) and Byte Hit Rate (BHR). The simulation results showed that the integrated framework was robust and effective in improving the performance of the web caching environment.

Pallis et al (2008) have stated that the application of the clustWeb algorithm can be extended to areas such as discovering usage patterns and profiles, detecting copyright violations, and reporting search results. Similarly, the efficiency of the prefetching scheme can be compared with other clustering algorithms.

Sharma et al (2009) have introduced a clustering approach based on rough set clustering to form clusters of sessions. In this approach, data acquired from past experience is classified as uncertain, imprecise or incomplete information. Using rough set clustering, only meaningful sessions, in which users spend their quality time, are retained. The authors developed an RST algorithm based on the concept of rough sets to calculate equivalence between objects and then find the lower and upper approximations. The lower approximation is the union of all equivalence classes that are contained in the target set, which is generally specified by the user. The upper approximation is the union of all equivalence classes that have a non-empty intersection with the target set.


Sharma et al (2009) have also developed the Prediction Prefetching Engine (PPE), which can reside at a proxy server. The function of the PPE is to match users' requests for web pages against the existing rough set clusters and then decide whether to prefetch a page or not. One advantage of RST clustering is that it feeds only meaningful sessions of the web log to the rule generator phase of the PPE, and hence the complexity of the PPE is also reduced.

Ahmad et al (2011) have presented an optimized predictive prefetching technique based on clustering. Clusters of similar pages are created using a web log data file, and a prediction algorithm is then employed on these clusters. The authors optimized the prediction by taking into account the frequency of each predicted cluster to calculate the percentage of each web object. The frequency of each page's usage in each cluster can be determined using association rules. Ahmad et al (2011) also compared their results with the existing technique of Yang et al (2001). Overall performance was calculated as the summation of the percentages of each web object, and the results showed higher prediction accuracy.

Sathiyamoorthi & Bhaskaran (2012) have developed a clustering approach based on a modified ART1 neural network to prefetch web pages into the proxy cache. The modified ART1 clustering algorithm groups users based on their web access patterns. The advantage of using the modified ART1 algorithm is that it adapts to changes in users' web access patterns over time without losing information about their past access patterns.

Sathiyamoorthi & Bhaskaran (2012) have conducted several experiments to empirically compare the modified ART1-based clustering approach with the existing ART1-based prefetching technique. The metrics used in the comparison were the average inter-cluster and intra-cluster distances over all the data sets. The overall results indicated that the cache hit rate increases with their approach, and hence user-perceived latency was reduced to a great extent.

Poornalatha & Raghavendra (2012) have suggested a new approach that integrates the work of Hay et al (2004) on clustering with a distance measure technique and that of Poornalatha & Raghavendra (2011a) on sequence alignment for computing similarities between clusters. In this model, user sessions are first created from the web access logs based on various factors such as IP address, date and time. A modified k-means algorithm (Poornalatha & Raghavendra 2011b) is used to cluster the sessions. When a user makes a request for a web page, the cluster nearest to the requested page is determined by measuring the distance to all cluster centers, and the next page in that cluster is retrieved. In addition, the number of sessions in which the next page follows the requested page in the cluster is counted, and based on this frequency the top n pages are selected for the prediction list (Poornalatha & Raghavendra 2012).

Ali et al (2011) have highlighted the major drawback of prefetching-enhanced systems by noting that some prefetched objects might never actually be requested by users. In such cases, the prefetching scheme ultimately increases the network traffic as well as the load on the web server, and cache space is not used optimally. Hence they suggested that prefetching approaches should be designed carefully in order to overcome these limitations.

2.4 INTEGRATING WEB CACHING AND PREFETCHING TECHNIQUES

The performance of the WWW system can be significantly improved by properly integrating web proxy caching and prefetching techniques. The integration can enhance web performance by exploiting the temporal locality used in web proxy caching and the spatial locality used in prefetching web objects. In addition, the combination of caching and prefetching helps improve the hit ratio and reduce user-perceived latency. Basically, web prefetching requires two steps: anticipating the pages users will visit and preloading them into a cache. This means that web prefetching also involves caching. However, web caching and prefetching were addressed separately by many researchers in the past; it is important to consider the impact of the two techniques combined. Few studies have discussed the integration of web caching and web prefetching together.

One of the earliest studies on integrating web caching and prefetching was carried out by Kroeger et al (1997), who studied the effect of combining caching and prefetching on end-user latency. They concluded that the combination of web caching and prefetching can improve latency by up to 60%, whereas web caching alone improves latency by only up to 26%.

Yang et al (2001) have advocated mining web logs to obtain web-document access patterns. These patterns were then used to extend the GDSF caching policy and prefetching policies.

Teng et al (2005) have suggested a cache replacement algorithm for Integrating Web Caching and web Prefetching in client-side proxies (IWCP). Using a normalized profit function, they evaluated the profit of caching an object according to a given prefetching rule.

Ibrahim & Xu (2004) and Acharjee (2006) have used ANNs in both the prefetching policy and the web cache removal decision. This approach depends on the keywords of URL anchor text to predict the user's future requests. However, the most significant factors (recency and frequency) were ignored in the web cache replacement decision. Moreover, since the keywords extracted from web documents are given as inputs to the ANN, applying an ANN in this way may cause extra overhead on the server.

Balamash et al (2007) have analyzed the effects of integrating web caching and prefetching techniques and developed a mathematical model to establish the conditions under which prefetching reduces the average response time of a requested document. The model accommodates both passive client and proxy caching along with prefetching. They contended that prefetching never degrades the effectiveness of passive caching and advocated that both can coexist in the same system. As an outcome of the analysis, Balamash et al (2007) derived an expression for the prefetching threshold that can be set dynamically to optimize the effectiveness of prefetching. They also introduced a prefetching protocol based on the analytical results for optimizing the prefetching gain, and conducted a number of investigations into the effect of the caching system on the effectiveness of prefetching.

The study observed that the high variability in web file sizes limits the effectiveness of prefetching. One of the assumptions of Balamash et al (2007) is that each client runs one browsing session at a time. Although the one-session assumption is generally acceptable for clients with low-bandwidth connections, further work is necessary to study the impact of clients running multiple sessions over high-bandwidth connections.

Jin et al (2007) have suggested a set of algorithms for integrating web caching and prefetching in wireless local area networks. As part of this work, they developed a sequence-mining-based prediction algorithm, a context-aware prefetching algorithm and a profit-driven cache replacement policy.


Sulaiman et al (2009) have developed a framework for combining web caching and prefetching in a mobile environment. Using a combination of an Artificial Neural Network (ANN) and Particle Swarm Optimization (PSO) for web object classification, they developed a hybrid technique (Rough Neuro-PSO). A rough set technique was then used to generate rules from log data on the proxy server. On the prefetching side, an XML-based prefetching approach was suggested for implementation on mobile devices to handle communication between client and server.

2.5 COOPERATIVE CACHING

The major drawback of single-proxy systems is their low hit rate, which can be improved by coordination among caches. Hence cooperative web caching was developed as an advanced caching scheme to achieve effective and efficient cooperation among caches. The main goals of cooperative caching mechanisms are to reduce the load on the server and to improve client-perceived latencies. According to Anderson et al (1996), sharing and coordination of cache state among multiple communicating caches using cooperative caching has been shown to improve the performance of file and virtual memory systems in a high-speed, local-area network environment.

2.5.1 Cooperative Caching Mechanisms

A number of studies have been conducted to determine efficient mechanisms for cooperative caching in web applications. Some of the salient features of the most common cooperative caching mechanisms, namely hierarchical caching, distributed caching and hybrid caching, are discussed in this section.


In hierarchical caching, there are three levels of caches, namely the client, proxy and server levels. A proxy cache can be considered the parent of some client caches, and a server cache the parent of some proxy caches. A client is connected to one of the caches, which then becomes the default cache for that client. When a request

from the client for data is not satisfied by the default cache, it is redirected to

the parent cache and the parent cache can in turn forward its unsatisfied

requests to its parent cache. If the document is not found at any cache level,

the upper level proxy cache contacts the original server directly. When the

document is found, either at a cache or at the original server, it travels down

the hierarchy, and each of the intermediate caches along its path makes the

decision whether a copy of the document should be cached locally or not,

based on the cache content update algorithm used (Chankhunthod et al 1996).
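The lookup-and-copy-down behaviour described above can be sketched as follows. This is an illustrative toy, not code from the cited works; names such as `CacheNode` and `should_cache` are hypothetical.

```python
# Hypothetical sketch of a hierarchical cache lookup: a miss is forwarded up
# the parent chain, the top level contacts the origin server, and each
# intermediate cache decides whether to keep a copy on the way down.

class CacheNode:
    def __init__(self, name, parent=None, should_cache=lambda key: True):
        self.name = name
        self.parent = parent          # next level up; None for the top level
        self.store = {}               # locally cached objects
        self.should_cache = should_cache   # stands in for the update algorithm

    def get(self, key, fetch_from_origin):
        if key in self.store:                 # hit at this level
            return self.store[key]
        if self.parent is not None:           # miss: redirect to the parent
            value = self.parent.get(key, fetch_from_origin)
        else:                                 # top level contacts the origin
            value = fetch_from_origin(key)
        if self.should_cache(key):            # cache on the way down
            self.store[key] = value
        return value

origin = {"/index.html": "<html>home</html>"}
server_cache = CacheNode("server")
proxy_cache = CacheNode("proxy", parent=server_cache)
client_cache = CacheNode("client", parent=proxy_cache)

page = client_cache.get("/index.html", origin.__getitem__)
# After the first request, the object is replicated at every level on its path.
```

With the default `should_cache`, a second request for the same object is served entirely from the client-level cache.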

There are no intermediate caches in distributed caching. Hence, whenever a miss is encountered, the client and the cooperating caches rely on other mechanisms to retrieve the missed document. Some of these mechanisms include the Internet Cache Protocol (ICP), the Cache Array Routing Protocol, Summary Cache and Cache Digest (Povey & Harrison 1997, Tewari et al 1999).
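A miss-resolution mechanism in the spirit of Summary Cache and Cache Digests can be sketched as follows. The `Peer`, `digest` and `locate` names are hypothetical, and the digest here is a simplified Bloom-style summary; real digests use multiple hash functions and periodic exchange.

```python
# Hedged sketch: each peer publishes a compact summary of its contents, and a
# cache consults the summaries before forwarding a miss to a peer.
import hashlib

def digest(keys, bits=1024):
    """Compact Bloom-style summary: the set of hash buckets hit by the keys."""
    return {int(hashlib.sha1(k.encode()).hexdigest(), 16) % bits for k in keys}

class Peer:
    def __init__(self, name):
        self.name = name
        self.store = {}          # locally cached URL -> object

    def summary(self):
        return digest(self.store)

def locate(key, peers, bits=1024):
    """Peers whose summary suggests they may hold `key` (false positives possible)."""
    bucket = int(hashlib.sha1(key.encode()).hexdigest(), 16) % bits
    return [p for p in peers if bucket in p.summary()]

peers = [Peer("A"), Peer("B")]
peers[0].store["x.html"] = "<html>x</html>"
candidates = locate("x.html", peers)   # peer A is always among the candidates
```

Because the summary is lossy, a candidate peer may turn out not to hold the object, but a peer that does hold it is never missed.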

Finally, in the case of hybrid caching, caches may cooperate with other caches at the same level or at a higher level using distributed caching, so that

the document is fetched from a parent or neighbor cache that has the lowest

round trip time (RTT) (Rabinovich et al 1998).
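The hybrid fetch decision can be sketched as a simple selection over measured RTTs. The structures below are illustrative assumptions, not the cited authors' code.

```python
# Hypothetical sketch: among the parent and same-level neighbours that hold a
# copy of the document, fetch from the one with the lowest round-trip time.
def choose_source(object_id, candidates, rtt_ms):
    holders = [c for c in candidates if object_id in c["contents"]]
    if not holders:
        return None                   # no nearby copy: go to the origin server
    return min(holders, key=lambda c: rtt_ms[c["name"]])

caches = [
    {"name": "parent",     "contents": {"a.css", "b.js"}},
    {"name": "neighbor-1", "contents": {"a.css"}},
    {"name": "neighbor-2", "contents": set()},
]
rtt = {"parent": 40.0, "neighbor-1": 12.0, "neighbor-2": 8.0}
best = choose_source("a.css", caches, rtt)   # neighbor-1: closest holder
```

Note that the nearest cache overall (neighbor-2) is skipped because it has no copy; only holders compete on RTT.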

2.5.2 Cooperative Caching Algorithms

Dahlin et al (2004) have reviewed the various cooperative caching

algorithms and presented the results of the comparison of four algorithms

namely Direct Client Cooperation, Greedy Forwarding, Centrally Coordinated

Caching and N-Chance Forwarding.


In Direct Client Cooperation, each active client uses the memory of idle machines as a private remote cache. When an active client's local cache overflows, it forwards the evicted cache entries directly to an idle machine. The active client can then access the private remote cache of the idle machine to fulfill its read requests until the remote machine becomes active and evicts the cooperative cache (Dahlin et al 2004).

In the Greedy Forwarding approach, the cache memories of all the clients in the system are considered a global resource that can be accessed to fulfill any client's request. However, the approach lacks coordination among the contents of the cache memories, which ultimately causes unnecessary data duplication (Dahlin et al 2004).

A Centrally Coordinated Caching scheme improves upon the greedy algorithm by adding coordination among the cache memories. Here, each client's cache is partitioned into a locally managed section, managed greedily by that client, and a globally managed section, coordinated by the server as an extension of its central cache (Dahlin et al 2004). Centrally Coordinated Caching achieves a high global hit rate because of its global management of the bulk of the memory resources. The main drawbacks are that the clients' local hit rates are reduced, since the local caches are effectively made smaller, and that the central coordination may impose a significant load on the server (Dahlin et al 2004).

In the case of N-Chance Forwarding, the fraction of each client's cache that is managed cooperatively is adjusted dynamically depending on the client activity. In effect, the N-Chance algorithm modifies Greedy Forwarding so that clients cooperate to preferentially cache singlets, which are blocks stored in only one client cache. The advantage of N-Chance Forwarding is that it provides a simple dynamic trade-off between each client's locally cached data and the data being cached in the overall system. A major disadvantage of the N-Chance Forwarding approach is that it produces unnecessary system load, because a block may be bounced among multiple caches while living in the cooperative portion of the caches (Dahlin et al 2004).
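The recirculation idea behind N-Chance Forwarding can be sketched as follows, assuming a hypothetical `evict` routine and a copy-count table. This illustrates the policy only; it is not Dahlin et al's implementation.

```python
# Hedged sketch: an evicted singlet (a block held by only one client cache)
# gets up to N "chances" -- it is forwarded to a random peer with a
# decremented count instead of being discarded; duplicates are simply dropped.
import random

N = 2  # recirculation count given to a fresh singlet (illustrative value)

def evict(block_id, chances, peer_caches, copy_count):
    """Fate of a block evicted from a client's cache.

    copy_count[block_id] is the number of client caches currently holding
    the block. Returns the peer cache the block was forwarded to, or None.
    """
    if copy_count[block_id] == 1 and chances > 0:
        target = random.choice(peer_caches)     # bounce the singlet to a peer
        target[block_id] = chances - 1          # with one fewer chance left
        return target
    copy_count[block_id] -= 1                   # duplicate or exhausted: drop
    return None

peers = [dict(), dict()]
held = {"blk-7": 1, "blk-9": 3}
evict("blk-7", N, peers, held)   # singlet: forwarded to a peer, not discarded
evict("blk-9", N, peers, held)   # duplicate: simply dropped
```

The repeated forwarding of a singlet through `random.choice` is exactly the "bouncing" that the text identifies as the scheme's main source of extra system load.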

2.5.3 Studies Based on Cooperative Caching

Khalil & PeiQi (2007) assert that most of the existing cooperative caching protocols did not examine how to select the proxy server that would offer the best response time to a web client. They suggested that selecting the fastest server can improve the file transfer time compared to conventional cooperative proxy mechanisms, where the selection of the server is done randomly. Hence, Khalil & PeiQi (2007) have utilized a technique to

facilitate efficient server selection by dynamically measuring the data transfer

rates between proxy servers. The primary objective of the study was to improve the proxy-to-proxy file transfer time by selecting the fastest available proxy from a pool of cooperative proxies (Khalil & PeiQi 2007).

The performance of the Fastest Free Server (FFS) strategy was compared to the conventional Random Selection (RS) strategy. The results indicated that FFS offers better performance in terms of mean response time (Khalil & PeiQi 2007). They also studied the performance benefits of integrating an efficient server selection mechanism and showed that the FFS strategy can be highly beneficial, as the file transfer time improves significantly when it is chosen as the preferred method.
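Under the stated assumption that each proxy tracks a measured transfer rate to its peers, the FFS choice can be sketched as follows; the field names are hypothetical.

```python
# Hedged sketch of Fastest Free Server selection: a miss is forwarded to the
# free (not busy) cooperating proxy with the highest measured transfer rate,
# instead of to a randomly chosen proxy.
def fastest_free_server(peers):
    free = [p for p in peers if not p["busy"]]
    if not free:
        return None                       # no free peer: fall back elsewhere
    return max(free, key=lambda p: p["rate_mbps"])

pool = [
    {"name": "proxy-A", "busy": False, "rate_mbps": 35.0},
    {"name": "proxy-B", "busy": True,  "rate_mbps": 80.0},
    {"name": "proxy-C", "busy": False, "rate_mbps": 52.5},
]
chosen = fastest_free_server(pool)   # proxy-C: fastest among the free proxies
```

The Random Selection baseline would simply pick any element of `pool`; FFS differs only in the filtering and the `max` over measured rates.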

The work of Khalil & PeiQi (2007) can be further investigated by including real-life web traces to analyze the effectiveness of the server selection strategies. Similarly, the efficiency of the FFS scheme has to be validated in a dynamic network environment with data from real cache networks.


Chen et al (2008) have introduced a hybrid and cooperative browser-level web caching system based on Chord. In that system, the nodes can contact each other for web cache sharing. The advantage of the system is that when a miss occurs in the local web cache and the web cache server, the request is automatically sent to a remote node in the Chord ring or its web cache server. Hence the effective workload of the server decreases and the hit ratio increases because of resource sharing between the proxy server web caches (Chen et al 2008).

Initially, Chen et al (2008) designed the hybrid browser-level web caching system based on the peer-to-peer system Chord. Then, using an improved resource searching algorithm, the system was extended to the local web cache and web cache server levels.
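The Chord-style mapping of cached URLs to browser nodes can be illustrated with consistent hashing. This sketch omits finger tables and node churn, and all names are hypothetical.

```python
# Illustrative consistent-hashing lookup in the style of Chord: each cached
# URL is hashed onto an identifier ring, and the node responsible for it is
# the first node clockwise from the URL's identifier.
import hashlib

RING_BITS = 16

def ring_id(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** RING_BITS)

def successor(key, node_names):
    """The node responsible for `key`: first node at or after the key's id."""
    key_id = ring_id(key)
    ids = sorted((ring_id(n), n) for n in node_names)
    for node_id, name in ids:
        if node_id >= key_id:
            return name
    return ids[0][1]                      # wrap around the ring

nodes = ["browser-1", "browser-2", "browser-3"]
owner = successor("http://example.com/a.css", nodes)
# Every node computes the same owner, so a local miss can be routed directly.
```

Because all nodes agree on the hash function, a node that misses locally knows which remote browser cache to ask without any central directory.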

Chen et al (2008) have evaluated the performance of the system through hit ratio simulations, comparing the hits in the local web cache, the hits in the web cache server and the hits in the Chord web cache server for various web cache sizes. The results showed that the system achieved a better hit ratio than the other two systems.

Baek et al (2009) have designed a new object management policy

that can be applied in the hybrid architecture for cooperative caching. The

new policy has the provision for discarding the web objects that are not likely

to be accessed by clients. This approach employed a predictive technique using a table of rules derived from the actual history of web object requests. It also employed summary tables in each proxy cache to limit the number of executions of the expensive predictions.

In the object management policy of Baek et al (2009), each lower-level proxy cache has a summary table containing its neighbor proxy caches' object information. The summary table was used to identify the


location of the requested object when the object is not available in the proxy

cache. In addition, in order to boost the performance of the proxy cache, their

solution limits the number of executions of Finite Inductive (FI) Systems

depending upon the current available space of the proxy cache. The object

management model has reduced the response time for the requested object by minimizing unnecessary traffic and bandwidth usage between the lower-level proxy caches and the upper-level proxy cache (Baek et al 2009).

Wang et al (2013) have designed an intra-Autonomous System (AS) cache cooperation scheme to effectively control the redundancy level within the AS. This scheme enables the neighboring nodes in an AS to cooperate with each other and make efficient use of the limited cache resources. The intra-AS cache cooperation scheme has been effective in solving caching issues in Content-Centric Networking (CCN).

Considering the fact that controlling the redundancy level is very important in improving the AS-level caching performance, Wang et al (2013) have introduced a greedy heuristic algorithm named CRE-P to eliminate redundancy in the caching network of CCN. The performance evaluation consisted of two parts. Initially, the efficiency of the greedy heuristic algorithm in yielding an approximate solution to CRE-P was evaluated. Then, the core benefits brought by the intra-AS cache cooperation scheme were analyzed from different aspects. The results of trace-based simulation showed that the simple greedy heuristic is very efficient in eliminating redundancy.

According to Wang et al (2013), the intra-AS cache cooperation scheme offered two benefits. Firstly, caching slots released by redundancy elimination could be used to cache other popular items. Secondly, the scheme provided a broader view of the items cached at neighboring nodes, which helped in serving locally unsatisfied requests.


Wang et al (2013) have evaluated the algorithms using network topologies derived from various sources, namely AS 1755, AS 3967, Brite1 and Brite2. The router-level topologies AS 1755 and AS 3967 were derived from the Rocketfuel project (Spring et al 2004), while Brite1 and Brite2 were generated with the BRITE topology generator using a hierarchical top-down model. The performance of the intra-AS cache cooperation scheme was compared with the ubiquitous LRU scheme for varying cache sizes. The simulation results showed that the intra-AS cache cooperation scheme performs better in terms of hit rate and bandwidth saving across the four topologies. Similarly, the scheme improves the caching performance of the access routers and reduces the AS cross-traffic without overloading the internal links. Finally, the scheme can be further improved by combining the vertical redundancy elimination approach with the horizontal cache cooperation scheme.

Liu et al (2013) have designed a cooperative caching scheme for Content-Oriented Networking (CON) with the intention of minimizing the content access delay for mobile users. They have formulated the caching problem as a mixed integer programming model and developed a heuristic solution based on Lagrangian relaxation. Simulation results show that this scheme can greatly reduce the content access delay.

Liu et al (2013) have compared the performance of their approach

with LRU (removing the content least recently used in the cache-router) and

FIFO (removing the content first cached in the cache-router) scheme in the

same environment. The primary evaluation metric was the average content access delay, measured while varying the cache size of the cache-routers, the number of users and the movement speed. The results showed that the suggested scheme significantly improved the delay performance


compared to existing algorithms such as LRU and FIFO under various cache

sizes, number of users and speed.
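The two baseline replacement policies used in the comparison can be contrasted with a small sketch; `Cache` and `run` are hypothetical names, and this is an illustration only.

```python
# Illustrative contrast of the two baselines: LRU evicts the least recently
# used content, while FIFO evicts the content cached first, regardless of
# whether it was reused afterwards.
from collections import OrderedDict

class Cache:
    def __init__(self, capacity, policy="lru"):
        self.capacity, self.policy = capacity, policy
        self.items = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert (evicting if full)."""
        hit = key in self.items
        if hit and self.policy == "lru":
            self.items.move_to_end(key)          # refresh recency; FIFO ignores reuse
        if not hit:
            if len(self.items) >= self.capacity:
                self.items.popitem(last=False)   # evict the oldest entry
            self.items[key] = True
        return hit

def run(policy, seq, capacity=2):
    cache = Cache(capacity, policy)
    return sum(cache.access(k) for k in seq)

requests = ["a", "b", "a", "c", "a"]
# LRU keeps the reused "a" when "c" arrives (2 hits); FIFO evicts it (1 hit).
lru_hits, fifo_hits = run("lru", requests), run("fifo", requests)
```

The single `move_to_end` call is the only difference between the two policies, which is why reuse-heavy request streams favour LRU.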

The scheme of Liu et al (2013) can be further developed by including detailed analytical models. In addition, the scheme has to be validated by measuring various metrics, such as the cache size and the maximum number of users that can be supported while ensuring an acceptable delay. Similarly, path prediction algorithms could be employed to detect regular user movement patterns and thereby improve the caching decisions.

Nikolaou et al (2013) have evaluated the efficiency and cost of different placement strategies for a distributed cache implemented on the clients of an online social network or web service, and have introduced a novel cache placement strategy that leverages the relationships between clients. In the

developed model, the service maintains a directory for content that tracks the location of objects. In addition, the service also informs the requesting clients about the location of the directory, so that the clients can cache, serve and push content based on the directives provided by the service (Nikolaou et al 2013). The model was compared with three other placement strategies, namely a minimalistic scheme, an opportunistic scheme and a popularity-based algorithm, and its performance was evaluated. The metrics employed for the evaluation were the local cache hit ratio, the global hit ratio and the local outgoing bandwidth.

The overall results of the simulations revealed that the client relationship based placement strategy performed favourably on these metrics. Nikolaou et al (2013) have suggested that the model can be extended further to

analyze whether social proximity is detrimental to latency when compared to

schemes that rely on geographic locality. Similarly, they also suggested

further research is necessary to study the possibility of designing a hybrid

strategy that combines the best features of the proactive and popularity

schemes.


Taghizadeh et al (2013) have suggested a novel cooperative caching scheme which can be used to minimize the electronic content provisioning cost in Social Wireless Networks (SWNET). The content provisioning cost depends on the service and pricing dependencies among various stakeholders such as the content providers (CP), the network service providers and the End Consumers (EC). Drawing on these dependencies in the content delivery business, Taghizadeh et al (2013) have developed practical network, service and pricing models which were used for creating object caching strategies with homogeneous and heterogeneous object demands. They have also analyzed the caching strategies using analytical and simulation models in the presence of selfish users that deviate from the network-wide cost-optimal policies. They also showed that selfishness can increase the user rebate only when the number of selfish nodes in an SWNET is less than a critical number.

John et al (2013) have designed A Proxy Agent for Client Systems (APACS), which acts as an intermediate system to control the access of users to any webpage or website. APACS enhanced client-server communication by adding network features and internet capabilities while taking into account the safety concerns of the networking environment. The client systems were installed with APACS, and the users further have to be authorized to get access to the Internet. APACS also offered the users a built-in browser with limited options. The administrator has the privilege of restricting any website to a particular user or a group of users. Thus the data usage was controlled and the performance of the system could be improved. John et al (2013) described APACS as a system that works between a client system and a server, where the client system includes a built-in web browser.


John et al (2013) have evaluated the performance of APACS by comparing it with other popular web browsers. The results of the comparison showed that APACS performed better than several browsers when features like speed and success rate were considered. The major advantage of APACS is that it can act as an internet administration tool for Windows OS.

2.6 SUMMARY

This chapter has focused on the important work carried out by

various researchers on the web caching and prefetching techniques. Particular

attention was given to cooperative web caching systems. A thorough review of the literature was performed to identify the techniques, classifications,

algorithms and research works on the cooperative web caching systems.

From the survey, it is identified that machine learning algorithms find only limited application in information retrieval. A training data set admitted into the cache without any admission control will admit redundant data for further processing. In the case of pattern identification, all the intelligent algorithms like NN, AI and GA are complex in nature and computationally very expensive in making the caching decisions. Traditional cache management policies are not suitable for web caching systems, and the existing distributed systems do not share their browser objects. The Chord network system is appealing because of the way it allows contents to be shared within an authenticated group. Hence, based on the survey, an integrated cluster-based proxy cache with a hybrid cache management system is developed for information sharing, which is achieved by creating a Chord network within a group.

The function of Access Log Manager (ALM), which is the first

module of the integrated system, is described in the next chapter.