View
214
Download
0
Tags:
Embed Size (px)
Citation preview
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
1
March 15, 2004
George PallisKonstantinos Stamos
Athena VakaliDimitrios Katsaros
Antonis Sidiropoulos Yannis Manolopoulos
Programming Languages & Software Engineering Lab
Department of InformaticsAristotle Univ. of Thessaloniki, Greece
Replication based on Objects Load under a Content Distribution Network
http://www.csd.auth.gr/~oswindshttp://www.csd.auth.gr/~oswinds
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
2
March 15, 2004
INTRODUCTION
The Problem• Congested lines, obsolete backbones, multimedia content,
increasing user populations Great Internet Traffic JamSolutions• Increasing Bandwidth• Web Caching
– temporary storage of objects closer to the consumer • Web Prefetching
– the process of predicting future requests for Web objects and bringing those objects into the cache in the background, before an explicit request is made for them
• Content Distribution Networks (CDNs) – moving the content to the “edge” of the Internet, closer to the
end-user
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
3
March 15, 2004
Content Distribution Network (CDN)
• A CDN (such as Akamai, Mirror Image etc.) is a network of cache servers, called surrogate servers, owned by the same Internet Service Provider (ISP) that delivers content to users on behalf of content providers.
• Surrogate servers are typically shared, delivering content belonging to multiple Web sites.
• The networking functional components of a CDN include user redirection services, distribution services, accounting and billing
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
4
March 15, 2004
Content Distribution Network (2)
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
6
March 15, 2004
CDNs Challenging Issues
• Replica/Surrogate server placement problem – where should be located the surrogate servers?
• Content selection problem – which content should be outsourced?
• Content replication problem – which surrogate servers should replicate the
outsourced content?
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
7
March 15, 2004
Motivation
• We study the Content Replication Problem– NP-Complete– Existing heuristic methods for optimally replicating the
“outsourced content” in surrogate servers over a CDN• Random
– Naive, unscalable approach
• Popularity– Requires popularity statistics (e.g. users traffic)
• Greedy-single– Requires popularity statistics, huge memory requirements
• Greedy-global– Requires popularity statistics, huge memory requirements
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
8
March 15, 2004
Contribution
• We provide a novel strategy for optimally placing outsourced objects in CDN’s surrogate servers, integrating both the network’s latency and the objects’ load.
• We develop an analytic simulation environment to test the efficiency of the proposed scheme. – Using real and synthetically generated test data, we
show the robustness and efficiency of the proposed method which can reap performance benefits better than an analogous heuristic method.
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
9
March 15, 2004
The il2p (integration of latency and load object placement) Algorithm
• Main ideaConsidering that all the outsourced objects are initially placed on an origin server, the content replication problem is separated into two sub-problems:–Choice of the best surrogate server to replicate an outsourced object (based on the network’s latency)–Arrangement priorities for outsourced objects replication (based on the objects’ load)
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
10
March 15, 2004
The il2p Algorithm (2)Choice of the best surrogate server to replicate
an outsourced object• For each outsourced object, we select its
optimal surrogate server such that it minimizes:
• Dik(x) is the “distance” to a replica of object k from surrogate server i under the placement x
• the distance reflects the latency (the elapsed time between when a user issues a request and when it receives the response)
• N is the number of surrogate servers, K is the number of outsourced objects, λi is the request rate for surrogate server i, and pκ is the probability that a client will request the object k.
N
i
K
kikN
jj
ik xDp
xt1 1
1
)()(cos
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
11
March 15, 2004
The il2p Algorithm (3)Arrangement priorities for outsourced
objects replication• From the objects assigned to a single
server we replicate the one which has the maximum utility value.
Utility_Valuek=loadk*latencyk,where loadk=access_ratek*sk
– latencyk is the latency that the object k produces if it is replicated to the surrogate server which has been determined by the previous step, loadk is the total load due to object k and access_ratek is defined as the number of accesses of object k per unit time.
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
12
March 15, 2004
The il2p Algorithm: The Flowchart
For each outsourced object, we find which is the best surrogate server in order to place it (produces
the minimum network latency (min Dik(x)) )
We select from all the pairs of outsoursed object – surrogate server
that have been occurred in the previous step, the one with the maximum utility value and thus place this object to that surrogate
server
Surrogate servers become
full?
NoYes
All the “outsourced objects” are stored in the origin server and all the CDN’s surrogate servers
are empty
The final Placement
CDN Infrastructureoutsourced
objects
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
14
March 15, 2004
Simulation Testbed
• We use trace-driven simulations developing an analytic simulation environment:– a system model simulating the CDN infrastructure
(CDNsim)
– a network topology generator
– a Web site generator, modeling file sizes, linkage, etc.
– a client request stream generator capturing the main characteristics of Web users' behavior
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
15
March 15, 2004
System Model
• We have implemented a simulation model for CDNs (CDNsim) using the ParaSol library– CDN networking issues are computed dynamically via the
simulation model– Provides an implementation as close as possible to the working
TCP/IP protocol– Uses Bloom filters for memory-efficient CDN simulation
• We consider a CDN infrastructure consisting of 20 surrogate servers– All the surrogate servers have the same storage capacity
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
16
March 15, 2004
CDNsim: A Simulation Tool for Content Distribution Networks
http://oswinds.csd.auth.gr/~cdnsim
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
17
March 15, 2004
Network Topology
• Using the GT-ITM internetwork topology generator, we generate a random network topology, called Waxman, with a total of 1008 nodes
• Using BGP routing data collected from a set of 7 geographically-dispersed BGP peers in April 2000, we construct an AS-level Internet topology with a total of 3037 nodes
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
18
March 15, 2004
Web Site Generation
• Using the R-MAT tool, we construct Web graphs (Web sites)– The R-MAT produces realistic Web graphs capturing
the essence of each graph in only a few parameters
• We create two graphs with varying number of nodes (objects)– sparse-density graph (4000 nodes)
– moderate-density graph (3000 nodes)
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
19
March 15, 2004
Request Streams Generation
• Using a requests’ generator, we generate clients’ transactions
• Given a Web site graph, we generate transactions as sequences of page traversals (random walks) upon the site graph
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
20
March 15, 2004
Performance Evaluation• Examined Methods
– Random: Assigns the outsourced objects to CDN’s surrogate servers randomly subjected to the storage constraints. Both the outsourced object and the surrogate server are selected by uniform probability.
– Popularity: Each surrogate server stores the most popular outsourced objects among its clients. The node sorts the objects in decreasing order of popularity and stores as many outsourced objects in this order as the storage constraint allows.
– Lat-cdn: The outsourced objects are placed to surrogate servers with respect to the total network’s latency, without taking into account the objects’ popularity. Each surrogate server stores the outsourced objects which produce the maximum latency.
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
21
March 15, 2004
il2p for Typical Object Sizes• Average Response Time for Moderate-density
Web site graphs (3000 objects)
The size of the cache is expressed in terms of the percentage of the total number of bytes of the Web site
As the cache size increases, the average response time also increases since the larger in size caches satisfy more requests
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
22
March 15, 2004
il2p for Typical Object Sizes (2)• Average Response Time for Sparse-density
Web site graphs (4000 objects)
The difference in performance between il2p and the other three heuristics is quite significant (in most cases around 5% absolute improvement with respect to lat-cdn and consistently around 25% absolute improvement with respect to other two heuristics )
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
23
March 15, 2004
il2p Limitations• Average Response Time for Real Web Site
(Stanford Web site - 281903 Web objects)
We use a different scale for the cache sizes (compared with the previous ones) due to the large amount of objects of the Stanford Web site
The response times are too small because the majority of objects of Stanford Web site have very small sizes
2nd International Workshop on Challengesin Web Information Retrieval and Integration (WIRI)
24
March 15, 2004
Conclusion
• We addressed the content replication problem for CDNs.
• Our goal is to find an efficient placement so that when clients fetch objects from the nearest surrogate server, the average response time is minimized.
• For the future, we plan to investigate the content replication problem in CDNs integrating both caching and replication.