Clustering Web Content for Efficient Replication
Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz
EECS Department, UC Berkeley
*Microsoft Research
2
Motivation
• Amazing growth in WWW traffic
  – Daily growth of roughly 7M Web pages
  – Annual growth of 200% predicted for next 4 years
• Content Distribution Network (CDN) commercialized to improve Web performance
  – Un-cooperative pull-based replication
• Paradigm shift: cooperative push more cost-effective
  – Strategically pushed replicas can achieve close to optimal performance [JJKRS01, QPV01]
  – Improving availability during flash crowds and disasters
• Orthogonal issue: granularity of replication
  – Per Website? Per URL? -> Clustering!
  – Clustering based on aggregated clients’ access patterns
• Adapt to users’ dynamic access patterns
  – Incremental clustering (online and offline)
3
Outline
• Motivation
• Simulation methodology
• Architecture
• Problem Formulation
• Granularity of replication
• Dynamic replication
  – Static clustering
  – Incremental clustering
• Conclusions
4
Simulation Methodology
• Network Topology
  – Pure-random, Waxman & transit-stub models from GT-ITM
  – A real AS-level topology from 7 widely-dispersed BGP peers
• Web Workload

Web Site | Period | Duration | # Requests (avg–min–max) | # Clients (avg–min–max) | # Client groups (avg–min–max)
MSNBC | Aug-Oct/1999 | 10–11am | 1.5M–642K–1.7M | 129K–69K–150K | 15.6K–10K–17K
NASA | Jul-Aug/1995 | All day | 79K–61K–101K | 5940–4781–7671 | 2378–1784–3011

  – Aggregate MSNBC Web clients with BGP prefix
    » BGP tables from a BBNPlanet router
    » 10K groups left, choose the top 10% covering >70% of requests
  – Aggregate NASA Web clients with domain names
  – Map the client groups onto the topology
• Performance Metric: average retrieval cost
  – Sum of edge costs from client to its closest replica
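The performance metric above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topology is an undirected weighted graph stored as a dictionary, the path to the closest replica is the cheapest path (Dijkstra), and the average is weighted by request count. The names `GRAPH`, `REQUESTS`, `shortest_costs` and `average_retrieval_cost` are hypothetical and do not come from the slides.

```python
import heapq


def shortest_costs(graph, source):
    """Dijkstra: cheapest path cost from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def average_retrieval_cost(graph, requests, replicas):
    """requests: {client_node: request_count}; replicas: nodes holding a copy.

    Assumption: the average is weighted by the number of requests.
    """
    total_cost = 0.0
    total_requests = 0
    for client, count in requests.items():
        dist = shortest_costs(graph, client)
        # Cost for this client = sum of edge costs to its closest replica.
        total_cost += count * min(dist[r] for r in replicas)
        total_requests += count
    return total_cost / total_requests


# Hypothetical line topology a - b - c - d with made-up edge costs.
GRAPH = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"b": 2.0, "d": 1.0},
    "d": {"c": 1.0},
}
REQUESTS = {"a": 10, "d": 30}

print(average_retrieval_cost(GRAPH, REQUESTS, ["b"]))
print(average_retrieval_cost(GRAPH, REQUESTS, ["b", "d"]))
```

Adding a second replica at node `d` serves the heavier client locally, which is the effect the metric is meant to capture.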
6
Conventional CDN: Un-cooperative Pull
[Figure: two ISPs (ISP 1, ISP 2), each with a client (Client 1, Client 2), a local DNS server and a local CDN server, plus a CDN name server and the Web content server. Arrows are labeled with the steps below.]
1. GET request
2. Request for hostname resolution
3. Reply: local CDN server IP address
4. Local CDN server IP address
5. GET request
6. GET request if cache miss
7. Response
8. Response
• Big waste of replication!
7
Cooperative Push-based CDN
[Figure: two ISPs (ISP 1, ISP 2), each with a client (Client 1, Client 2), a local DNS server and a local CDN server, plus a CDN name server and the Web content server. Arrows are labeled with the steps below.]
0. Push replicas
1. GET request
2. Request for hostname resolution
3. Reply: nearby replica server or Web server IP address
4. Redirected server IP address
5. GET request (alternative path: GET request if no replica yet)
6. Response
• Significantly reduces the # of replicas and, consequently, the update cost (only 4% of un-coop pull)
8
Problem Formulation
• Subject to a constraint on the total replication cost
• Find a replication strategy that minimizes the total access cost
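One way to write this objective down, as a sketch only. All symbols here are assumptions introduced for illustration and do not appear on the slides: $r_{c,u}$ is the number of requests from client $c$ for replication unit $u$ (a Website, cluster, or URL), $d(c,s)$ is the retrieval cost from client $c$ to server $s$, $x_{u,s} \in \{0,1\}$ says whether $u$ is replicated on $s$, $w_u$ is the cost of one replica of $u$, and $B$ is the replication budget.

$$\min_{x} \; \sum_{c} \sum_{u} r_{c,u} \cdot \min_{s \,:\, x_{u,s} = 1} d(c, s) \qquad \text{subject to} \qquad \sum_{u} \sum_{s} w_u \, x_{u,s} \le B$$

The inner minimum expresses that each request is served by the closest server holding a replica, matching the retrieval cost metric defined earlier.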
10
Replica Placement: Per Website vs. Per URL

Replication Scheme | States to Maintain | Computation Cost
Per Website | O(R) | f × O(R × S × C)
Per Cluster | O(R × K + M) | f × O(K × R × (K + S × C))
Per URL | O(R × M) | f × O(M × R × (M + S × C))

Where R: # of replicas/URL, K: # of clusters, M: # of URLs (M >> K), C: # of clients, S: # of CDN servers, f: placement adaptation frequency

• Use greedy placement
• 30 – 70% average retrieval cost reduction for Per URL
• Per URL is too expensive for management!
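The greedy placement mentioned on this slide can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes client-to-server costs are precomputed in a table, that the origin server always holds a copy, and that each round adds the one server that lowers the total retrieval cost the most. The names `total_cost`, `greedy_placement`, `DIST` and `REQUESTS` are hypothetical.

```python
def total_cost(requests, dist, replicas):
    """Sum over clients of request count times cost to the closest replica."""
    return sum(
        count * min(dist[client][s] for s in replicas)
        for client, count in requests.items()
    )


def greedy_placement(requests, dist, candidates, origin, num_replicas):
    """Pick replica sites one at a time, each time taking the best next site.

    Assumption: the origin server always holds the content, so it is the
    starting point of the replica set.
    """
    chosen = [origin]
    for _ in range(num_replicas):
        remaining = [s for s in candidates if s not in chosen]
        if not remaining:
            break
        best = min(
            remaining,
            key=lambda s: total_cost(requests, dist, chosen + [s]),
        )
        chosen.append(best)
    return chosen


# Made-up costs from three clients to the origin "o" and two CDN servers.
DIST = {
    "c1": {"o": 5, "s1": 1, "s2": 4},
    "c2": {"o": 5, "s1": 4, "s2": 1},
    "c3": {"o": 1, "s1": 3, "s2": 3},
}
REQUESTS = {"c1": 10, "c2": 20, "c3": 5}

print(greedy_placement(REQUESTS, DIST, ["s1", "s2"], "o", 1))
print(greedy_placement(REQUESTS, DIST, ["s1", "s2"], "o", 2))
```

For per-cluster replication, `REQUESTS` would hold the requests aggregated over all URLs in one cluster, so the same routine runs K times instead of M times.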
11
Clustering Web Content
• General clustering framework
  – Define the correlation distance between URLs
  – Cluster diameter: the max distance between any two members
    » Worst correlation in a cluster
  – Generic clustering: minimize the max diameter of all clusters
• Correlation distance definition based on
  – Spatial locality
  – Temporal locality
  – Popularity
  – Semantics (e.g., directory)
12
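Spatial Clustering
• URL spatial access vector
  – [Figure: client groups 1 to 4, with the blue URL's vector entries 0, 2, 0, 1]
• Correlation distance between two URLs defined as
  – Euclidean distance
  – Vector similarity

$$\text{cor\_dist}(A, B) = 1 - \frac{\sum_{i=1}^{k} A_i B_i}{\sqrt{\sum_{i=1}^{k} A_i^2 \times \sum_{i=1}^{k} B_i^2}}$$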
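The two spatial correlation distances, and the cluster diameter from the general framework, can be sketched as below. This is an illustration under assumptions: each URL is represented by a list with one access count per client group, and a URL with an all-zero vector is treated as maximally distant. The function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two spatial access vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - vector similarity (cosine) of two spatial access vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        # Assumption: a URL with no recorded accesses is treated as uncorrelated.
        return 1.0
    return 1.0 - dot / norm


def cluster_diameter(vectors, distance):
    """Max distance between any two members (worst correlation in a cluster)."""
    return max(
        (distance(a, b) for i, a in enumerate(vectors) for b in vectors[i + 1:]),
        default=0.0,
    )


print(euclidean_distance([3, 0], [0, 4]))
print(cosine_distance([1, 2], [2, 4]))
print(cluster_diameter([[0, 0], [3, 4], [0, 4]], euclidean_distance))
```

Note that the cosine-based distance ignores overall popularity (`[1, 2]` and `[2, 4]` are at distance 0), while the Euclidean distance does not.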
13
Clustering Web Content (cont’d)
• Popularity-based clustering

$$\text{cor\_dist}(A, B) = |\text{access\_freq}(A) - \text{access\_freq}(B)|$$

  – OR even simpler, sort them and put the first N/K elements into the first cluster, etc. (binary correlation)
• Temporal clustering
  – Divide traces into multiple individuals’ access sessions [ABQ01]
  – In each session,

$$\text{cor\_dist}(A, B) = 1 - \frac{\text{co\_occurrence}(A, B)}{\text{occurrence}(A) + \text{occurrence}(B)}$$

  – Average over multiple sessions in one day
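The simpler sort-and-split variant of popularity-based clustering can be sketched as follows. It assumes N is the number of URLs and K the number of clusters, and it rounds N/K up so that no URL is left over; as a result it may return fewer than K clusters when N is not much larger than K. The name `popularity_clusters` and the sample data are hypothetical.

```python
def popularity_clusters(access_freq, num_clusters):
    """Sort URLs by access frequency and cut the ranking into equal chunks.

    access_freq: {url: request_count}. Returns a list of clusters, most
    popular first. Chunk size is ceil(N / K), an assumption made here.
    """
    ranked = sorted(access_freq, key=access_freq.get, reverse=True)
    if not ranked:
        return []
    size = -(-len(ranked) // num_clusters)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


# Made-up access counts for five URLs.
FREQ = {"/a": 90, "/b": 10, "/c": 50, "/d": 5, "/e": 70}
print(popularity_clusters(FREQ, 2))
```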
14
Performance of Cluster-based Replication
• Tested over various topologies and traces
• Spatial clustering with Euclidean distance and popularity-based clustering perform the best
  – Even a small # of clusters (only 1-2% of the # of URLs) can achieve close to per-URL performance
[Figure: average retrieval cost vs. number of clusters (log scale, 1 to 1000), in two panels: MSNBC, 8/2/1999, 5 replicas/URL; and NASA, 7/1/1995, 3 replicas/URL. Curves: spatial clustering with Euclidean distance, spatial clustering with cosine similarity, temporal clustering, access frequency clustering.]
16
Static clustering and replication
• Two daily traces: old trace and new trace

Methods | Static 1 | Static 2 | Optimal
Traces used for clustering | Old | Old | New
Traces used for replication | Old | New | New
Traces used for evaluation | New | New | New

• Static clustering performs poorly beyond a week
  – Average retrieval cost almost doubles
[Figure: average retrieval cost on new traces (8/3, 8/4, 8/5, 8/10, 8/11, 9/27, 9/28, 9/29, 9/30, 10/1) for static clustering 1, static clustering 2, and reclustering + re-replication (optimal).]
17
Incremental Clustering
• Generic framework
  1. If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c
  2. Else create new clusters and replicate them
• Online incremental clustering
  – Push before accessed -> high availability
  – Predict access patterns based on semantics
  – Simplify to popularity prediction
  – Groups of URLs with similar popularity? Use hyperlink structures!
    » Groups of siblings
    » Groups of the same hyperlink depth: smallest # of links from root
18
Online Popularity Prediction
• Experiments
  – Use WebReaper to crawl http://www.msnbc.com on 5/3/2002 with hyperlink depth 4, then group the URLs
  – Use corresponding access logs to analyze the correlation
  – Groups of siblings have the best correlation
• Measure the divergence of URL popularity within a group:

$$\text{access freq span} = \frac{\text{std\_dev}(\text{access\_frequency})}{\text{average}(\text{access\_frequency})}$$
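The access frequency span is the standard deviation of a group's access frequencies divided by their average (the coefficient of variation). A minimal sketch, assuming the population standard deviation is meant (the slides do not say which variant was used) and using a hypothetical function name:

```python
import statistics


def access_freq_span(frequencies):
    """std_dev(access_frequency) / average(access_frequency) for one group.

    Assumption: population standard deviation. Returns 0.0 for a group
    whose URLs were never accessed, to avoid dividing by zero.
    """
    mean = statistics.mean(frequencies)
    if mean == 0:
        return 0.0
    return statistics.pstdev(frequencies) / mean


# Made-up access counts for the URLs of two groups.
print(access_freq_span([10, 10, 10]))
print(access_freq_span([2, 4, 4, 4, 5, 5, 7, 9]))
```

A smaller span means the URLs in the group have more similar popularity, which is what makes the group useful for predicting the popularity of a new member.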
19
Online Incremental Clustering
• Semantics-based incremental clustering
  – Put a new URL into the existing cluster with the largest # of siblings
  – When there is a tie, choose the cluster with more replicas
• Simulation on 5/3/2002 MSNBC
  – 8-10am trace: static popularity clustering + replication
  – At 10am: 16 new URLs emerged - online incremental clustering + replication
  – Evaluation with 10am-12pm trace: 16 URLs have 33,262 requests
[Figure: example hyperlink tree with URLs 1 to 6 and a new URL "?" to be assigned to an existing cluster.]
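The sibling-based assignment rule can be sketched as follows. This is an illustration under assumptions: siblings are URLs with the same parent page in the crawled hyperlink tree, the tree is given as a child-to-parent map, and a URL with no clustered sibling is left for the caller to place in a new cluster. All names and sample data are hypothetical.

```python
def assign_new_url(new_url, parent, clusters, num_replicas):
    """Return the id of the cluster holding the most siblings of new_url.

    parent: {url: parent_url} from the hyperlink tree (must include new_url).
    clusters: {cluster_id: set of urls}. num_replicas: {cluster_id: int}.
    Ties are broken in favor of the cluster with more replicas.
    Returns None when no cluster contains a sibling (assumption: the caller
    then creates a new cluster).
    """
    siblings = {
        u for u, p in parent.items() if p == parent[new_url] and u != new_url
    }

    def score(cluster_id):
        return (len(clusters[cluster_id] & siblings), num_replicas[cluster_id])

    best = max(clusters, key=score)
    if not clusters[best] & siblings:
        return None
    return best


# Made-up hyperlink tree and clustering.
PARENT = {
    "/news/a": "/news", "/news/b": "/news", "/news/c": "/news",
    "/sports/x": "/sports", "/news/new": "/news", "/solo/new": "/solo",
}
CLUSTERS = {"hot": {"/news/a", "/news/b"}, "cold": {"/news/c", "/sports/x"}}
REPLICAS = {"hot": 5, "cold": 2}

print(assign_new_url("/news/new", PARENT, CLUSTERS, REPLICAS))
```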
20
Online Incremental Clustering & Replication Results
Average retrieval cost reduction (16 URLs)
• Compared with no replication of new URLs: - 12.5%
• Compared with random replication of new URLs: - 21.7%
• Compared with static clustering + replication (oracle): - 200%
21
Conclusions
• CDN operators: cooperative, clustering-based replication
  – Cooperative: big savings on replica management and update cost
  – Per URL replication outperforms per Website scheme by 60-70%
  – Clustering solves the scalability issues, and gives the full spectrum of flexibility
    » Spatial clustering and popularity-based clustering recommended
• To adapt to users’ access patterns: incremental clustering
  – Hyperlink-based online incremental clustering for
    » High availability
    » Performance improvement
  – Offline incremental clustering performs close to optimal
22
Offline Incremental Clustering
• Study spatial clustering and popularity-based clustering
• Step 1: assign new URLs to existing clusters
  – Only when the correlation within that cluster (its diameter) is unchanged
  – Replicate the URL to the cluster's existing replicas
• Step 2: Un-matched URLs - static clustering and replication
• Performance close to complete re-clustering + re-replication, with only 30-40% of the replication cost
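Step 1 can be sketched as follows for spatial clustering with Euclidean distance. This is an illustration, not the authors' code: it assumes a new URL joins the first cluster whose diameter stays unchanged, which holds exactly when the new URL is no farther from any member than the current diameter. URLs that match no cluster are returned as `None` and left for step 2. The names and data are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diameter(vectors):
    """Max pairwise distance inside one cluster (0.0 for a single member)."""
    return max(
        (euclidean(a, b) for i, a in enumerate(vectors) for b in vectors[i + 1:]),
        default=0.0,
    )


def assign_offline(new_vector, clusters):
    """Add new_vector to the first cluster whose diameter does not grow.

    clusters: list of clusters, each a list of access vectors (modified in
    place). Returns the index of the cluster joined, or None if unmatched.
    Assumption: taking the first matching cluster; other tie-breaks are possible.
    """
    for idx, members in enumerate(clusters):
        current = diameter(members)
        if all(euclidean(new_vector, m) <= current for m in members):
            members.append(new_vector)
            return idx
    return None


# Made-up access vectors for two existing clusters.
CLUSTERS = [[[0, 0], [4, 0]], [[10, 10], [10, 12]]]
print(assign_offline([2, 0], CLUSTERS))
print(assign_offline([50, 50], CLUSTERS))
```

Because a matched URL only reuses the replicas its cluster already has, no new placement has to be computed for it, which is consistent with the lower replication cost reported above.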