Bjorn Landfeldt, The University of Sydney
ELEC 5501 Advanced Communication Networks
Web Caching and Content Distribution Networks
Outcomes
• Understand the drive for content replication
• Awareness of the differences and similarities between caching and CDNs
• Awareness of the current best practices and standards
• Understand how replication can help increase QoS
Problem
• Massive amounts of data stored on servers
• Server capacity and network capacity limited
• Expensive to go “long distances” over the Internet
• Solution: replicate content or cache content
• Today, the web - tomorrow any data
Overview of Web Caching
• Cache server (proxy)
• Why caching in the network?
• Hierarchical caching
• Problems with caching
Strongly based on Keith Ross’s Tutorial
Cache Server
• A cache is both a server and a client
[Figure: Clients exchange HTTP requests and responses with the Cache Server (Proxy) inside the ISP boundary; the Cache Server exchanges HTTP requests and responses with the Origin Servers outside it.]
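The dual role of the cache can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the class name `CacheServer`, the `fake_origin` function, and the example URL are all hypothetical, and a real proxy would also handle headers, expiry, and errors.

```python
from typing import Callable, Dict


class CacheServer:
    """Toy proxy cache: a server to its clients, a client to the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin   # client role: how to reach the origin
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def handle_request(self, url: str) -> bytes:
        """Server role: answer a client request, from the store if possible."""
        if url in self._store:
            self.hits += 1
            return self._store[url]
        self.misses += 1
        body = self._fetch(url)           # client role: request to the origin
        self._store[url] = body
        return body


origin_calls = []


def fake_origin(url: str) -> bytes:
    """Stand-in for the origin server (hypothetical); records each request."""
    origin_calls.append(url)
    return ("content of " + url).encode()


proxy = CacheServer(fake_origin)
first = proxy.handle_request("http://example.org/a")   # miss: goes to origin
second = proxy.handle_request("http://example.org/a")  # hit: served locally
```

Only the first request reaches the origin; the second is answered from the cache's own store.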
Why Cache in the Network
• Reduce latency by avoiding slow links between client and origin server
– Low bandwidth links
– Congested links
• Reduce traffic on links
– Between institutional network and regional ISP
– Reduce traffic on transoceanic links
• Spread load of overloaded origin server to caches
– An Internet dense with caches allows a content provider to offer high performance distribution at low cost
• Inexpensive server
• Low-bandwidth Internet connection
Implications of Cache in the Network
• Network caching complements client caching
• Paradigm shift in traffic engineering
– Bandwidth is no longer the only shared resource; now there is bandwidth and storage
Hierarchical Caching
• Each ISP can have a cache
• ISPs higher in the hierarchy have
– Larger user populations
– Higher hit rates
[Figure: hierarchy of National ISP, Regional ISPs, Local ISP, and Clients, with Origin Servers; a cache is marked at each ISP level.]
Cache Chaining
• Hierarchies use cache chaining
• All communications along chain can be over HTTP
[Figure: client, cache, server]
User configures browser to point to cache
[Figure: client, cache, cache, server]
User configures browser to point to 1st cache, and 1st cache points to 2nd cache ...
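A minimal sketch of chaining, assuming each cache knows either a parent cache or the origin. The class `ChainedCache`, the cache names, and the URL are hypothetical; the `path` list only records which hops a request touched.

```python
from typing import Callable, Dict, List, Optional


class ChainedCache:
    """One cache in a chain; a miss is passed to the parent cache or the origin."""

    def __init__(self, name: str,
                 parent: Optional["ChainedCache"] = None,
                 origin: Optional[Callable[[str], str]] = None) -> None:
        if (parent is None) == (origin is None):
            raise ValueError("give exactly one of parent or origin")
        self.name = name
        self.parent = parent
        self.origin = origin
        self.store: Dict[str, str] = {}

    def get(self, url: str, path: List[str]) -> str:
        path.append(self.name)
        if url in self.store:
            return self.store[url]            # hit: stop here
        if self.parent is not None:
            body = self.parent.get(url, path)  # miss: ask the next cache
        else:
            path.append("origin")
            body = self.origin(url)            # top of chain: ask the origin
        self.store[url] = body
        return body


second = ChainedCache("cache2", origin=lambda url: "body of " + url)
first = ChainedCache("cache1", parent=second)
other = ChainedCache("cache3", parent=second)

cold_path: List[str] = []
first.get("/index.html", cold_path)     # misses everywhere, reaches the origin
warm_path: List[str] = []
first.get("/index.html", warm_path)     # answered by the first cache
sibling_path: List[str] = []
other.get("/index.html", sibling_path)  # misses locally, hits the shared parent
```

The third request shows why caches higher in a hierarchy see higher hit rates: the parent serves objects that any of its children fetched earlier.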
Cooperative Caching
• Multiple sibling caches within a single ISP
• One or more of the siblings could contain the requested object
• Cooperation
– ICP: siblings send messages to each other to find a copy of object
– CARP: URL space partitioned
[Figure: sibling caches serving clients]
Caching Challenges
• Cache consistency:
– Cache often must guess whether a stored object is stale or fresh
• Dynamic content:
– Caches shouldn't cache outputs of CGI scripts
• Hit counts and personalization:
– Caches can cause hit count calculations and cookie transactions to fail
• Less-savvy users and privacy-concerned users:
– How do you get a user to point his/her browser to a cache?
• Access control:
– How do you make sure that the seller of the documents gets paid?
– Legal and security restrictions
• Enormous multimedia files:
– Disk storage is increasing at a rate of approx 60% a year still!
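The consistency guess can be illustrated with a small freshness check. This is a sketch under stated assumptions, not something the slides specify: when the origin gives no explicit lifetime, the cache treats an object as fresh for a fraction of the time that had passed since it was last modified. The 10% default is a commonly cited heuristic, and the function and parameter names are hypothetical.

```python
from typing import Optional


def freshness_lifetime(seconds_since_modified_at_fetch: float,
                       max_age: Optional[float] = None,
                       heuristic_fraction: float = 0.1) -> float:
    """How long (in seconds) a stored object is assumed to stay fresh."""
    if max_age is not None:
        return max_age  # the origin stated a lifetime explicitly: no guessing
    # Assumption: an object unchanged for a long time is unlikely to change soon.
    return heuristic_fraction * seconds_since_modified_at_fetch


def is_fresh(age_in_cache: float,
             seconds_since_modified_at_fetch: float,
             max_age: Optional[float] = None) -> bool:
    """True if the cache may serve the object without revalidating it."""
    return age_in_cache < freshness_lifetime(seconds_since_modified_at_fetch, max_age)


TEN_DAYS = 10 * 24 * 3600
one_hour_old = is_fresh(3600, TEN_DAYS)           # well inside the guessed lifetime
two_days_old = is_fresh(2 * 24 * 3600, TEN_DAYS)  # past the guessed lifetime
explicit_ok = is_fresh(30, TEN_DAYS, max_age=60)
explicit_stale = is_fresh(90, TEN_DAYS, max_age=60)
```

A guess like this can be wrong in both directions, which is exactly the consistency problem: the cache may serve a stale object or revalidate one that has not changed.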
Replication Caching
• When an FTP or HTTP server is very busy it can replicate itself
– Load balancing distributes the load across all servers
• Round-robin DNS
– Maps a single host name to multiple servers with different IP addresses
– DNS rotates the IP addresses each time it receives a request
• Re-directions
– Web server returns a re-direction to a parallel server
• Can be done with a 301 Moved Permanently and a Location: header in the response message
• Main server re-directs request to a pool of servers
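The rotation performed by round-robin DNS can be sketched as follows. This is a toy model, not a real DNS server: the class `RoundRobinDNS`, the host name, and the documentation-range IP addresses are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List


class RoundRobinDNS:
    """Toy resolver that rotates the address list after every query."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self._records: Dict[str, Deque[str]] = {
            name: deque(addrs) for name, addrs in records.items()
        }

    def resolve(self, name: str) -> List[str]:
        addrs = self._records[name]
        answer = list(addrs)   # clients typically try the first address
        addrs.rotate(-1)       # next query sees the list shifted by one
        return answer


dns = RoundRobinDNS({"www.example.org": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]})
first_choices = [dns.resolve("www.example.org")[0] for _ in range(4)]
```

Note that the rotation knows nothing about server load or failures, and a resolver that caches one answer keeps sending its users to the same server.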
Round Robin DNS: Advantages & Disadvantages
• Advantages
– Inexpensive
– Easy to set up
– Application OS independent
– Requires no resources from the application servers
• Disadvantages
– Doesn't monitor server load
– Doesn't remove failed servers from the rotation
– Won't work well if servers are of different size/power
– Doesn't work well if session state must be maintained
– DNS Caching causes problems
DNS Redirection
• Some intermediary intercepts the request, and directs it to a selected site.
– Layer 4-7 switching? E.g., look at URL or server IP address.
– Interpose on the binding procedure, before the client sends the request itself.
• Smart clients, Active Names, RPC binding, or DNS lookup
• Most third-party CDNs are based on DNS servers that select the cache/replica site on DNS lookup for the request.
• Akamai, Digital Island, Web hosting providers (e.g., Exodus), etc.
• Like DNS-RR....but smarter...
Pre-fetching Cache
• Retrieves specific pages or sites at regular intervals
– Can also pre-fetch pages that are outside the referenced site
– Pre-fetch their referenced pages
– Can pre-fetch a hierarchy of pages across a number of sites
• May also perform periodic up-to-date checks on all documents in the cache
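One way to picture pre-fetching a hierarchy of pages is a breadth-first walk over links up to a chosen depth. The sketch below is an assumption about how such a walk might look, not a description of any particular product; `prefetch`, the tiny `SITE` map, and its link format (links separated by spaces) are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def prefetch(start_url: str,
             fetch: Callable[[str], str],
             links_of: Callable[[str], Iterable[str]],
             max_depth: int) -> Dict[str, str]:
    """Fetch start_url and the pages it references, up to max_depth links away."""
    cache: Dict[str, str] = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in cache:
            continue                      # already pre-fetched
        body = fetch(url)
        cache[url] = body
        if depth < max_depth:
            for link in links_of(body):
                if link not in cache:
                    queue.append((link, depth + 1))
    return cache


# Hypothetical site: each page body is just its outgoing links.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": ["/d"], "/d": []}


def fake_fetch(url: str) -> str:
    return " ".join(SITE[url])


depth_one = prefetch("/", fake_fetch, str.split, max_depth=1)
depth_two = prefetch("/", fake_fetch, str.split, max_depth=2)
```

Running the same walk on a timer would give the "at regular intervals" behaviour; the depth limit keeps the cache from pulling in an unbounded number of pages.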
Cache Effectiveness
• Previous work has shown that hit rate increases with population size
• However, single proxy caches have practical limits
– Load, network topology, organizational constraints
• One technique to scale the client population is to have proxy caches cooperate
Hierarchical Caches
Idea: place caches at exchange or switching points in the network, and cache at each level of the hierarchy.
Resolve misses through the parent.
[Figure: clients downstream, caches at each level, and the origin Web site upstream across the Internet]
Cache Array Routing Protocol (CARP)
• A set of caching proxies can effectively function as a single logical cache
• Uses a hash function to partition the URLs across caches
• All queries are done over HTTP
– No new application layer protocol such as ICP
– Can take advantage of HTTP/1.1
• Implemented in MS and Netscape cache server products
Operation
• A client trying to locate a cached resource targets the request to the appropriate cache by applying a hash function
• The hash function uses the request URL and the identity of the proxy members to construct a resolution path
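The idea of combining the URL with each member's identity can be sketched as below. This is not the hash function defined for CARP: SHA-256 is swapped in for simplicity, load factors are ignored, and the member names and URL are hypothetical. The sketch only shows the selection rule of scoring every (member, URL) pair and sending the request to the highest-scoring member.

```python
import hashlib
from typing import List


def _score(url: str, member: str) -> int:
    """Combined score for one (member, URL) pair; SHA-256 is an assumption."""
    digest = hashlib.sha256((member + "|" + url).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_member(url: str, members: List[str]) -> str:
    """Route the URL to the member with the highest combined score."""
    return max(members, key=lambda m: _score(url, m))


members = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]
url = "http://www.example.org/index.html"
chosen = pick_member(url, members)

# Removing a member that was not chosen leaves the routing of this URL unchanged.
unrelated = next(m for m in members if m != chosen)
remaining = [m for m in members if m != unrelated]
chosen_after_removal = pick_member(url, remaining)
```

Because every client computes the same scores, they all agree on the target without exchanging any messages between caches.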
Hash Routing Overview
• Choose a hash function h() which maps URLs to a hash space
– Let the hash space be {1,…,60}
– Let h() be the sum of the ASCII representation of the characters in the URL, modulo 60
• Partition hash space: one set for each sibling
– Client hashes URL, determines set to which hashed URL belongs and sends request to corresponding sibling
– Set for cache 1 = {1,…,30}, set for cache 2 = {31,…,60}
– h(URLa) = 35, send request to cache 2
– If sibling does not have object, obtain from origin server
Hashing: Cache Array Routing Protocol (CARP)
[Figure: a hash function routes “GET www.hotsite.com” to one of four caches partitioned by name range (a-f, g-p, q-u, v-z), which connect to the Internet]
Advantages
1. single-hop request resolution
2. no redundant caching of objects
3. allows client-side implementation
4. no new cache-cache protocols
5. reconfigurable
Hash Routing (2)
• Each object resides in at most one sibling
• Client is immediately directed to the correct sibling
– Disk and RAM storage are effectively aggregated -> higher hit rates
Content Distribution Networks
• Thus far we have looked at caching
– Caches are provided by the ISP (network) or the client
[Figure: Forward Proxy Caches]
Edward Chow
Another Solution
• Push Content to the edges of the network
Edward Chow
Content Distribution Networks (CDNs)
• Be proactive and distribute the content closer to the clients
• The distribution infrastructure is not owned by the ISP, or the owners of content
– Third Party
• A CDN is a collection of interconnected cache servers that are scattered around the world which are able to serve a client
Basic CDN Operation
• When a request is sent to the server (origin), it is redirected to another server (proxy cache server) which is closer and/or can serve faster
– The origin server must be able to determine the location of the client and find the appropriate proxy cache server
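As a rough sketch of the redirection decision, assume the origin keeps a table of estimated latencies from each client region to each proxy cache server and redirects to the lowest one. The table, the region names, and the `example.net` host names are all hypothetical, and how a real CDN measures "closer" is not covered here.

```python
from typing import Dict


def choose_proxy(client_region: str,
                 latency_ms: Dict[str, Dict[str, float]]) -> str:
    """Pick the proxy with the lowest estimated latency for this region."""
    candidates = latency_ms[client_region]
    return min(candidates, key=candidates.get)


def redirect_url(client_region: str, path: str,
                 latency_ms: Dict[str, Dict[str, float]]) -> str:
    """URL the origin would redirect the client to."""
    return "http://" + choose_proxy(client_region, latency_ms) + path


# Hypothetical measurements (milliseconds) from region to proxy.
LATENCY = {
    "sydney": {"proxy-syd.example.net": 8.0, "proxy-lon.example.net": 290.0},
    "london": {"proxy-syd.example.net": 285.0, "proxy-lon.example.net": 6.0},
}

sydney_target = redirect_url("sydney", "/video.mpg", LATENCY)
london_target = redirect_url("london", "/video.mpg", LATENCY)
```

The hard part in practice is filling in the table: the origin has to infer the client's location and the proxies' current load from limited information.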
Generalized Cache/CDN (Internal View)
[Figure: bound client populations are served by leaf caches (e.g., ISP proxies), backed by interior caches and root caches (reverse proxies, CDN caches); a request routing function ƒ directs requests between them]
Jeff Chase
CDN Challenges
• Challenges are
– Which cache server to use (request routing function)?
– When/where/how to push/deliver the content (content distribution)?
– Where to put cache servers?
– Associated questions
• How many cache servers are needed?
• How about dynamic content?
How L4-Aware Systems Work
• By making intelligent switching decisions and forwarding frames based on TCP/UDP port information and IP source/destination addresses
• L4 switching = Session Switching
– examines client requests directed at the L4 switch
– multiplexes client requests across any server available to handle those requests
– passively measures application health and responsiveness to determine server availability
– stateful processing
• By combining the benefits of L4 software on a high-speed L2 switching platform
• By using this information to establish policy controls for how traffic is to be managed
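The session-switching behaviour can be sketched as a table keyed by the flow's addresses and ports. This is a simplified model rather than any vendor's implementation: new flows are spread round-robin over servers believed healthy, existing flows stay on their server (the stateful part), and `mark_down` stands in for the passive health measurement. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (source IP, source port, destination IP, destination port, protocol)
Flow = Tuple[str, int, str, int, str]


class L4Switch:
    """Toy session switch: sticky per-flow mapping over healthy servers."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)
        self.healthy: Set[str] = set(servers)
        self.sessions: Dict[Flow, str] = {}
        self._next = 0

    def mark_down(self, server: str) -> None:
        """Record a failed server and drop its sessions so they are re-assigned."""
        self.healthy.discard(server)
        self.sessions = {f: s for f, s in self.sessions.items() if s != server}

    def forward(self, flow: Flow) -> str:
        if flow in self.sessions:
            return self.sessions[flow]       # stateful: keep the session in place
        for _ in range(len(self.servers)):   # round-robin over the server list
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                self.sessions[flow] = server
                return server
        raise RuntimeError("no healthy server available")


switch = L4Switch(["s1", "s2", "s3"])
flow1: Flow = ("198.51.100.7", 40001, "203.0.113.10", 80, "tcp")
flow2: Flow = ("198.51.100.8", 40002, "203.0.113.10", 80, "tcp")
flow3: Flow = ("198.51.100.9", 40003, "203.0.113.10", 80, "tcp")

a = switch.forward(flow1)
b = switch.forward(flow2)
a_again = switch.forward(flow1)   # same flow, same server
switch.mark_down("s3")
c = switch.forward(flow3)         # skips the failed server
```

Unlike round-robin DNS, the switch sees every connection, so it can keep sessions intact and stop sending traffic to a server that has failed.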
Key Layer 4-based Applications
1. Local/Global Server load balancing
2. High availability applications
3. Web Cache Redirection
4. DNS redirection
5. Firewall Load Balancing
6. URL-based redirection, switching
E.g. Local Server Load Balancing
[Figure: Clients; HTTP, DNS, FTP, and Database Queries traffic balanced across servers]
• Scalable application processing capacity
– Add servers on-demand
• High availability
– Server/application health monitoring
– Backup and overflow servers
– Hot-standby switch configurations
• Tiers-of-service by servers
– Priority users/applications can be directed to premium servers
• Integrated switch and load balancer
– Flexibility
– Scalability
– Economy of scale
– Performance
Alternative Solution
• Intelligent DNS-based request routing has some tricky parts:
– Third-party CDNs contract with content providers (e.g., Web sites such as cnn.com) to serve a subset of their content.
• Resource-rich content, e.g., images, audio, video.
– To use DNS request routing, the CDN must assume DNS duties for the URLs that reference the content it serves.
– The content provider does not want to designate the CDN as the authoritative DNS server for its domain (e.g., cnn.com).
• Solution: make up new DNS domains for the content served by the CDN – URL rewriting
Jeff Chase
URL Rewriting
• Origin server dynamically generates pages to redirect clients to different content servers
• Page is dynamically rewritten with the IP address of a mirror server.
Pre-Caching
• Content is delivered to cache before requests are generated
• Used for highly distributed usage
• Caches can be updated during off-hours to reduce network load
• There are no standardised schemes so we will look at how this is done in practice
Just-In-Time
• Content is pulled from the origin server to the cache when a request is received from a client
• The object is delivered to the client and simultaneously stored on the cache for later use
• Can implement multicasting for efficient content transfer between caches
• Leased lines may be used between servers to ensure QoS
Example - Akamai (1)
• Akamai sells a content delivery service that looks like what a hosting company sells as Internet interconnection bandwidth
• When you "Akamaize" content, the content is subsequently served by Akamai’s system rather than from the origin server.
• Content provider pays Akamai on the basis of the peak load experienced (in Mbits/second - just like bandwidth).
• The net result is usually a significant improvement in access performance
Akamai (2)
• Have implemented a distributed network of servers on multiple service provider backbones across the Internet
– No central server that knows about all proxies and controls them
– They have put proxies in the networks of many service providers
– This way they hope that every client will be in the vicinity of at least one of them
• In order for servers to cooperate and exchange information (sort of pre-fetching) they have developed a dynamic discovery scheme called Name-Dropper
• The way they have picked these locations for the proxies is not known
Akamai – Operation (1)
• The size of the majority of web pages (70% according to Akamai) is driven not by the text they contain but by other embedded objects
– Get the text from the server and all the other objects from a nearby proxy
• Every page that is served by the Akamai network
1) is passed through a program that tags all embedded objects
2) When the client downloads this page and requests the embedded objects, the requests are directed to a nearby proxy
– Therefore every client gets a slightly different page
• This can be done in two ways
– Dynamically generating the proper code and feeding it to the client
– Intercepting packets that have references to the tagged objects and changing them on the fly.
Akamai – Operation (2)
Akamai – Operation (3)
• The Akamai scheme basically relies on a preprocessing phase where the large objects of a page are identified and tagged
• No other changes to the original page or the software for generating the page
• Then these objects are distributed to some proxies
– Map different objects to different proxies in order to balance the traffic
Akamai – Operation (4)
• Akamai has developed a technique for mapping objects to proxies which is called consistent hashing
• Ideally the client would decide which proxy contains the required information and can deliver it fastest
• But the client's software doesn't have the capability to perform such a function!
• Instead, this is performed during the resolving of names to IP addresses, using Akamai's DNS server
– The DNS server performs the hashing function for the client and returns as answer the IP address of the closest proxy
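Consistent hashing can be illustrated with the textbook "ring" construction. The slides do not describe Akamai's actual scheme, so this is only a generic sketch: SHA-256, the number of virtual points per proxy, and the proxy and object names are all assumptions.

```python
import bisect
import hashlib
from typing import List, Tuple


def _point(key: str) -> int:
    """Position of a key on the ring (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Each proxy owns several points on a ring; an object maps to the next point."""

    def __init__(self, proxies: List[str], replicas: int = 50) -> None:
        self._ring: List[Tuple[int, str]] = sorted(
            (_point(f"{proxy}#{i}"), proxy)
            for proxy in proxies
            for i in range(replicas)
        )
        self._points = [pt for pt, _ in self._ring]

    def proxy_for(self, obj: str) -> str:
        # First ring point clockwise from the object's position, wrapping around.
        idx = bisect.bisect_right(self._points, _point(obj)) % len(self._ring)
        return self._ring[idx][1]


objects = [f"/images/pic{i}.gif" for i in range(200)]
ring3 = HashRing(["p1", "p2", "p3"])
ring2 = HashRing(["p1", "p2"])          # the same ring after p3 is removed

before = {o: ring3.proxy_for(o) for o in objects}
after = {o: ring2.proxy_for(o) for o in objects}
moved = [o for o in objects if before[o] != after[o]]
```

The property that makes the scheme "consistent" is visible in `moved`: only objects that lived on the removed proxy change location, while everything else stays where it was.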
Domain Granularity and “Akamaizing”
– Akamai creates new domain names for each client content provider.
• e.g., a128.g.akamai.net
– Akamai’s DNS servers are authoritative for the new domains.
– The client content provider modifies its content so that embedded URLs reference the new domains.
• “Akamaize” content, e.g.: http://www.cnn.com/image-of-the-day.gif becomes http://a128.g.akamai.net/image-of-the-day.gif.
– Using multiple domain names for each client allows the CDN to further subdivide the content into groups.
• DNS sees only the requested domain name, but it can route requests for different domains independently.
Jeff Chase
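The rewriting step can be sketched with a small function that moves embedded image URLs to the CDN domain while leaving the page's own links alone. It reuses the host names from the example above; the function name `akamaize`, the list of file extensions, and the regular expression are hypothetical and far simpler than a real content-rewriting tool.

```python
import re
from typing import Tuple


def akamaize(html: str, origin_host: str, cdn_host: str,
             extensions: Tuple[str, ...] = (".gif", ".jpg", ".png")) -> str:
    """Point URLs of embedded objects at the CDN domain; leave other URLs alone."""
    pattern = re.compile(
        r"http://" + re.escape(origin_host) + r"(/[^\s\"'<>]*)"
    )

    def swap(match: "re.Match[str]") -> str:
        path = match.group(1)
        if path.lower().endswith(extensions):
            return "http://" + cdn_host + path   # embedded object: serve from CDN
        return match.group(0)                    # anything else stays on the origin

    return pattern.sub(swap, html)


page = (
    '<a href="http://www.cnn.com/index.html">home</a>'
    '<img src="http://www.cnn.com/image-of-the-day.gif">'
)
rewritten = akamaize(page, "www.cnn.com", "a128.g.akamai.net")
```

After rewriting, the browser resolves the new host name through the CDN's DNS servers, which is where the choice of proxy is made.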