Bjorn Landfeldt, The University of Sydney

ELEC 5501 Advanced Communication Networks

Web Caching and Content Distribution Networks



2

Outcomes

• Understand the drive for content replication

• Awareness of the differences and similarities between caching and CDNs

• Awareness of the current best practices and standards

• Understand how replication can help increase QoS


3

Problem

• Massive amounts of data stored on servers

• Server capacity and network capacity limited

• Expensive to go “long distances” over the Internet

• Solution: replicate content or cache content

• Today, the web - tomorrow any data


4

Overview of Web Caching

• Cache server (proxy)

• Why caching in the network?

• Hierarchical caching

• Problems with caching

Strongly based on Keith Ross’s Tutorial


5

Cache Server

• A cache is both a server and a client

[Figure: two clients exchange HTTP requests and responses with the cache server (proxy); the cache server in turn exchanges HTTP requests and responses with two origin servers across the ISP boundary.]
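The dual role of the cache can be sketched in a few lines. This is a minimal in-memory illustration, not a real HTTP proxy: the class `CacheServer`, the function `fake_origin`, and the example URL are hypothetical names introduced here, and the origin is simulated by a plain function call.

```python
from typing import Callable, Dict, List


class CacheServer:
    """Toy proxy cache: a server towards clients, a client towards the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin      # client role: how the origin is reached
        self._store: Dict[str, bytes] = {}   # cached objects, keyed by URL
        self.hits = 0
        self.misses = 0

    def handle_request(self, url: str) -> bytes:
        """Server role: answer a client request, from the store if possible."""
        if url in self._store:
            self.hits += 1
            return self._store[url]
        self.misses += 1
        body = self._fetch(url)              # miss: act as a client to the origin
        self._store[url] = body
        return body


origin_calls: List[str] = []


def fake_origin(url: str) -> bytes:
    """Stand-in for the origin server (assumption: no real network I/O)."""
    origin_calls.append(url)
    return f"content of {url}".encode()


proxy = CacheServer(fake_origin)
first = proxy.handle_request("http://example.org/index.html")
second = proxy.handle_request("http://example.org/index.html")
```

Only the first request reaches the origin; the second is served from the cache's own store.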


6

Why Cache in the Network

• Reduce latency by avoiding slow links between client and origin server
  – Low bandwidth links
  – Congested links

• Reduce traffic on links
  – Between institutional network and regional ISP
  – Reduce traffic on transoceanic links

• Spread load of overloaded origin server to caches
  – An Internet dense with caches allows a content provider to offer high performance distribution at low cost
    • Inexpensive server
    • Low-bandwidth Internet connection


7

Implications of Cache in the Network

• Network caching complements client caching

• Paradigm shift in traffic engineering
  – Bandwidth is no longer the only shared resource; now there is bandwidth and storage


8

Hierarchical Caching

• Each ISP can have a cache

• ISPs higher in the hierarchy have
  – Larger user populations
  – Higher hit rates

[Figure: clients connect to a local ISP, local ISPs to regional ISPs, and regional ISPs to a national ISP that reaches the origin servers; a cache is marked at each ISP level.]


9

Cache Chaining

• Hierarchies use cache chaining

• All communications along chain can be over HTTP

[Figure: client → cache → server. User configures browser to point to the cache.]

[Figure: client → cache → cache → server. User configures browser to point to the 1st cache, and the 1st cache points to the 2nd cache, and so on.]
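The chaining behaviour can be sketched as a cache that forwards each miss to its parent. This is an illustrative sketch only: `ChainedCache`, the dictionary standing in for the origin server, and the example URL are assumptions made here, and real chains would carry each hop over HTTP.

```python
from typing import Dict, Optional, Tuple


class ChainedCache:
    """A cache that resolves misses through its parent, or the origin at the top."""

    def __init__(self, name: str,
                 parent: Optional["ChainedCache"] = None,
                 origin: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.parent = parent
        self.origin = origin or {}
        self.store: Dict[str, str] = {}

    def get(self, url: str) -> Tuple[str, str]:
        """Return (body, name of the node that supplied the object)."""
        if url in self.store:
            return self.store[url], self.name
        if self.parent is not None:
            body, source = self.parent.get(url)       # next hop in the chain
        else:
            body, source = self.origin[url], "origin"
        self.store[url] = body                        # keep a copy at every level
        return body, source


origin = {"http://example.org/a": "object A"}         # hypothetical origin content
second = ChainedCache("second", origin=origin)
first = ChainedCache("first", parent=second)
other = ChainedCache("other", parent=second)          # another 1st-level cache

r1 = first.get("http://example.org/a")   # travels the whole chain
r2 = other.get("http://example.org/a")   # hit at the shared 2nd cache
r3 = first.get("http://example.org/a")   # hit at the 1st cache
```

The second-level cache sees the misses of both first-level caches, which is why caches higher in a hierarchy serve larger populations.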


10

Cooperative Caching

• Multiple sibling caches within a single ISP

• One or more of the siblings could contain the requested object

• Cooperation
  – ICP: siblings send messages to each other to find a copy of object
  – CARP: URL space partitioned

[Figure: clients connected to a group of sibling caches.]
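The ICP style of cooperation can be sketched as follows. Real ICP is a separate UDP-based query protocol between caches; this sketch replaces the messages with in-process method calls, and `SiblingCache`, `has`, and the example data are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Tuple


class SiblingCache:
    """Toy cache that asks its siblings before going to the origin."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}
        self.siblings: List["SiblingCache"] = []

    def has(self, url: str) -> bool:
        """Stands in for an ICP query answered with a hit or a miss."""
        return url in self.store

    def get(self, url: str, origin: Dict[str, str]) -> Tuple[str, str]:
        """Return (body, where it came from)."""
        if url in self.store:
            return self.store[url], "local"
        for sibling in self.siblings:
            if sibling.has(url):                 # a sibling reports a hit
                body = sibling.store[url]
                self.store[url] = body           # object is now stored twice
                return body, sibling.name
        body = origin[url]                       # every sibling missed
        self.store[url] = body
        return body, "origin"


origin = {"http://example.org/a": "object A"}    # hypothetical origin content
c1, c2 = SiblingCache("c1"), SiblingCache("c2")
c1.siblings, c2.siblings = [c2], [c1]

r1 = c1.get("http://example.org/a", origin)
r2 = c2.get("http://example.org/a", origin)
```

After the second request both siblings hold a copy of the object. CARP avoids this duplication by partitioning the URL space.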


11

Caching Challenges

• Cache consistency:
  – Cache often must guess whether a stored object is stale or fresh

• Dynamic content:
  – Caches shouldn't cache outputs of CGI scripts

• Hit counts and personalization:
  – Caches can cause hit count calculations and cookie transactions to fail

• Less-savvy users and privacy-concerned users:
  – How do you get a user to point his/her browser to a cache?

• Access control:
  – How do you make sure that the seller of the documents gets paid?
  – Legal and security restrictions

• Enormous multimedia files:
  – Disk storage is increasing at a rate of approx 60% a year still!
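For the cache consistency point, the "guess" can be sketched as a freshness check. The sketch assumes the common HTTP heuristic of treating an object as fresh for a fraction (often 10%) of the time since it was last modified, unless the origin supplied an explicit lifetime. The function name, parameters, and numbers are illustrative assumptions, not taken from the slides.

```python
from typing import Optional


def is_fresh(now: float, fetched_at: float, last_modified: float,
             max_age: Optional[float] = None,
             heuristic_fraction: float = 0.1) -> bool:
    """Guess whether a cached object may still be served without revalidation.

    All times are in seconds on the same clock.
    """
    age = now - fetched_at
    if max_age is not None:
        # The origin stated an explicit lifetime: no guessing needed.
        return age < max_age
    # Otherwise guess: an object that has not changed for a long time is
    # assumed unlikely to change soon.
    lifetime = heuristic_fraction * (fetched_at - last_modified)
    return age < lifetime


# Object last modified at t=0 and fetched at t=1000: guessed lifetime is 100 s.
fresh_soon = is_fresh(now=1050, fetched_at=1000, last_modified=0)
fresh_late = is_fresh(now=1200, fetched_at=1000, last_modified=0)
fresh_explicit = is_fresh(now=1200, fetched_at=1000, last_modified=0, max_age=300)
```

A stale result does not mean the object changed, only that the cache should revalidate it with the origin before serving it again.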


12

Replication Caching

• When an FTP or HTTP server is very busy it can replicate itself
  – Load balancing distributes the load across all servers

• Round-robin DNS
  – Maps a single host name to multiple servers with different IP addresses
  – DNS rotates the IP addresses each time it receives a request

• Re-directions
  – Webserver returns a re-direction to a parallel server
    • Can be done with a 301 Moved Permanently and Location: header in response message
    • Main server re-directs request to a pool of servers
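The round-robin DNS rotation can be sketched as below. `RoundRobinDNS`, the host name, and the addresses (taken from a documentation address range) are hypothetical; a real name server would also handle TTLs and record types.

```python
from collections import deque
from typing import Deque, Dict, List


class RoundRobinDNS:
    """Toy name server that rotates the address list on every query."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self._records: Dict[str, Deque[str]] = {
            name: deque(addrs) for name, addrs in records.items()
        }

    def resolve(self, name: str) -> List[str]:
        addrs = self._records[name]
        answer = list(addrs)     # clients typically use the first address
        addrs.rotate(-1)         # the next query starts with the next server
        return answer


dns = RoundRobinDNS({"www.example.org": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]})
answers = [dns.resolve("www.example.org") for _ in range(4)]
```

The rotation is blind: nothing in `resolve` looks at server load or checks whether a server is still alive.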


13

Round Robin DNS Advantages & Disadvantages

• Advantages
  – Inexpensive
  – Easy to set up
  – Application and OS independent
  – Requires no resources from the application servers

• Disadvantages
  – Doesn't monitor server load
  – Doesn't remove failed servers from the rotation
  – Won't work well if servers are of different size/power
  – Doesn't work well if session state must be maintained
  – DNS caching causes problems


14

DNS Redirection

• Some intermediary intercepts the request, and directs it to a selected site.
  – Layer 4-7 switching? E.g., look at URL or server IP address.
  – Interpose on the binding procedure, before the client sends the request itself.
    • Smart clients, Active Names, RPC binding, or DNS lookup

• Most third-party CDNs are based on DNS servers that select the cache/replica site on DNS lookup for the request.
  – Akamai, Digital Island, Web hosting providers (e.g., Exodus), etc.
  – Like DNS-RR... but smarter...


15

Pre-fetching Cache

• Retrieves specific pages or sites at regular intervals
  – Can also pre-fetch pages that are outside the referenced site
  – Pre-fetch their referenced pages
  – Can pre-fetch a hierarchy of pages across a number of sites

• May also perform periodic up-to-date checks on all documents in the cache


16

Cache Effectiveness

• Previous work has shown that hit rate increases with population size

• However, single proxy caches have practical limits
  – Load, network topology, organizational constraints

• One technique to scale the client population is to have proxy caches cooperate


17

Hierarchical Caches

• Idea: place caches at exchange or switching points in the network, and cache at each level of the hierarchy.

• Resolve misses through the parent.

[Figure: several groups of clients (downstream) reach the origin Web site (upstream) across the Internet, passing through caches at each level of the hierarchy.]


18

Cache Array Routing Protocol (CARP)

• A set of caching proxies can effectively function as a single logical cache

• Uses a hash function to partition the URLs across caches

• All queries are done over HTTP
  – No new application layer protocol such as ICP
  – Can take advantage of HTTP/1.1

• Implemented in Microsoft and Netscape cache server products


19

Operation

• A client trying to locate a cached resource targets the request to the appropriate cache by applying a hash function

• The hash function uses the request URL and the identity of the proxy members to construct a resolution path
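Combining the request URL with the identity of each proxy member can be sketched as a "highest score wins" selection. This is only an illustration of the idea: the real CARP specification defines its own hash and combination functions, whereas the sketch uses MD5 as a stand-in, and `score`, `pick_member`, and the member names are hypothetical.

```python
import hashlib
from typing import List


def score(url: str, member: str) -> int:
    """Combine the URL with one member's identity into a single number."""
    digest = hashlib.md5(f"{member}|{url}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_member(url: str, members: List[str]) -> str:
    """Route the request to the member with the highest combined score."""
    return max(members, key=lambda m: score(url, m))


members = ["proxy-a", "proxy-b", "proxy-c"]
urls = [f"http://example.org/page{i}" for i in range(200)]

before = {u: pick_member(u, members) for u in urls}
# Remove one member from the array and resolve every URL again.
after = {u: pick_member(u, ["proxy-a", "proxy-b"]) for u in urls}
moved = [u for u in urls if before[u] != after[u]]
```

Only URLs that were assigned to the removed member change location, which is what makes an array of this kind reconfigurable without flushing every cache.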


20

Hash Routing Overview

• Choose a hash function h() which maps URLs to a hash space
  – Let the hash space be {1,…,60}
  – Let h() be the sum of the ASCII representation of the characters in the URL, modulo 60

• Partition hash space: one set for each sibling
  – Client hashes URL, determines set to which hashed URL belongs and sends request to corresponding sibling
  – Set for cache 1 = {1,…,30}, set for cache 2 = {31,…,60}
  – h(URLa) = 35, send request to cache 2
  – If sibling does not have object, obtain from origin server
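The hash routing steps above can be sketched directly. One assumption is needed: a sum modulo 60 yields values in {0,…,59}, so the sketch adds 1 to land in the stated hash space {1,…,60}. The function names `h` and `route` and the example URL are illustrative.

```python
def h(url: str) -> int:
    """Sum of the character codes of the URL, modulo 60, shifted into 1..60.

    The +1 shift is an assumption made here so that the result matches the
    hash space {1,...,60}.
    """
    return sum(ord(ch) for ch in url) % 60 + 1


def route(url: str) -> str:
    """Cache 1 owns {1,...,30}; cache 2 owns {31,...,60}."""
    return "cache 1" if h(url) <= 30 else "cache 2"


url = "http://www.example.org/index.html"
print(h(url), route(url))
```

Every client that applies the same `h` and the same partition sends a given URL to the same sibling, without any messages between the caches.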


21

Hashing: Cache Array Routing Protocol (CARP)

[Figure: a request “GET www.hotsite.com” passes through a hash function and is sent to one of four caches in front of the Internet, which own the ranges a-f, g-p, q-u and v-z.]

Advantages
1. single-hop request resolution
2. no redundant caching of objects
3. allows client-side implementation
4. no new cache-cache protocols
5. reconfigurable


22

Hash Routing (2)

• Each object resides in at most one sibling

• Client is immediately directed to the correct sibling
  – Disk and RAM storage are effectively aggregated -> higher hit rates


23

Content Distribution Networks

• Thus far we have looked at caching
  – Caches are provided by the ISP (network) or the client

[Figure: forward proxy caches.]

Edward Chow


24

Another Solution

• Push content to the edges of the network

Edward Chow


25

Content Distribution Networks (CDNs)

• Be proactive and distribute the content closer to the clients

• The distribution infrastructure is not owned by the ISP, or the owners of content
  – Third party

• A CDN is a collection of interconnected cache servers scattered around the world, any of which is able to serve a client


26

Basic CDN Operation

• When a request is sent to the server (origin), it is redirected to another server (proxy cache server) which is closer and/or can serve faster
  – The origin server must be able to determine the location of the client and find the appropriate proxy cache server
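The selection step can be sketched as choosing the proxy with the lowest estimated cost for the client's location and answering with a redirect. Everything concrete here is a hypothetical assumption: the `Proxy` records, host names, round-trip times, the cost formula (round-trip time inflated by load), and the use of a 302 response.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Proxy:
    host: str
    rtt_ms: Dict[str, float]   # estimated round-trip time per client region
    load: float                # current utilisation, 0.0 (idle) to 1.0 (full)


def choose_proxy(client_region: str, proxies: List[Proxy]) -> Proxy:
    """Pick the proxy that is closest and/or least loaded (illustrative cost)."""
    return min(proxies, key=lambda p: p.rtt_ms[client_region] * (1.0 + p.load))


def redirect(client_region: str, path: str,
             proxies: List[Proxy]) -> Tuple[int, Dict[str, str]]:
    """Origin-side answer: a redirect pointing at the chosen proxy."""
    proxy = choose_proxy(client_region, proxies)
    return 302, {"Location": f"http://{proxy.host}{path}"}


proxies = [
    Proxy("proxy-syd.example.net", {"au": 10.0, "eu": 300.0}, load=0.2),
    Proxy("proxy-fra.example.net", {"au": 300.0, "eu": 15.0}, load=0.1),
]
au_answer = redirect("au", "/video.mpg", proxies)
eu_answer = redirect("eu", "/video.mpg", proxies)
```

In practice the hard part is the input to `choose_proxy`: the origin has to infer the client's location and the state of each proxy.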


27

Generalized Cache/CDN (Internal View)

[Figure: bound client populations reach leaf caches (e.g., ISP proxies); a request routing function ƒ directs requests to interior caches (root caches, reverse proxies, CDN caches).]

Jeff Chase


28

CDN Challenges

• Challenges are
  – Which cache server to use (request routing function)?
  – When/where/how to push/deliver the content (content distribution)?
  – Where to put cache servers?
  – Associated questions
    • How many cache servers are needed?
    • How about dynamic content?


29

How L4-Aware Systems Work

• By making intelligent switching decisions and forwarding frames based on TCP/UDP port information and IP source/destination addresses

• L4 switching = session switching
  – Examines client requests directed at the L4 switch
  – Multiplexes client requests across any server available to handle those requests
  – Passively measures application health and responsiveness to determine server availability
  – Stateful processing

• By combining the benefits of L4 software with a high-speed L2 switching platform

• By using this information to establish policy controls for how traffic is to be managed
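Session switching can be sketched as a table keyed by the addresses and ports of each flow. This is a simplified model under stated assumptions: `L4Switch`, the server names, and the round-robin choice among healthy servers are hypothetical, and health is set explicitly via `mark_down` rather than measured passively.

```python
from typing import Dict, List, Set, Tuple

FlowKey = Tuple[str, int, str, int, str]   # src IP, src port, dst IP, dst port, protocol


class L4Switch:
    """Toy stateful L4 switch: one server per session, chosen on first packet."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)
        self.healthy: Set[str] = set(servers)
        self.sessions: Dict[FlowKey, str] = {}   # the state in "stateful"
        self._next = 0

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def _pick(self) -> str:
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no server available")
        server = candidates[self._next % len(candidates)]
        self._next += 1
        return server

    def forward(self, src_ip: str, src_port: int,
                dst_ip: str, dst_port: int, proto: str = "tcp") -> str:
        key = (src_ip, src_port, dst_ip, dst_port, proto)
        server = self.sessions.get(key)
        if server is None or server not in self.healthy:
            server = self._pick()                # new session, or failed server
            self.sessions[key] = server
        return server


switch = L4Switch(["s1", "s2"])
a1 = switch.forward("10.0.0.1", 5000, "203.0.113.10", 80)
a2 = switch.forward("10.0.0.1", 5000, "203.0.113.10", 80)   # same session
b1 = switch.forward("10.0.0.2", 5001, "203.0.113.10", 80)   # new session
switch.mark_down("s1")
a3 = switch.forward("10.0.0.1", 5000, "203.0.113.10", 80)   # re-mapped
```

Packets of an existing session keep going to the same server; only new sessions, or sessions whose server has failed, trigger a fresh choice.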


30

Key Layer 4-based Applications

1. Local/Global Server load balancing
2. High availability applications
3. Web Cache Redirection
4. DNS redirection
5. Firewall Load Balancing
6. URL-based redirection, switching


31

E.g. Local Server Load Balancing

[Figure: clients issue HTTP, DNS, FTP and database queries, which are distributed across HTTP, DNS and FTP servers.]

• Scalable application processing capacity
  – Add servers on-demand

• High availability
  – Server/application health monitoring
  – Backup and overflow servers
  – Hot-standby switch configurations

• Tiers-of-service by servers
  – Priority users/applications can be directed to premium servers

• Integrated switch and load balancer
  – Flexibility
  – Scalability
  – Economy of scale
  – Performance


32

Alternative Solution

• Intelligent DNS-based request routing has some tricky parts:
  – Third-party CDNs contract with content providers (e.g., Web sites such as cnn.com) to serve a subset of their content.
    • Resource-rich content, e.g., images, audio, video.
  – To use DNS request routing, the CDN must assume DNS duties for the URLs that reference the content it serves.
  – The content provider does not want to designate the CDN as the authoritative DNS server for its domain (e.g., cnn.com).

• Solution: make up new DNS domains for the content served by the CDN – URL rewriting

Jeff Chase


33

URL Rewriting

• Origin server dynamically generates pages to redirect clients to different content servers

• Page is dynamically rewritten with the IP address of a mirror server.


34

Pre-Caching

• Content is delivered to cache before requests are generated

• Used for highly distributed usage

• Caches can be updated during off-hours to reduce network load

• There are no standardised schemes so we will look at how this is done in practice


35

Just-In-Time

• Content is pulled from the origin server to the cache when a request is received from a client

• The object is delivered to the client and simultaneously stored on the cache for later use

• Can implement multicasting for efficient content transfer between caches

• Leased lines may be used between servers to ensure QoS


36

Example - Akamai (1)

• Akamai sells a content delivery service that looks like what a hosting company sells as Internet interconnection bandwidth

• When you "Akamaize" content, the content is subsequently served by Akamai’s system rather than from the origin server.

• Content provider pays Akamai on the basis of the peak load experienced (in Mbits/second - just like bandwidth).

• The net result is usually a significant improvement in access performance


37

Akamai (2)

• Akamai has implemented a distributed network of servers on multiple service provider backbones across the Internet
  – No central server that knows about all proxies and controls them
  – They have put proxies in the networks of many service providers
  – This way they hope that every client will be in the vicinity of at least one of them

• In order for servers to cooperate and exchange information (sort of pre-fetching) they have developed a dynamic discovery scheme called Name-Dropper

• The way they have picked these locations for the proxies is not known


38

Akamai – Operation (1)

• The size of the majority of web pages (70% according to Akamai) is driven not by the text they contain but by other embedded objects
  – Get the text from the server and all the other objects from a nearby proxy

• Every page that is served by the Akamai network
  1) is passed through a program that tags all embedded objects
  2) when the client downloads this page, its requests for the embedded objects are directed to a nearby proxy
  – Therefore every client gets a slightly different page

• This can be done in two ways
  – Dynamically generating the proper code and feeding it to the client
  – Intercepting packets that have references to the tagged objects and changing them on the fly


40


• The Akamai scheme basically relies on a preprocessing phase where the large objects of a page are identified and tagged

• No other changes to the original page or the software for generating the page

• Then these objects are distributed to some proxies
  – Map different objects to different proxies in order to balance the traffic


41


• Akamai has developed a technique for mapping objects to proxies which is called consistent hashing

• Ideally, the client would decide which proxy holds the required object and can deliver it fastest

• However, the client's software doesn't have the capability to perform such a function!

• Instead, this is performed while names are resolved to IP addresses by Akamai's DNS server
  – The DNS server performs the hashing function for the client and returns as answer the IP address of the closest proxy
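The general consistent hashing technique can be sketched with a hash ring. This illustrates the textbook technique only, not Akamai's actual implementation: `HashRing`, the use of SHA-1, the number of virtual points per proxy, and the proxy names are all assumptions. The sketch covers only the object-to-proxy mapping, not the choice of the closest proxy.

```python
import bisect
import hashlib
from typing import List, Tuple


def _h(key: str) -> int:
    """Map a string to a point on the ring (64-bit integer)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class HashRing:
    """Each proxy owns several points on a ring; an object goes to the
    first proxy point at or after the object's own hash, wrapping around."""

    def __init__(self, proxies: List[str], replicas: int = 50) -> None:
        self._ring: List[Tuple[int, str]] = sorted(
            (_h(f"{proxy}#{i}"), proxy)
            for proxy in proxies for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, obj: str) -> str:
        i = bisect.bisect_right(self._points, _h(obj)) % len(self._ring)
        return self._ring[i][1]


objects = [f"/images/img{i}.gif" for i in range(300)]
ring3 = HashRing(["p1", "p2", "p3"])
ring2 = HashRing(["p1", "p2"])          # the same ring after p3 is removed

before = {o: ring3.lookup(o) for o in objects}
after = {o: ring2.lookup(o) for o in objects}
moved = [o for o in objects if before[o] != after[o]]
```

When a proxy leaves, only the objects it held are remapped; objects on the remaining proxies stay where they are, so their cached copies remain useful.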


42

Domain Granularity and “Akamaizing”

– Akamai creates new domain names for each client content provider.

• e.g., a128.g.akamai.net

– Akamai’s DNS servers are authoritative for the new domains.

– The client content provider modifies its content so that embedded URLs reference the new domains.

• “Akamaize” content, e.g.: http://www.cnn.com/image-of-the-day.gif becomes http://a128.g.akamai.net/image-of-the-day.gif.

– Using multiple domain names for each client allows the CDN to further subdivide the content into groups.

• DNS sees only the requested domain name, but it can route requests for different domains independently.
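The content modification step can be sketched as a rewrite of embedded URLs. The host names come from the example above; the function `akamaize`, the regular expression, and the choice of file extensions treated as embedded objects are hypothetical and far simpler than a production rewriting tool.

```python
import re


def akamaize(html: str, origin_host: str = "www.cnn.com",
             cdn_host: str = "a128.g.akamai.net") -> str:
    """Point embedded image URLs at the CDN domain; leave page links alone."""
    pattern = re.compile(
        r'(src|href)="http://' + re.escape(origin_host) +
        r'/([^"]+\.(?:gif|jpg|png))"'
    )
    return pattern.sub(
        lambda m: f'{m.group(1)}="http://{cdn_host}/{m.group(2)}"', html
    )


page = (
    '<a href="http://www.cnn.com/index.html">home</a>'
    '<img src="http://www.cnn.com/image-of-the-day.gif">'
)
rewritten = akamaize(page)
```

The HTML page itself is still fetched from the origin, while the rewritten image URL now resolves through DNS servers controlled by the CDN.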

Jeff Chase
