Web Caching

Dr. Yingwu Zhu

What is Web Caching

• Introducing proxy servers at certain points in the network to cache Web documents for faster client access.

• Comparable to the cache memory in a computer system

Proxy Cache

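[Figure: clients send requests (Req.) to the proxy, which passes them on to the servers when needed; replies (Reply) travel back through the proxy.]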

How?

• Clients send requests to the proxy.

• If the requested document is in its cache, the proxy serves the request from its cache.

• Otherwise, the proxy forwards the request to the server.

• The server replies to the request through the proxy (the proxy keeps a copy of the requested document).
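
The four steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the class name ProxyCache, the fetch_from_server callback, and the fake_origin helper are hypothetical, and the cache never expires or evicts anything.

```python
from typing import Callable, Dict, List


class ProxyCache:
    """Toy proxy: serve from the cache on a hit, otherwise fetch and keep a copy."""

    def __init__(self, fetch_from_server: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_server      # hypothetical origin-fetch callback
        self._store: Dict[str, bytes] = {}   # URL -> cached document body
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:               # hit: answer from the cache
            self.hits += 1
            return self._store[url]
        self.misses += 1                     # miss: forward the request to the server
        body = self._fetch(url)
        self._store[url] = body              # keep a copy for later requests
        return body


origin_calls: List[str] = []


def fake_origin(url: str) -> bytes:
    """Stand-in for a real web server; records every request it receives."""
    origin_calls.append(url)
    return f"<html>{url}</html>".encode()


proxy = ProxyCache(fake_origin)
first = proxy.get("/index.html")    # miss: goes to the origin
second = proxy.get("/index.html")   # hit: served from the proxy's copy
```

After the two calls, the origin has seen the URL only once; the second reply came from the proxy.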

Why Web Caching?

• HTTP traffic has grown rapidly to form the largest part of Internet traffic, which causes more network congestion and server unavailability.

• The number of static Web pages almost doubles every year

• Some old data
– Number of unique pages: 800M < X < 2.2B
– Number of unique web sites: 8,500,000
– Static pages: 30% - 40%
– Pages revisited: 80%
– Expected hit rate: 24% - 32%

Why Web Caching?

• Bandwidth

• Latency

• Performance = Response Time

• Server Load

• Failure Redundancy

Expected Gains

• Bandwidth saving

• Improving content availability

• Improving web server availability

• Server load balancing

• Reducing user-perceived latency

What: Content and Protocols

• HTTP 1.0: basic protocol
– Send a request based on a fixed number of verbs
  • GET
  • HEAD
  • POST
– Receive response, meta-data, content

What: Content and Protocols

• HTTP Request

Request = Simple-Request | Full-Request

Simple-Request = "GET" SP Request-URI CRLF

Full-Request = Request-Line
               *( General-Header
                | Request-Header
                | Entity-Header )
               CRLF
               [ Entity-Body ]

What: Content and Protocols

• Example: GET /pub/www/index.html HTTP/1.0

• Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 19 Oct 2002 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private

What: Content and Protocols

• Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT

• Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Thu, 13 Jul 2000 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private

What: Content and Protocols

• Example “if-modified-since”:

GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT

• Response:

HTTP/1.1 304 Not Modified
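
The two if-modified-since outcomes (200 with a full body, or 304 Not Modified) come down to one date comparison on the server side. The sketch below shows that decision in Python; the function name conditional_status and the sample modification dates are assumptions for illustration, and a real server applies further rules.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the resource is unchanged since the client's date, else 200."""
    if if_modified_since is None:
        return 200                      # unconditional GET: send the full body
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                      # unparseable date: ignore the condition
    return 304 if last_modified <= since else 200


# Header value taken from the example request; the two modification times are made up.
ims = "Sat, 19 Oct 2002 19:43:31 GMT"
unchanged = parsedate_to_datetime("Fri, 18 Oct 2002 08:00:00 GMT")
changed = parsedate_to_datetime("Sun, 20 Oct 2002 08:00:00 GMT")

print(conditional_status(unchanged, ims))   # resource older than the client's copy
print(conditional_status(changed, ims))     # resource modified after the client's copy
```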

HTTP support for caching

• Conditional requests (IMS)

• Servers can set expires and max-age

• Request indirection: application-level routing

• Range requests, entity tags

• Cache-control header
– Requests: min-fresh, max-stale, no-transform
– Responses: must-revalidate, public, private, no-cache
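
To make the Cache-control directives concrete, here is a minimal Python sketch that splits a header value into directives and asks whether a shared (proxy) cache may keep the response. The function names are hypothetical, the comma split is deliberately naive (it does not handle quoted lists), and no-store is a standard directive that is not listed on the slide.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):        # naive: ignores commas inside quoted strings
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def shared_cache_may_store(response_cache_control: str) -> bool:
    """A proxy shared by many users should not keep private or no-store responses."""
    d = parse_cache_control(response_cache_control)
    return "private" not in d and "no-store" not in d


print(parse_cache_control("max-age=3600, must-revalidate"))
print(shared_cache_may_store("private"))   # the example responses above use this value
```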

Where

[Figure: cache locations along the request path: browser caches, an intranet, a local ISP cache behind an L4 switch, ISP and CDN caches, and reverse proxies in front of the content servers in a data center.]

Cache Types

• Proxy Caching

• Reverse Proxy Caching

• Transparent Caching

• Adaptive Caching

• Push Caching

• Active Caching

Proxy Caching

• Harvest/Squid

• Provide web content for a fixed user base

• Deployed at the network edges (company or institutional gateway or firewall hosts)

• Standalone operation

• Manual configuration in web browsers

• Commodity product/technology

• Single point of failure

Reverse Proxy Caching

• Designed to offload duties from one or more specific servers

• Data size is limited to the size of the static content on the server

• Challenge is fast, disk-less operation

• Cache consistency is easy

Transparent Caching

• Intercept HTTP requests and redirect them to web cache servers or cache clusters

• No client configuration

• Violates end-to-end paradigm

– Client thinks it is talking directly to server

– Server thinks it is talking to cache

• Implemented with a Layer 4 (L4) switch

– A Layer 4 switch makes switching decisions based on the TCP or UDP port number, e.g., port 80

Adaptive Caching

• ISP-level caching, global data placement optimization

• Multiple cooperating distributed caches

• Operate as a cache mesh based on content demand

• Cache Group Management Protocol
– How meshes are formed
– How individual caches join/leave the meshes

• Content Routing Protocol sends requests to the appropriate cache within the meshes

• Uses distributed cache meshes to solve the hot spot problem

• Caches dynamically join and leave the groups based on content demand

• Administrative boundaries must be relaxed

Push Caching

• Keep data close to the clients requesting it

• Send the data out proactively

• Assumption: we are able to launch caches that may cross administrative boundaries

• Incurs cost (storage and transmission)

Active Caching

• Applies caching to dynamic documents

• 30% of client HTTP requests contain cookies

• The server provides the cache with the objects and any associated cache applets

– Uses an applet inside the cache to customize dynamic pages on the fly

Cache Placement/Deployment

• Close to clients/content consumers
– Proxy caching
– Transparent proxy caching

• Close to servers/content providers
– Improve access to logical sets of data
– Delay-sensitive data: video, audio
– Reverse proxy caching
– Push caching

• Network choke points: strategic deployment
– Adaptive caching
– Problem with administrative control

Zipf’s Law vs. Web Access

• Zipf’s Law

• Web Access

• Caching?

Zipf’s Law

• Zipf’s law: the frequency P_i of an event as a function of its rank i is a power-law function:

$P_i = \Omega / i^{\alpha}$, where $\alpha \le 1$

Zipf’s Law

• Observed to be true for
– Frequency of written words in English texts
– Population of cities
– Income of a company as a function of rank

Zipf’s Law vs. Web Access

• For a given server, page access by rank follows Zipf’s law

• Web requests from a fixed population of users follow Zipf’s law with 0.64 < α < 0.83

Observations

• The top 1% of all documents account for 20% - 35% of proxy requests

• The top 10% account for 45% - 55% of requests

• It takes 25% to 40% of all documents to account for 70% of requests

• It takes 70% to 80% of all documents to account for 90% of requests
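
One way to get a feel for these numbers is to compute, for an idealized Zipf distribution, what share of requests goes to the most popular fraction of documents. The Python sketch below does that; the function name zipf_request_share and the catalogue size of 100,000 documents are assumptions. Real traces also contain one-time references, so measured shares can differ from this model.

```python
def zipf_request_share(num_docs: int, alpha: float, top_fraction: float) -> float:
    """Share of requests that go to the most popular `top_fraction` of documents.

    Popularity of the document at rank i is taken to be proportional to 1 / i**alpha.
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_docs + 1)]
    top_n = max(1, round(num_docs * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


# Sweep the alpha range quoted for Web requests over a hypothetical catalogue.
for alpha in (0.64, 0.83):
    for fraction in (0.01, 0.10, 0.25):
        share = zipf_request_share(100_000, alpha, fraction)
        print(f"alpha={alpha:.2f}: top {fraction:.0%} of documents -> {share:.1%} of requests")
```

The share grows quickly for the first few percent of documents and slowly afterwards, which is why a cache that is small relative to the document set can still serve a useful fraction of requests.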

Zipf’s Law and Caching

Discussion

• How does this help in cache design?

Basic caching algorithm

Pages may be

• Fresh: up-to-date

• Expired: current date > expiration date

• Stale: “old”

Basic caching algorithm - #2

if (page is in the cache)
    if (page is expired or stale)
        get from server with if-modified-since
        if not modified, get from cache
        else get from server
    else get from cache
else get from server
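
The lookup logic can be written as a small Python function. This is a sketch under stated assumptions: the Entry record, the Origin callback signature, and the use of plain numeric timestamps are all hypothetical, and a "not modified" reply here does not extend the cached copy's expiration time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    body: bytes
    last_modified: float   # timestamp the origin reported for this copy
    expires: float         # after this time the copy must be revalidated


# Hypothetical origin interface: takes a URL and an optional if-modified-since
# timestamp; returns None for "not modified", else (body, last_modified, expires).
Origin = Callable[[str, Optional[float]], Optional[Tuple[bytes, float, float]]]


def lookup(cache: Dict[str, Entry], url: str, now: float, origin: Origin) -> bytes:
    entry = cache.get(url)
    if entry is not None:
        if now <= entry.expires:
            return entry.body                       # fresh: serve from the cache
        reply = origin(url, entry.last_modified)    # expired: conditional request
        if reply is None:
            return entry.body                       # not modified: cached copy still valid
    else:
        reply = origin(url, None)                   # not cached: plain request
    body, last_modified, expires = reply            # assumes a plain request never returns None
    cache[url] = Entry(body, last_modified, expires)
    return body


def demo_origin(url: str, since: Optional[float]) -> Optional[Tuple[bytes, float, float]]:
    """Toy origin whose only document was last modified at t=100 and expires at t=160."""
    if since is not None and since >= 100.0:
        return None
    return (b"v1", 100.0, 160.0)


demo_cache: Dict[str, Entry] = {}
print(lookup(demo_cache, "/a", 110.0, demo_origin))   # miss, fetched from the origin
print(lookup(demo_cache, "/a", 200.0, demo_origin))   # expired, revalidated, served from cache
```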

Basic caching algorithm - #3

if cache has space
    store the file
else
    1. Delete expired from cache
    2. Delete stale from cache
    3. Delete LRU from cache
    4. Delete largest/smallest from cache?
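
A possible Python rendering of this store step is sketched below. It makes several simplifying assumptions that are not in the slides: expired and stale entries are treated as one category, step 4 (size-based deletion) is left out, the function name store and its parameters are hypothetical, and the caller is expected to call move_to_end on every hit so that insertion order reflects recency.

```python
from collections import OrderedDict


def store(cache: OrderedDict, capacity: int, url: str,
          size: int, expires: float, now: float) -> bool:
    """Insert url -> (size, expires), evicting expired entries first, then LRU ones.

    The OrderedDict keeps the least recently used entry at the front.
    """
    if size > capacity:
        return False                                    # object can never fit
    cache.pop(url, None)                                # replace any older copy
    used = sum(entry_size for entry_size, _ in cache.values())
    if used + size > capacity:
        expired = [k for k, (_, exp) in cache.items() if exp < now]
        for key in expired:                             # steps 1-2: drop expired entries
            used -= cache.pop(key)[0]
    while used + size > capacity:                       # step 3: drop LRU entries
        _, (old_size, _) = cache.popitem(last=False)
        used -= old_size
    cache[url] = (size, expires)
    return True


demo = OrderedDict()
store(demo, 100, "/a", 40, expires=50.0, now=0.0)
store(demo, 100, "/b", 40, expires=500.0, now=0.0)
store(demo, 100, "/c", 40, expires=500.0, now=100.0)    # "/a" has expired and is removed
print(list(demo))
```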

Cache Replacement

• Cache size is limited, so a replacement policy is needed

• LRU

• LFU

• GreedyDual-Size

• Many others
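
Of the policies listed, GreedyDual-Size is probably the least familiar, so here is a minimal Python sketch of it. Each object gets a value H = L + cost / size, the object with the smallest H is evicted, and L is raised to the evicted H. The class and method names are hypothetical, the uniform cost of 1.0 is an assumption, and the linear scan for the minimum stands in for the priority queue a real implementation would likely use.

```python
from typing import Dict, Tuple


class GreedyDualSize:
    """GreedyDual-Size sketch: evict the object with the smallest H = L + cost / size."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.inflation = 0.0                                    # the running value L
        self.items: Dict[str, Tuple[int, float, float]] = {}    # url -> (size, cost, H)

    def access(self, url: str, size: int, cost: float = 1.0) -> bool:
        """Record a request; return True on a hit and False on a miss."""
        if url in self.items:
            size, cost, _ = self.items[url]
            self.items[url] = (size, cost, self.inflation + cost / size)  # refresh H
            return True
        if size > self.capacity:
            return False                                        # too large to cache
        while self.used + size > self.capacity:
            victim = min(self.items, key=lambda u: self.items[u][2])
            victim_size, _, victim_h = self.items.pop(victim)
            self.inflation = victim_h                           # L rises to the evicted H
            self.used -= victim_size
        self.items[url] = (size, cost, self.inflation + cost / size)
        self.used += size
        return False


gds = GreedyDualSize(capacity=100)
gds.access("/big", 80)
gds.access("/small", 10)
gds.access("/mid", 20)      # does not fit: the large, low-H object is evicted
print(sorted(gds.items))
```

With a uniform cost, large objects have low H values and are evicted first, which tends to keep many small objects in the cache.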

Cache Consistency

• Multiple copies of objects are created
– How and when should the copies be refreshed?

• Goals
– Avoid stale copies
– Keep non-useful traffic as low as possible

Cache Consistency: Polling

Solution 1: polling every time

Implemented in HTTP using the optional "if-modified-since" request header field

Benefit: strong consistency

Drawback: very slow cache hit

Cache Consistency: Polling

Solution 2: polling if the TTL expires, widely used

– Associate a TTL (12 hours or 2 days) with each cached object

Implemented in HTTP using the optional "expires" header field

Benefit: fast cache hit

Drawback: weak cache consistency (5% stale), because the TTL is an a priori estimate of an object's lifetime

Cache Consistency

• Solution 3: invalidation protocols

• The server helps the proxy maintain consistency

• Invalidation protocols
– When the proxy makes a request:
  • Piggyback cache validation (PCV): the proxy provides a list of other potentially stale copies for the server to validate
  • Piggyback cache invalidation (PCI): the server lists copies that have been updated since the last access
– Use of volumes
  • Volume lease:
    – The client receives a lease from the server
    – While the lease is valid, the client can retrieve copies from the proxy
    – When the lease expires, the client has to renew it

• Problems: scalability; servers need to keep cache state

Cache Cooperation

• Hierarchical caching
– Cache servers form a hierarchy of tree-like structures
– Parent servers: at the top of the hierarchy, they receive requests from child servers. If they do not have the requested objects, they ask either their parents or the original web servers
– Sibling servers: if the local cache does not have the requested object, it asks its sibling caches. If the sibling caches do not have the object, the local cache asks the parent cache

Cache Hierarchies

• Use hierarchy to scale a proxy
– Why?
  • Larger population = higher hit rate (fewer compulsory misses)
  • Larger effective cache size
– Why is the population for a single proxy limited?
  • Performance, administration, policy, etc.

• NLANR cache hierarchy
– Most popular
– 9 top-level caches
– Based on the Internet Cache Protocol (ICP)
– Squid/Harvest proxy

• How to locate content?

ICP (Internet cache protocol)

• Simple protocol to query another cache for content

• Uses UDP – why?

• ICP message contents
– Type: query, hit, hit_obj, miss
– Other: identifier, URL, version, sender address
– Special message types used with the UDP echo port
  • Used to probe a server or "dumb cache"

• Query and then wait until time-out (2 sec)

• Transfers between caches are still done using HTTP
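
The sibling-then-parent lookup order can be illustrated with a small in-memory simulation in Python. Nothing here speaks the real ICP wire format or uses UDP: the CacheNode class, its icp_query method, and the dictionary standing in for the origin servers are hypothetical, and time-outs are ignored.

```python
from typing import Dict, List, Optional


class CacheNode:
    """In-memory stand-in for a cache that answers ICP-style HIT/MISS queries."""

    def __init__(self, name: str, parent: Optional["CacheNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.siblings: List["CacheNode"] = []
        self.store: Dict[str, bytes] = {}

    def icp_query(self, url: str) -> str:
        return "HIT" if url in self.store else "MISS"

    def resolve(self, url: str, origin: Dict[str, bytes]) -> str:
        """Fetch url into this cache and return the name of where it came from."""
        if url in self.store:
            return self.name
        for peer in self.siblings:                  # ask the siblings first
            if peer.icp_query(url) == "HIT":
                self.store[url] = peer.store[url]   # the transfer itself would use HTTP
                return peer.name
        if self.parent is not None:                 # all siblings missed: go up
            source = self.parent.resolve(url, origin)
            self.store[url] = self.parent.store[url]
            return source
        self.store[url] = origin[url]               # top of the hierarchy: origin server
        return "origin"


origin = {"/x": b"page"}
parent = CacheNode("parent")
child_a = CacheNode("a", parent)
child_b = CacheNode("b", parent)
child_a.siblings = [child_b]
child_b.siblings = [child_a]

print(child_a.resolve("/x", origin))   # nobody has it yet
print(child_b.resolve("/x", origin))   # sibling "a" now answers HIT
```

After these two requests the object is stored in both children and in the parent, so the simulation also shows how copies of a popular object multiply across the hierarchy.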

Squid

[Figure sequence: a client, three child caches, and one parent cache. The panels show a web page request triggering ICP queries, the ICP MISS or ICP HIT replies, and the web page request that follows.]

Hierarchical caching

• Ideally, want the cache mesh to behave as a single cache with equivalent capacity and processing capability

• ICP: many copies of popular objects created – capacity wasted

• High latency: more than one hop is needed to search for an object

• How to improve? Discuss!

Problems with caching

• Over 50% of all HTTP objects are uncacheable.

• Sources:
– Dynamic data: stock prices, frequently updated content
– CGI scripts: results based on passed parameters
– SSL: encrypted data is not cacheable
  • Most web clients don’t handle mixed pages well, so many generic objects are transferred with SSL
– Cookies: results may be based on passed data
– Hit metering: the owner wants to measure the number of hits for revenue, etc., hence cache busting

Risks of Using Proxy

• Benefits: reduced latency, bandwidth saving, etc.

• Risks
– Obsolete data
– Violation of client privacy: the proxy can keep a log file recording which objects the client has requested
– Data integrity

Real Proxy Servers

• Squid: the most widely used; free and works well. http://www.squid-cache.org/

• Microsoft ISA Server 2004: Microsoft developed ISA to replace Microsoft Proxy Server. It is fully functional with Active Directory. http://www.microsoft.com/isaserver/

• Apache: the Apache web server has a module for reverse caching (experimental). http://httpd.apache.org/docs-2.0/mod/mod_cache.html

• Cisco Cache Engine: sits next to (mostly) Cisco routers and receives transparently redirected HTTP requests. http://www.cisco.com/warp/public/cc/pd/cxsr/500/index.shtml

• CERN/W3C HTTPd: the original proxy server. http://www.w3.org/hypertext/WWW/Daemon/Status.html
