
Caching
• Caching uses faster hardware to save information (code or data) that you have used recently so that, if you need it again, it takes less time to access
  – for processing a program, caching takes place in cache memory, which is either stored on the CPU or on the motherboard
    • storage is typically for a very brief time period (fractions of a second)
  – for secondary storage, caching is stored in a buffer on the hard disk
    • storage typically lasts until there are new hard disk accesses
  – for web access, caching is stored on the hard disk itself
    • storage is typically for about a month if the information being stored is static (dynamic web content is usually not cached)


Controlling Browser Caches from Apache
• Why wouldn't you want your pages cached in the browsers of the users who visit your website?
  – if the content is being modified often
  – if the content is dynamic
    • typically browser caches will not cache dynamic content – for instance if the file extension includes .php, .dhtml, or .cgi – but this is not always the case
  – if the web page causes cookies to be created or set
  – if information is sent to a specific user, for instance a page created as a result of entering data into a form
• From Apache, you can control how long items are cached using the mod_expires module, which sets the HTTP Expires header
  – if the content is being modified, you can set an expiration date relative to the last modification
  – alternatively, you might want to set an expiration date relative to the date the item is sent (e.g., 3 days from now)


Expires Directives
• ExpiresActive – controls whether an expiration will be sent in the header or not (values on or off)
  – just because you set this to on does not necessarily mean an expiration will be sent with the header; see below
• ExpiresByType type/encoding "<base> [plus] {<num> <type>}*"
  – this is an algorithm for computing the expiration time for a particular file type
• ExpiresDefault "<base> [plus] {<num> <type>}*"
  – this is an algorithm for computing the expiration time for all other file types
  – <base> is either access or modification
  – plus is merely a keyword, and optional
  – num is a number
  – type is the type of time unit (e.g., month, week, day, hour)
  – examples (a combined httpd.conf sketch follows below):
    • ExpiresDefault "access plus 2 weeks"
    • ExpiresByType text/html "access plus 1 day"
    • ExpiresByType image/gif "access plus 1 week 3 days 6 hours"
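To see these directives together, a minimal httpd.conf sketch (the <IfModule> wrapper is an assumption; the values are the examples above) might look like:

  <IfModule mod_expires.c>
      ExpiresActive On
      # fallback rule for any type not matched below
      ExpiresDefault "access plus 2 weeks"
      # per-type rules
      ExpiresByType text/html "access plus 1 day"
      ExpiresByType image/gif "access plus 1 week 3 days 6 hours"
  </IfModule>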


Expiration Contexts
• These directives can be placed in any context
  – in a container, the directives only impact files of that container (e.g., that directory)
  – in an .htaccess file, the directives only impact files of that directory (or lower)
  – ExpiresByType and ExpiresDefault can override earlier definitions if placed in a lower context
    • e.g., if one is placed in <Directory /var/web/htdocs> and a later one is placed in <Directory /var/web/htdocs/pub>, then the later one overrides the earlier one, and if there is an .htaccess file in /var/web/htdocs/pub/foo, it overrides the earlier ones
• If you do not use a default and a file does not match the given type in any ExpiresByType, then no expiration would be sent
  – in order to ensure that a document is resent every time, set the expiration to 1 second (the least amount of time)
    • ExpiresDefault "access plus 1 second"


SSI Caching
• Recall that SSI is used to generate and/or insert dynamic content into your web pages
  – Apache will not place a last-modification date or content-length in the HTTP header of any SSI page because these are difficult for Apache to determine, so by default there is no date to compare against with respect to whether a document has expired (and thus it will be assumed to have expired)
• However, if you want to permit SSI caching, it is possible because SSI can also be used just to create the outline of a page (e.g., using #include to include the navigation bar and footer)
  – two ways to do this (see the sketch after this list):
    • use the XBitHack Full directive, which tells Apache to determine the last-modification date from the SSI file itself, not any included files
    • use mod_expires and set the directives from the previous slides to a specified time for these particular files using a <Directory>, <Files>, or <Location> container
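A minimal sketch of the two approaches, assuming the SSI pages live in a directory such as /var/web/htdocs/ssi-pages (the path is an assumption):

  <Directory /var/web/htdocs/ssi-pages>
      Options +Includes
      # determine Last-Modified from the SSI file itself, not its #included files
      XBitHack full
      # or instead force an explicit expiration via mod_expires
      ExpiresActive On
      ExpiresByType text/html "access plus 1 day"
  </Directory>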


Proxy Caches
• Browser caches are useful for users who frequently view the same web sites over and over
  – but for an organization, browser caches cannot help
    • that is, a browser cache is local to a single computer and not shared among multiple clients of the same site
    • so the organization needs a cache that extends across multiple users, so that users who view the same web sites can obtain pages from the shared cache instead of having to wait for content to come across the Internet
• A proxy cache is one that extends across the users of an organization
  – a proxy cache is part of a proxy server
  – the proxy server offers a cache for all users so that commonly accessed content can be retrieved across the Internet once and then shared, improving network usage


Proxy Servers
• A proxy server serves at least two functions
  – it offers an extended cache to the local users so that multiple users who access the same pages get a savings
  – it offers control over what material can be brought into the organization's network and thus on to the clients
    • for instance, it can filter material for viruses
    • it can also filter material to disallow access to pornography, etc.
  – other functions that it can serve include
    • acting as an authentication server
    • performing SSL operations like encryption and decryption
    • collecting statistics on web traffic and usage
  – additionally, the proxy server can offer an added degree of anonymity in that it is the proxy server that places requests to remote hosts, not an individual's computer
    • thus, the IP address sent to servers is that of the proxy server, not of the client


Forward vs Reverse Proxies
• The typical form of proxy server is the forward proxy
  – a collection of browsers (on the same LAN, or within an organization) share the same proxy server
  – all client requests go to the proxy server
    • the server looks in its cache to see if the material is available
    • if not, the server makes sure that the request can be fulfilled (does not violate any access rules), and sends the request over the Internet
    • once a response is received, the server caches it and responds to the client
• A reverse proxy server is used at the server end of the Internet
  – requests from the Internet come into the proxy server, which then determines which web server to route each request on to
  – this might be used to balance the load of many requests for a company that runs multiple servers
  – it also allows the proxy server to cache information and respond directly if the requested page is in its cache
    • we'll consider reverse proxy servers in a bit


Using Apache as a Forward Proxy Server
• You can use your Apache web server as a forward proxy server with little enhancement
  – you might do this if you already have a web server and your organization uses it often (say for 50% of all web traffic)
• Use the mod_proxy module and the <Proxy> container (a combined sketch follows below)
  – ProxyRequests on (or off – the default)
  – ProxyTimeout x (x is in seconds)
  – <Proxy URL>
    • access directives go here, such as Deny from all, Allow from 172.31.5.0/24, and/or authorization directives that require logins
  – </Proxy>
    • the URL can be a regular expression and matches against the outgoing URL; use * for "all http requests"
• NOTE: to use Apache as a forward proxy, you must also configure your browsers to work with a proxy server – see pages 332-337
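A minimal forward-proxy sketch combining these directives (Apache 2.2-style access directives are assumed; the subnet is the example above):

  LoadModule proxy_module modules/mod_proxy.so
  LoadModule proxy_http_module modules/mod_proxy_http.so

  ProxyRequests On
  ProxyTimeout 300
  <Proxy *>
      # only clients on the local subnet may use the proxy
      Order deny,allow
      Deny from all
      Allow from 172.31.5.0/24
  </Proxy>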


Other Directives
• NoProxy valuelist
  – This directive specifies items (domains, subnets, IP addresses, hostnames) that should be handled by this server directly rather than forwarded to a remote proxy
  – These items are separated by spaces, as in 172.31.0.0/24 192 www.google.com
    • any URL that does not match one of these locations is instead forwarded to the location specified by a ProxyRemote directive
    • ProxyRemote * http://firewall.nku.edu
• ProxyBlock valuelist
  – Unlike NoProxy, the valuelist here can also include words or *; words mean that if they appear anywhere in the URL, the URL is blocked
• ProxyVia (on, off – the default, full, block)
  – If on, then any request that is relayed by the proxy will have a "Via:" line added to the header to indicate how the request was serviced
  – If full, then the server's version is also added to each Via line
  – If block, then all Via lines are removed
    • this differs from off because off only controls this server – other servers whose ProxyVia is on will still insert their own Via lines; if set to block, all Via headers are removed by this server


Additional Modules
• Aside from mod_proxy, you might use any of these
  – mod_proxy_ajp – AJP support for mod_proxy
    • AJP is the Apache JServ Protocol, which enhances performance by using a binary format for packets and adds SSL security
  – mod_proxy_balancer – extension for load balancing
    • Your Apache proxy server can issue requests to back-end servers (i.e., in a reverse proxy setting)
    • There are 3 load balancing algorithms:
      – Request Counting – each server gets an equal number of requests
      – Weighted Traffic Counting – requests are distributed based on the byte count of the work each server has recently handled
      – Pending Request Counting – requests are distributed based on how many requests are currently waiting for each server
  – mod_proxy_connect – used for CONNECT request handling (CONNECT is an HTTP method)
  – mod_proxy_ftp, mod_proxy_http – FTP and HTTP support for the proxy server
  – mod_cache – as we discussed earlier in these notes


Reverse Proxy Uses
• The reverse proxy works at the server end
  – One of its capabilities is to perform load balancing
• Additional features of the reverse proxy server are
  – Scrubbing – verification of incoming HTTP requests to make sure that each request is syntactically valid
  – Fault tolerance – as part of load balancing, if a server goes down, the reverse proxy server can continue to service the incoming requests by reallocating the requests that the failed server was supposed to handle and rebalancing the load across the remaining available servers
  – HTTPS support – if the back-end web servers do not have the capability
  – Redeployment – if a request requires a web application (e.g., execution of Perl code), the request can be sent to a separate server that runs code
  – Central repository – to cache static data for quick response time


The Reverse Proxy Server


Apache as a Reverse Proxy Server
• By default, Apache is configured to serve as a reverse proxy server
  – the forward proxy functionality is controlled through ProxyRequests
    • if you set it to off, Apache can function as a reverse proxy server but not a forward proxy server
  – the directives ProxyPass and ProxyPassReverse map incoming URLs to a new location, no matter whether that incoming URL is coming internally (from a client of this site that the Apache server is serving as a proxy) or externally (for a reverse proxy mapping)
    • ProxyPass /foo http://foo.example.com/bar
    • ProxyPassReverse /foo http://foo.example.com/bar
  – now, any request received by this server for anything under directory /foo is sent (redirected) to foo.example.com/bar (see the sketch below)
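A sketch of how this might look in a server configuration (the ServerName and the explicit Off setting are assumptions; the ProxyPass lines are the slides' example):

  ProxyRequests Off
  <VirtualHost *:80>
      ServerName www.example.com
      # forward anything under /foo to the back-end server...
      ProxyPass        /foo http://foo.example.com/bar
      # ...and rewrite redirect headers in its responses back to /foo
      ProxyPassReverse /foo http://foo.example.com/bar
  </VirtualHost>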


Squid
• Apache is not the best proxy server
  – its main use is as a web server only
  – it does not provide the types of security and access control that Squid has
• Therefore, we will concentrate on Squid
  – note: the textbook also discusses Pound, but we will skip that
• Squid is an open source proxy server
  – its main use is as a forward proxy server, but it can also be set up as a reverse proxy server
  – its genesis is back with the original CERN HTTP server from 1994, which had a caching module
  – the caching module was separated out and has evolved over time into Squid
• In these notes, we will look at installing and running a basic configuration for Squid, along with setting up access control list directives to control access and content
  – we are going to skip over a lot of detail on Squid as there is not sufficient time to cover it


Installing Squid
• Squid source code for Linux or Windows can be found at http://www.squid-cache.org/Download/
  – installing from the source code allows you to configure Squid (much like we did with Apache)
• Before compiling, we will want to "tune" our kernel
  – recall that Linux limits the number of file descriptors available to your software (possibly to 64)
  – this wasn't too critical in Apache unless we were going to use it to run a lot of virtual hosts (VHs)
  – but it is important to Squid because Squid uses a file descriptor for each request, so we will want to increase the number of descriptors available to Squid


Increasing File Descriptors
• In most Unix systems, it's easy – just use ulimit -n
  – ulimit -n unlimited
  – ulimit -n 8192 (or some other number)
    • some Unix systems use different commands or require editing a config file
• In Linux it's more complex
  – edit the file /usr/include/bits/typesizes.h and change the entry #define __FD_SETSIZE 1024 to a larger number such as 4096 or 8192
  – next, place that number in the file /proc/sys/fs/file-max (instead of editing that file, you could do echo 8192 > /proc/sys/fs/file-max)
  – now, you can use ulimit with -Hn, as in ulimit -Hn 8192
    • make sure the number you use is consistent in all three operations
    • when done, you do not have to reboot Linux; now you can configure and compile Squid from source code


Configure Options
• Similar to Apache, you can change many defaults in Squid through the ./configure command (an illustrative build command follows below)
  – --prefix – same as in Apache
  – --sysconfdir, --localstatedir
    • change the location of the configuration (from prefix/etc) and var (from prefix/var) directories; the var directory stores Squid's log files and disk cache
  – --enable-x
    • allows you to enable Squid-specific modules including
      – gnuregex
      – carp (Cache Array Routing Protocol, useful for forwarding cache misses to an array)
      – pthreads
      – storeio (storage modules)
      – removal-policies
      – ssl, openssl
    • a full list is available at http://wiki.squid-cache.org/SquidFaq/CompilingSquid#configure_options
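An illustrative source build using some of these options (option spellings vary somewhat between Squid versions, so treat this as a sketch):

  ./configure --prefix=/usr/local/squid \
              --sysconfdir=/usr/local/squid/etc \
              --localstatedir=/usr/local/squid/var \
              --enable-removal-policies=lru,heap \
              --enable-storeio=ufs,aufs \
              --with-openssl
  make
  make install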


Squid Configuration
• Once compiled and installed, running Squid is fairly simple if you don't want to make any changes to the configuration
  – the config file is squid.conf
    • like httpd.conf, the file contains comments and directives
    • directives are similar to httpd.conf directives in that they all start with the directive name and are followed by zero or more arguments, which can include for instance on/off, times (e.g., 2 minutes), IP addresses, filenames/paths, keywords such as deny, UNGET, etc.
  – we will study some of the configuration directives later
  – as with Apache, changing the conf file requires that you restart Squid so that the file can be reread
    • although in Squid, you can keep Squid running and still have it reread this file
  – unlike Apache, Squid directives and values are case sensitive!


Initializing the Cache
• Before running Squid, and whenever you want to add a new cache directory, you must first initialize the cache directory(ies)
  – squid -z
  – this initializes all of the directories listed by the cache_dir directive (see the squid.conf sketch below)
• For this command to work successfully
  – you must make sure that the user that Squid runs under (probably squid) has read and write permission for each of the directories under cache_dir
    • when these directories are created, make sure they are either owned by squid or that squid is in the same group as the owner
  – the name of the owner of these directories is established using the cache_effective_user directive in squid.conf
  – you should start squid using the command su – squid
    • this tells Squid to switch from root to squid as soon as it can (after dealing with root-only tasks)
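A sketch of the related squid.conf lines (the cache path and sizes are illustrative):

  # run as the unprivileged user once root-only setup is done
  cache_effective_user squid
  # cache_dir <type> <directory> <size in MB> <level-1 dirs> <level-2 dirs>
  cache_dir ufs /usr/local/squid/var/cache 1000 16 256

After adding or changing a cache_dir line, run squid -z to initialize the new directory.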


Running Squid
• You start Squid from the command line and control it much like apachectl, but there are a lot of possible options; here we look at the most important (a typical run sequence follows below)
  – -a port – start Squid but have it listen to the port supplied rather than the default port (3128); this also overrides any port specified in squid.conf using the http_port directive
  – -f file – specify an alternative conf file
  – -k function – perform an administrative function such as reconfigure, rotate, shutdown, debug or parse
    • parse causes Squid to read the conf file to test it for errors without using it to configure itself; this is useful for debugging your conf file
  – -s – enables logging to the syslog daemon
  – -z – initializes cache directories
  – -D – disable the initial DNS test
    • squid usually tests the DNS before starting
  – -N – keep Squid in the foreground instead of running it as a background daemon
    • you might do this when first testing Squid so that you can see immediate feedback printed to the terminal window; once debugged, kill Squid and rerun it without this option
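A typical first-run sequence using these options might look like this (the install path assumes the default /usr/local/squid prefix):

  # check the conf file for syntax errors
  /usr/local/squid/sbin/squid -k parse
  # initialize the cache directories listed in squid.conf
  /usr/local/squid/sbin/squid -z
  # first run: stay in the foreground so errors appear on the terminal
  /usr/local/squid/sbin/squid -N
  # once debugged, run it normally as a background daemon
  /usr/local/squid/sbin/squid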


Comments
• If you want to run Squid upon booting
  – you might add the start-up command to a script in rc.d, init.d or inittab
• Many people do not like running Squid in the main OS environment
  – for security purposes, just as you might not want to run Apache in the main OS environment, they therefore create a chroot environment
  – this is a new root filesystem directory separate from the remainder of the filesystem
  – anyone who hacks into Squid will not be able to damage your file system, only the chroot environment
• The safest way to shut down Squid is through
  – squid -k shutdown
    • do not use kill
• To reconfigure Squid after changing squid.conf
  – run squid -k reconfigure; this prevents you from having to stop/restart Squid
• To rotate Squid log files, use squid -k rotate
  – put this in a crontab to rotate the files every so often (e.g., once a day; see the example below)
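For example, a crontab entry that rotates the logs nightly might look like this (the time and install path are assumptions):

  # rotate Squid's logs every night at 00:30
  30 0 * * * /usr/local/squid/sbin/squid -k rotate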


ACLs in Squid
• Since Apache can be used as a proxy server, you might wonder why use Squid?
  – Squid allows you to define access control lists (acls), which in turn can then be used to specify rules for access
    • who should be able to access web pages via Squid?
    • what pages should be accessible? are there restrictions based on file name? web server? web page content or size?
    • what pages should be cached?
    • what pages can be redirected?
  – such rules are defined in two portions
    • an acl definition (similar to what we saw when defining acls in bind)
    • followed by an access statement (allow or deny statements)
  – Squid offers a variety of acl definition types
    • IP addresses
    • IP aliases
    • URLs
    • user names (requiring authentication)
    • file types


Defining ACLs and Rules
• Define access in two steps
  – first, define your ACL statements
    • simple definitions mapping a name to a specification
      – such as calling a particular IP address "home" or using a regular expression to match against URLs and calling them "homenetwork"
    • each acl contains a type that specifies what type of information you are using as a comparison, e.g., IP address, IP alias, user name, filename, port address, regular expression
  – second, define a rule for how the ACL(s) is to be used
    • the rule will typically specify whether this acl can or cannot gain access through Squid; for instance, if foo is a previously defined acl, then the following allows access
      – http_access allow foo
  – you must define an acl before you use it in any rule


Example
• The most common form of acl is to define and permit access to specific clients
  – we will define some src (source IP address) acls
    • typically with src, we define specific IP addresses or subnetworks (rather than IP aliases)
  – acl localhost src 127.0.0.1
    • here, we define the source acl "localhost" to be the IP address 127.0.0.1
  – acl mynet src 10.2/16
    • this could also be written 10.2.0.0/16
• Now we use our acls to allow and deny access
  – http_access allow localhost
  – http_access allow mynet
  – http_access deny all
    • here, we are allowing access only from localhost and those on "mynet"; everyone else is denied
    • the order of the allow and deny statements is critical; we will explore this next time


Types of ACLs
• Aside from src, you can also specify ACLs based on
  – dst – the URL of the web server (destination)
  – srcdomain and dstdomain – same as src and dst except that these permit IP aliases
  – srcdom_regex and dstdom_regex – same as srcdomain and dstdomain except that the IP aliases can be denoted using regular expressions
  – time – specify the times and days of the week that the proxy server allows or denies access
  – port, method, proto – specify the port(s) through which the proxy server permits access, the HTTP methods allowable (or denied), and the protocol(s) allowable (or denied)
  – rep_mime_type – allow or deny access based on the type of file being returned
    • we will study these (and others) in detail next time


Types of ACLs
• src – the IP address of the user (client) whose requests are going from their browser to the Squid proxy server
• srcdomain – the IP alias of the user
• dst – the IP address of the requested URL on the Internet
• dstdomain – the IP alias of the request
• myip – same as src, but it is the internal IP address rather than (possibly) an external IP address
• srcdom_regex, dstdom_regex – same as srcdomain and dstdomain except that regular expressions are permissible
• arp – access controlled based on the MAC address


Comments
• You can specify an IP alias using src or dst, but this requires that Squid use a reverse DNS lookup
  – it is best to use srcdomain/dstdomain if you want to specify aliases instead of addresses
• When using srcdomain and dstdomain, you can specify part of the domain, such as .nku.edu or .edu, instead of www.nku.edu
  – this is not true if you specify IP aliases using src and dst
• If you use src/dst, then after doing the reverse lookup one time, the value is cached
  – if the IP address were to change, Squid would not be able to find the computer in the future


More ACLs
• port – specify one or more port numbers
  – ranges are separated by a dash, as in 8000-8010
  – multiple ports are separated by spaces or given in separate definitions
  – typically, you will define "safe" ports and then disallow access to any port that is not safe, for example:
    • acl safe_ports port 80 443 8080 3128
    • http_access deny !safe_ports
• method – permissible HTTP method
  – GET, POST, PUT, HEAD, OPTIONS, TRACE, DELETE
  – squid also knows additional methods including PROPFIND, PROPPATCH, MKCOL, COPY, MOVE, LOCK, UNLOCK, CONNECT and PURGE
    • acl allowable_method method GET HEAD OPTIONS
    • http_access deny !allowable_method


More ACLs
• proto – permissible protocol(s)
  – http, https, ftp, gopher, whois, urn and cache_object
  – ex: acl myprotos proto HTTP HTTPS FTP
• proxy_auth – requires user login and a file/database of usernames/passwords
  – you specify the allowable user names here, such as
    • acl legal_users proxy_auth foxr zappaf newellg
• maxconn – maximum connections
  – you can control access based on a maximum number of server connections
  – this limitation is per IP address, so for instance you could limit users to 25 connections; once the number is exceeded, that particular IP address gets "shut out"


Time ACLs
• To control when users can access the proxy server, based on either days of the week, or times (or both)
  – S, M, T, W, H, F, A for Sunday – Saturday, D for weekdays
  – time is specified as a range, hh:mm-hh:mm, in military time
• The format is acl name time [day(s)] [hh:mm-hh:mm]
  – example: to specify weekdays from 9 am to 5 pm:
    • acl weekdays time D 09:00-17:00
  – example: to specify Saturday and Sunday:
    • acl weekend time SA
• The first time must be less than the second
  – if you want to indicate a time range that wraps around midnight, such as 9:30 pm to 5:30 am, you have to divide it into two definitions (9:30 pm – 11:59 pm, and 12:00 am – 5:30 am)
  – if days have different times, you need to separate them into multiple statements; for example, defining a time for M 3-7 and W 3-8 would require two definitions (see the sketch below)
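A sketch of how the wrap-around and per-day cases might look in squid.conf (the acl names are illustrative):

  # 9:30 pm to 5:30 am must be split into two definitions
  acl NightEarly time 21:30-23:59
  acl NightLate  time 00:00-05:30
  # different hours on different days need separate acls
  acl MondayHours    time M 15:00-19:00
  acl WednesdayHours time W 15:00-20:00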


Regular Expressions and More ACLs
• As stated earlier, you can specify regular expressions in srcdom_regex and dstdom_regex
• There are also regex versions to build rules for the URL
  – url_regex and urlpath_regex
    • for the full URL and the path (directory) portion of the URL respectively
    • you might use these to find URLs that contain certain words, such as paths that include "bin", or paths/filenames that include words like "porn"
  – ident_regex
    • to apply regular expressions to user names after the Squid server performs authentication


User Names & Authentication
• The ident acl can be used to match user names
• The proxy_auth acl can specify either REQUIRED or specific users by name, which then requires that a user log in
  – authentication requires that the user perform a username/password authentication before Squid can continue
    • any request that must be authenticated is postponed until authentication can be completed
  – although authentication itself adds time, using ident or proxy_auth also adds time after authentication has taken place because Squid must still look up the user's name among the authentication records to see if the name has been authenticated
• Squid itself does not come with its own authentication mechanisms, so we have to add them as modules, much like with Apache


Other ACL Types
• req_mime_type and rep_mime_type
  – test the content-type in either the request or response header
  – it only makes sense to use req_mime_type when uploading a file via POST or PUT
  – example: acl badImage rep_mime_type image/jpeg
• Browsers
  – restrict what type(s) of browser can make a request
• External ACLs
  – these allow Squid to sort of "pass the buck" by requesting that some outside process(es) get involved to determine whether a request should be fulfilled or not
    • external ACLs can include factors such as cache access time, number of child processes available, login or ident name, and many of the ACLs we have already covered, but now handled by some other server


Matching Against ACLs
• As we have seen, a single ACL can contain multiple items to match against
• ACL lists are "ORed" – the ACL is true if there is a match among any item in the list
  – to establish whether an ACL is true, Squid works down the list of items looking for the first match, or the end of the list
    • if a match is found, the ACL is established as true; otherwise that ACL is established as false
  – for example: acl Simpsons ident Lisa Bart Marge Homer
    • Squid will attempt to confirm that the user's identity, as previously established via authentication, matches any one of the items
  – if you have a lot of ACLs and/or lengthy lists in ACLs, it is worthwhile ordering the entries from most common to least common
    • imagine that Homer is the most common user – then move Homer's name to be first in the list; and if Bart is the least common user, move his name to the end


Types of Rules
• The most common rule is the http_access rule
  – access is either allow or deny
  – if allow and the acl matches, then you are allowing the client to have access; if deny and the acl matches, you are disallowing the client from having access
• You can also use http_reply_access
  – this controls whether the retrieved item is let through the proxy server back to the client; again you can use allow or deny
    • this rule allows you to supply definitions that can disallow items being returned based on content (type, size, etc)
• You can control whether an item is cached or not using no_cache rules
  – here, the word "no" in the rule means "do not cache"
    • it looks like a double negative: no_cache deny someACL
  – you would use this to ensure certain pages do not get cached (e.g., they have dynamic content, they aren't worth caching, they are too large)


Matching Rules
• Imagine an access rule says
  – http_access allow A B C D
    • this means that all of A, B, C and D must be true for access to be allowed
    • Squid will stop searching this rule after the first mismatch, so again, you might order these – in this case from the least likely to be true to the most likely – to be more efficient (if A is usually true but C is seldom true, put C first)
  – http_access deny A B C D
    • all must be true to deny access; if any are untrue, the rule is skipped
• To create OR access rules, list each access rule sequentially, as in
  – http_access allow A
  – http_access allow B
    • now, if either A or B is true, access is allowed


Allow vs Deny Order
• In Apache, you specified the order in which allow and deny are enforced using the Order directive
  – in Squid, the order is based strictly on the order of the rules as they appear in your conf file
• In Apache, you would specify "deny from all" first and then override this with more specific "allow" statements
• In Squid, you do this the opposite way
  – place an allow statement first; if the rule is true, then the remainder of the rules are skipped
  – add a deny-all type statement at the end to act as a default or "fall through" case
  – you might define ALL to be everyone (e.g., IP address 0/0)
  – the deny all will look like this: http_access deny ALL
• You can specify multiple sets of rules; typically each set will contain allow statements and end with a deny ALL


Rule Organization
• You have to place your allow and deny rules in a logical manner for them to work
  – for instance, you would not put http_access deny ALL as the first rule because it would be true of everyone and no other rules would be checked
• You will want to organize rules generally like this:
  – specific denial rules
  – specific acceptance rules
  – http_access deny ALL
• In this way, if a particular situation fits both a denial and an acceptance rule, the access is denied
  – for instance, a request may be acceptable because it has the proper src IP address, but it arrives during the wrong time of day, so it should ultimately be denied
  – by reversing the order of the denial and acceptance rules, the request would be fulfilled, because as soon as it is accepted, access is allowed and no further rules are considered


Common Scenarios
• Allowing only local clients
  – acl ALL src 0/0
  – acl MyNetwork src 172.31/16
  – http_access allow MyNetwork
  – http_access deny ALL
• Blocking a few clients (assume ALL and MyNetwork are as defined above)
  – acl ProblemHosts src 172.31.1.5 172.31.1.6 172.31.4/24
  – http_access deny ProblemHosts
  – http_access allow MyNetwork
  – http_access deny ALL
    • notice the ordering here: since MyNetwork is more general than ProblemHosts, we first deny anyone specifically in ProblemHosts, then we allow access to those in MyNetwork that were not in ProblemHosts


More
• Denying access to any URL that looks like it might contain pornography
  – acl PornSites url_regex -i porn nude sex [add more terms here]
  – http_access deny PornSites
  – http_access allow ALL
    • here we allow anyone access as long as the URL does not include any of the PornSites words
• We might want to add to this a refusal to accept replies that contain movie or image files
  – acl Movies rep_mime_type video/*
  – acl Images rep_mime_type image/*
  – http_reply_access deny Movies
  – http_reply_access deny Images
  – http_reply_access allow ALL


And More
• Here, we restrict access to working hours and our own site (disallow access to URLs off site)
  – acl WorkHours time D 08:30-17:30
  – acl OurLocation dstdomain "/usr/local/squid/etc/ourURLS"
  – http_access allow WorkHours OurLocation
  – http_access deny ALL
• And here is an example to permit only specific port accesses
  – acl SafePorts port 80 21 443 563 70 210 280 488 591 777 1025-65535
  – acl SSLPorts port 443 563
  – acl CONNECT method CONNECT
  – http_access deny !SafePorts
  – http_access deny CONNECT !SSLPorts
  – http_access allow ALL


Redirectors
• A redirector is similar to the rewrite rules and redirection used by Apache
  – here, however, we are redirecting an internal request before it leaves the proxy server
    • in Apache, we redirect an external request to either a new internal location/file or to a new external resource
  – this permits
    • access control
    • the removal of advertisements
    • local mirroring of resources
    • working around browser bugs
  – with access control, you can even send the user to a page that explains why they were rerouted
• A redirector is just a program that reads a URI (along with other information) and creates a new URI as output
  – redirectors are often written in Perl or Python, or possibly C


How to Use a Redirector
• To apply a redirector in Squid, issue one or more of the following directives (a combined sketch follows below)
  – redirect_program specifies the external program to run (the redirector)
  – redirect_children specifies how many redirector processes Squid should start
  – redirect_rewrite_host_header will update a request header to specify that a redirector is being used
  – redirect_access allows you to specify rules that decide which requests to send to redirectors
    • without this, every request to Squid is sent to a redirector to check whether it should be redirected
  – redirector_bypass will bypass a redirector if all spawned redirectors are currently busy; otherwise the requests begin to stack up and wait
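A sketch of how these directives might be wired together in squid.conf (the program path and the MyNetwork acl are assumptions):

  redirect_program /usr/local/squid/bin/redirector.py
  redirect_children 5
  # only send requests matching this acl to the redirector (acl assumed defined)
  redirect_access allow MyNetwork
  redirect_access deny all
  # don't queue requests when every redirector child is busy
  redirector_bypass on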


How to Write a Redirector
• A redirector receives four pieces of input
  – the request URI, including any query terms (after the ?)
  – the client IP address (and optionally the domain name)
  – the user's name from ident or proxy authentication
  – the HTTP request method
• Your redirector code will consist of rules that
  – investigate parts of the input to see if any of the redirector rules match the input
    • if a rule matches, then the redirector code will produce output, which will be a new URI, redirecting the request
    • an example might be to search for any URL being sent to an IP address in China and rewrite the query to a mirror site that exists in Taiwan
    • another example is to search the URI for certain "bad words" and, if any are found, redirect the request to a local page that explains why such requests are not being allowed (a minimal sketch follows below)
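A minimal redirector sketch in Python (named in the slides as a common choice); the word list and block page are assumptions, and the exact input/output conventions can differ between Squid versions:

  #!/usr/bin/env python
  # read "URI client-ip/fqdn user method" lines from stdin; write either a
  # replacement URI or a blank line (meaning "leave the request unchanged")
  import sys

  BAD_WORDS = ("porn", "nude", "sex")                  # assumed filter terms
  BLOCK_PAGE = "http://www.example.com/blocked.html"   # assumed explanation page

  def rewrite(uri):
      lowered = uri.lower()
      if any(word in lowered for word in BAD_WORDS):
          return BLOCK_PAGE
      return ""

  for line in sys.stdin:
      parts = line.split()
      uri = parts[0] if parts else ""
      sys.stdout.write(rewrite(uri) + "\n")
      sys.stdout.flush()   # Squid expects one unbuffered response line per request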


Redirector Code
• Aside from building a new URI for the request, a redirector can also alter components of a response header
• The redirector code may involve
  – database queries
  – searching the URI for specified regular expressions
  – complex computations
  – invoking other programs
    • thus, a redirector can take a long time to respond, which would slow Squid's processing down
    • this is one reason why the bypass directive is available – you don't necessarily want to penalize everyone because redirections are taking too much time
• Redirector code is commonly written in Perl, but it can be written in other scripting languages


Authentication Helpers
• As with Apache, Squid does not have built-in mechanisms for handling password files
  – so Squid turns to authentication helpers
  – Squid supports three forms of authentication; the first two are similar to Apache
    • Basic, Digest, NTLM (the last is an MS authentication protocol)
  – for each of these, you have to download the software, compile it, and then configure Squid to use the helper
    • we already visited how to write acl and http_access directives that use authentication, so we skip that here
  – basic authentication helpers come with Squid; you can use NCSA (simple), LDAP, MSNT (for MS NT databases), NTLM, PAM, SASL (which includes SSL), winbind, or others (a sketch follows below)
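As a sketch, wiring the NCSA basic helper into squid.conf might look like this (the helper and password-file paths are assumptions):

  auth_param basic program /usr/local/squid/libexec/ncsa_auth /usr/local/squid/etc/passwd
  auth_param basic children 5
  auth_param basic realm Squid proxy
  # require a successful login for anyone using the proxy
  acl AuthorizedUsers proxy_auth REQUIRED
  http_access allow AuthorizedUsers
  http_access deny all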


Log Files
• As with Apache, Squid uses log files to store messages of importance and to maintain access and error logs
  – however, one additional log that Squid has and Apache does not is a cache log, which records what files are cached
  – there are also optional log files available
    • useragent.log and referer.log, which contain information about user-agent headers and web referers for every access
    • swap.state and netdb_state, which store information regarding the disk and network performance of Squid
  – you can control the names of the log files and which of these optional log files are used through directives in your conf file
  – because there are so many logs and they can generate a lot of content, there are log rotation tools available, just as with Apache


cache.log
• This log contains
  – configuration information
  – warnings about performance problems
  – errors
• Entries are of the form
  – date time | message
• Configuration messages might include such things as
  – the process ID of a starting squid process
  – successful (or failed) tests of the DNS and the DNS IP address (as obtained from resolv.conf)
  – starting helper programs
• The remaining cache entries are made based on a specified debug level that dictates which types of operations should be logged here
  – normal information, warnings, errors, emergencies, etc.


access.log
• Much like Apache's access log, Squid's access log stores every request received
  – each entry contains 10 pieces of information
    • timestamp
    • response time
    • client address
    • status code of the request
    • size of the file transferred
    • HTTP method
    • URI
    • client identity (if available)
    • how the request was fulfilled on a cache miss (that is, where we had to go to get the file)
    • content type
  – status codes differ from Apache's as they indicate cache access as well as server status codes, and include these:
    • TCP_HIT, TCP_MISS, TCP_REFRESH_HIT, TCP_REF_FAIL_HIT, TCP_REFRESH_MISS, TCP_CLIENT_REFRESH_MISS, TCP_IMS_HIT, TCP_SWAPFAIL_MISS, TCP_NEGATIVE_HIT, TCP_MEM_HIT, TCP_DENIED, TCP_OFFLINE_HIT, TCP_REDIRECT and NONE


Directives for access.log
• log_icp_queries – default is enabled; controls whether ICP (Internet Cache Protocol) requests are logged or not
• emulate_http_log – whether to use the same format as an HTTP server's access log (that is, match Apache's access log) or use Squid's native format, which contains more information
• log_mime_hdrs – if set to on, Squid will add the HTTP request and response headers to each log entry (this adds two more fields to each entry)
• log_fqdn – toggles whether Squid records the requesting client by IP address or hostname – if hostname, then Squid has to do a reverse DNS lookup, which takes more time
• log_ip_on_direct – similar, except it controls whether the destination is logged by IP address or hostname when the request is sent directly to the origin server
• strip_query_terms, uri_whitespace – whether to remove the query terms from a URL and whether to strip, chop, or encode white space in a URL (if any)


store.log
• The store.log file records decisions to store and remove objects from the Squid cache
  – if an object is cached, the entry includes where it was cached and when
  – if an object is uncacheable, then the entry indicates why the object was uncacheable
  – if the cache is full, a replacement strategy is used to decide what to remove, and any such action is logged here
• The store log contains the following fields:
  – timestamp, action (SWAPOUT, RELEASE, SO_FAIL), directory number (which cache), file number, cache key (the hash value of the object), status code, date, last-modified from the HTTP response header, expires, content-type, content-length/size, HTTP method and URI