Squid (web-cache proxy)

Viralator (http://viralator.sourceforge.net/ - Squid + Perl script)
squid-vscan (http://sourceforge.net/project/showfiles.php?group_id=10590&release_id=68273)
HAVP + Squid (http://www.server-side.de/ideas.htm)
How To Set Up A Caching Reverse Proxy With Squid 2.6 On Debian Etch

How can I configure squid so that it never caches some web sites?

Add the following lines to /etc/squid/squid.conf (newer Squid releases spell the second directive cache deny instead of no_cache deny):

 acl NOCACHEDOMAIN dstdomain www.redhat.com
 no_cache deny NOCACHEDOMAIN

Squid will then not cache any content that comes from the domain www.redhat.com. In /var/log/squid/access.log, requests for pages from that domain show “TCP_MISS” on every consecutive visit:

 1197363963.721    892 127.0.0.1 TCP_MISS/200 11813 GET http://www.redhat.com/ - DIRECT/209.132.177.50 text/html
 1197364100.832    906 127.0.0.1 TCP_MISS/200 11813 GET http://www.redhat.com/ - DIRECT/209.132.177.50 text/html
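
To check many log lines for the cache result, the fields can be pulled apart with a few lines of Python. This is a minimal sketch that assumes Squid's native access.log format as shown in the two lines above; the function name parse_access_line and the field labels are illustrative, not part of Squid.

```python
def parse_access_line(line):
    """Split one native-format access.log line into a few named fields."""
    fields = line.split()
    # Field 3 looks like "TCP_MISS/200": cache result code, then HTTP status.
    result, status = fields[3].split("/")
    return {
        "client": fields[2],
        "result": result,
        "status": int(status),
        "bytes": int(fields[4]),
        "method": fields[5],
        "url": fields[6],
    }


# The first log line quoted above.
sample = ("1197363963.721    892 127.0.0.1 TCP_MISS/200 11813 "
          "GET http://www.redhat.com/ - DIRECT/209.132.177.50 text/html")
entry = parse_access_line(sample)
print(entry["result"], entry["url"])
```

A result of TCP_MISS on every visit to the same URL is what the no-cache rule should produce.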

Speed up your Internet access using Squid's refresh patterns

Refresh patterns determine what is saved and served from the cache. Ideally, you would want Squid to follow the directions of the Web servers serving the content to determine what is cacheable and for how long. These directions are sent as HTTP headers that Squid processes and understands. Unfortunately, the directions given by most servers are just the Web servers' defaults, and they do not produce significant bandwidth savings.

Refresh patterns are of the format:

 refresh_pattern [-i] regex min percent max [options]

where min and max are time values in minutes and percent is a percentage figure. The options are:

override-expire – ignores the expire header from the Web server.
override-lastmod – ignores the last modified date header from the Web server.
reload-into-ims – a reload request from a client is converted into an If-Modified-Since request.
ignore-reload – a client's no-cache or “reload from origin server” directive is ignored. The request can therefore be satisfied from the cache if available.
ignore-no-cache – a no-cache directive from the Web server which makes an object non-cacheable is ignored.
ignore-no-store – a no-store directive from the Web server which makes an object non-cacheable is ignored.
ignore-private – a private directive from the Web server which makes an object non-cacheable is ignored.
ignore-auth – objects requiring authorisation are non-cacheable. This option overrides this limitation.
refresh-ims – a refresh request from a client is converted into an If-Modified-Since request.

Consult your configuration file to see which of these options are available in your version of Squid.

Refresh patterns take effect if there is no expire header from the origin server, or if your refresh pattern has the override-expire option. Example:

 refresh_pattern -i \.gif$ 1440 20% 10080

This says:

For all objects whose names end in .gif or .GIF (that is, GIF image files) and that carry no expire header:
if the age (that is, how long the object has been on your cache server) is less than 1,440 minutes, consider it fresh, serve it and stop
else if the age is greater than 10,080 minutes, consider it stale, go to the origin server for a fresh copy and stop
else the age is between the min and max values, so use the lm-factor to determine freshness. The lm-factor is the ratio of the object's age on your cache server to the period since the object was created or last modified on the origin server, expressed as a percentage. So if the object was created 10,000 minutes ago on the origin server and has been on the cache server for 1,800 minutes (that is the age), the lm-factor is 1,800/10,000 = 18%.
If the lm-factor is less than the percent in the refresh pattern (20%), the object is considered fresh; serve it and stop
else the object is stale; go to the origin server for a fresh copy.
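
The decision steps above can be sketched in Python. This follows the simplified description given here, not Squid's actual source code; the function name is_fresh and its parameter names are made up for illustration, and all times are in minutes.

```python
def is_fresh(age, since_modified, min_age, percent, max_age):
    """Return True if the cached object counts as fresh (times in minutes).

    age            -- how long the object has been on the cache server
    since_modified -- time since creation/modification on the origin server
    min_age, percent, max_age -- the three numbers from the refresh pattern
    """
    if age < min_age:
        return True            # younger than min: fresh
    if age > max_age:
        return False           # older than max: stale
    # lm-factor = age / since_modified, compared against percent / 100.
    # Cross-multiplied so the comparison stays in integer arithmetic.
    return age * 100 < percent * since_modified


# The worked example: refresh_pattern -i \.gif$ 1440 20% 10080
# age 1,800 min, created 10,000 min ago, so lm-factor = 18% < 20%.
print(is_fresh(1800, 10000, 1440, 20, 10080))
```

With the same pattern, an object aged 3,000 minutes that was created 10,000 minutes ago has an lm-factor of 30% and would be treated as stale.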

For objects that rarely change under the same file name, such as video, images, sound, executables, and archives, you can modify the refresh pattern to consider them fresh on your Squid for a longer time, increasing the probability of cache hits. For example, you could extend the refresh pattern above to:

 refresh_pattern ^ftp: 1440 20% 10080 
 refresh_pattern ^gopher: 1440 0% 1440 
 refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
 refresh_pattern -i \.(iso|avi|wav|mp3|mp4|mpeg|swf|flv|x-flv)$ 43200 90% 432000 override-expire ignore-no-cache ignore-no-store ignore-private
 refresh_pattern -i \.(deb|rpm|exe|zip|tar|tgz|ram|rar|bin|ppt|doc|tiff)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
 refresh_pattern -i /index\.(html|htm)$ 0 40% 10080
 refresh_pattern -i \.(html|htm|css|js)$ 1440 40% 40320
 refresh_pattern . 0 40% 40320
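
Squid checks refresh_pattern lines in the order listed and uses the first one whose regex matches the URL, which is why the catch-all "." rule comes last. The sketch below mimics that lookup with a shortened rule list; it uses Python's re module rather than Squid's own regex engine, and the RULES table, the first_match function and the example.com URLs are illustrative assumptions.

```python
import re

# (regex, case-insensitive?, min, percent, max), in configuration order.
RULES = [
    (r"^ftp:", False, 1440, 20, 10080),
    (r"\.(gif|png|jpg|jpeg|ico)$", True, 10080, 90, 43200),
    (r"\.(html|htm|css|js)$", True, 1440, 40, 40320),
    (r".", False, 0, 40, 40320),  # catch-all, must stay last
]


def first_match(url):
    """Return (regex, min, percent, max) of the first rule matching url."""
    for pattern, ignore_case, min_age, percent, max_age in RULES:
        flags = re.IGNORECASE if ignore_case else 0
        if re.search(pattern, url, flags):
            return pattern, min_age, percent, max_age
    return None


for url in ("http://example.com/logo.PNG",
            "http://example.com/page.html",
            "ftp://example.com/file.gif",
            "http://example.com/data"):
    print(url, "->", first_match(url))
```

Note that the ftp URL ending in .gif is handled by the ^ftp: rule because that rule appears first.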

By default, Squid does not cache dynamic content, which it identifies by matching the URL against “cgi-bin” or “?”. Older versions of Squid enforced this with the “hierarchy_stoplist” and “cache deny” settings. Recent versions, starting with 3.1, use a refresh pattern instead, such as refresh_pattern (/cgi-bin/|\?) 0 0% 0. Because refresh patterns are checked in order, you can place a rule for a site whose dynamic content is worth caching ahead of it. For example:

 refresh_pattern -i movies.com/.* 10080 90% 43200
 refresh_pattern (/cgi-bin/|\?) 0 0% 0

For the older versions of Squid, you have to define an access control list (ACL) for the content providers you wish to make exceptions for, and use cache allow to exempt them before the cache deny rule. The following example is from the Squid wiki:

 # Let the client's favourite video site through 
 acl youtube dstdomain .youtube.com
 cache allow youtube
 # Now stop other dynamic stuff being cached 
 hierarchy_stoplist cgi-bin ? 
 acl QUERY urlpath_regex cgi-bin \? 
 cache deny QUERY

Delay pools can also cap the bandwidth that specific traffic consumes. Below, we configure one global (class 1) delay pool at 64 Kbps (8 KBps). Requests whose destination domain is windowsupdate.com are limited to 64 Kbps during the peak period of 10:00-16:00.

 acl winupdate dstdomain .windowsupdate.com 
 acl peakperiod time 10:00-16:00 
 delay_pools 1 
 delay_class 1 1 
 # 64 Kbit/s 
 delay_parameters 1 8000/8000 
 delay_access 1 allow winupdate peakperiod
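
The delay_parameters values are in bytes per second, so a limit quoted in Kbit/s has to be converted first. A small sketch of that arithmetic, assuming 1 Kbit = 1,000 bits; the helper names are hypothetical and simply reproduce the restore/max pair used in the class 1 pool above.

```python
def kbit_to_bytes_per_sec(kbit_per_sec):
    """Convert a rate in Kbit/s (1 Kbit = 1000 bits) to bytes per second."""
    return kbit_per_sec * 1000 // 8


def delay_parameters_line(pool, kbit_per_sec):
    """Build a class 1 delay_parameters line with restore rate == bucket size."""
    rate = kbit_to_bytes_per_sec(kbit_per_sec)
    return f"delay_parameters {pool} {rate}/{rate}"


print(delay_parameters_line(1, 64))
```

For 64 Kbit/s this gives 8000 bytes per second, matching the 8000/8000 in the configuration.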

After making changes like the ones above, my Squid's byte hit rate increased from about 8% to between 26% and 37%. A byte hit rate of 33% means that a third of all traffic is served from your cache rather than over slower links across the Internet.
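
As a rough illustration of what the byte hit rate measures, the sketch below divides the bytes served with a cache hit by the total bytes served. The input is a list of (result code, bytes) pairs such as could be extracted from access.log; the sample numbers are invented for the example, and treating every result code that contains "HIT" as a cache hit is a simplifying assumption.

```python
def byte_hit_ratio(entries):
    """Fraction of all bytes that were served with a *HIT* result code."""
    total = sum(size for _, size in entries)
    if total == 0:
        return 0.0
    hit = sum(size for result, size in entries if "HIT" in result)
    return hit / total


# Made-up sample: 3,000 of 9,000 bytes came from the cache.
sample = [
    ("TCP_MISS", 6000),
    ("TCP_HIT", 2000),
    ("TCP_MEM_HIT", 1000),
]
print(f"{byte_hit_ratio(sample):.0%}")
```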

Proxy AIM, MSN, Yahoo, GTalk, ...

To proxy and allow AIM, MSN, Yahoo and GTalk instant messenger traffic through Squid, add the following lines to the Squid configuration file, or change the existing ones to match.

# Allow AIM protocols

acl AIM_ports port 5190 9898 6667
acl AIM_domains dstdomain .oscar.aol.com .blue.aol.com .freenode.net
acl AIM_domains dstdomain .messaging.aol.com .aim.com
acl AIM_hosts dstdomain login.oscar.aol.com login.glogin.messaging.aol.com toc.oscar.aol.com irc.freenode.net
acl AIM_nets dst 64.12.0.0/255.255.0.0
acl AIM_methods method CONNECT
http_access allow AIM_methods AIM_ports AIM_nets
http_access allow AIM_methods AIM_ports AIM_hosts
http_access allow AIM_methods AIM_ports AIM_domains

# Allow Yahoo Messenger

acl YIM_ports port 5050
acl YIM_domains dstdomain .yahoo.com .yahoo.co.jp
acl YIM_hosts dstdomain scs.msg.yahoo.com cs.yahoo.co.jp
acl YIM_methods method CONNECT
http_access allow YIM_methods YIM_ports YIM_hosts
http_access allow YIM_methods YIM_ports YIM_domains

# Allow GTalk

acl GTALK_ports port 5222 5050
acl GTALK_domains dstdomain .google.com
acl GTALK_hosts dstdomain talk.google.com
acl GTALK_methods method CONNECT
http_access allow GTALK_methods GTALK_ports GTALK_hosts
http_access allow GTALK_methods GTALK_ports GTALK_domains

# Allow MSN

acl MSN_ports port 1863 443 1503
acl MSN_domains dstdomain .microsoft.com .hotmail.com .live.com .msft.net .msn.com .passport.com
acl MSN_hosts dstdomain messenger.hotmail.com
acl MSN_nets dst 207.46.111.0/255.255.255.0
acl MSN_methods method CONNECT
http_access allow MSN_methods MSN_ports MSN_hosts