Caching Tutorial

by: Mark Nottingham

What’s a Web Cache? Why do people use them?

A Web cache sits between one or more Web servers (also known as origin servers) and a client or many clients, and watches requests come by, saving copies of the responses (HTML pages, images and files, collectively known as representations) for itself. Then, if there is another request for the same URL, it can use the response that it has, instead of asking the origin server for it again.

There are two main reasons that Web caches are used: to reduce latency and to reduce network traffic.
Kinds of Web Caches

Browser Caches

If you examine the preferences dialog of any modern Web browser (like Internet Explorer, Safari or Mozilla), you’ll probably notice a “cache” setting. This lets you set aside a section of your computer’s hard disk to store representations that you’ve seen, just for you.

The browser cache works according to fairly simple rules. It will check to make sure that the representations are fresh, usually once a session (that is, once in the current invocation of the browser).

This cache is especially useful when users hit the “back” button or click a link to see a page they’ve just looked at. Also, if you use the same navigation images throughout your site, they’ll be served from browsers’ caches almost instantaneously.

Proxy Caches

Web proxy caches work on the same principle, but on a much larger scale. Proxies serve hundreds or thousands of users in the same way; large corporations and ISPs often set them up on their firewalls, or as standalone devices (also known as intermediaries).

Because proxy caches aren’t part of the client or the origin server, but instead are out on the network, requests have to be routed to them somehow. One way to do this is to use your browser’s proxy setting to manually tell it what proxy to use; another is interception. Interception proxies have Web requests redirected to them by the underlying network itself, so that clients don’t need to be configured for them, or even know about them.

Proxy caches are a type of shared cache; rather than just having one person using them, they usually have a large number of users. Because popular representations are reused many times, they are very good at reducing latency and network traffic.
Gateway Caches

Also known as “reverse proxy caches” or “surrogate caches,” gateway caches are also intermediaries, but instead of being deployed by network administrators to save bandwidth, they’re typically deployed by Webmasters themselves, to make their sites more scalable, reliable and better performing.

Requests can be routed to gateway caches by a number of methods, but typically some form of load balancer is used to make one or more of them look like the origin server to clients.

Content delivery networks (CDNs) distribute gateway caches throughout the Internet (or a part of it) and sell caching to interested Web sites. Speedera and Akamai are examples of CDNs.

This tutorial focuses mostly on browser and proxy caches, although some of the information is suitable for those interested in gateway caches as well.

Aren’t Web Caches bad for me? Why should I help them?

Web caching is one of the most misunderstood technologies on the Internet. Webmasters in particular fear losing control of their site, because a proxy cache can “hide” their users from them, making it difficult to see who’s using the site.

Unfortunately for them, even if Web caches didn’t exist, there are too many variables on the Internet to ensure that they’ll be able to get an accurate picture of how users see their site. If this is a big concern for you, this tutorial will teach you how to get the statistics you need without making your site cache-unfriendly.

Another concern is that caches can serve content that is out of date, or stale. However, this tutorial can show you how to configure your server to control how your content is cached.

On the other hand, if you plan your site well, caches can help your Web site load faster, and save load on your server and Internet link. The difference can be dramatic; a site that is difficult to cache may take several seconds to load, while one that takes advantage of caching can seem instantaneous in comparison.
Users will appreciate a fast-loading site, and will visit more often.

Think of it this way: many large Internet companies are spending millions of dollars setting up farms of servers around the world to replicate their content, in order to make it as fast to access as possible for their users. Caches do the same for you, and they’re even closer to the end user. Best of all, you don’t have to pay for them.

The fact is that proxy and browser caches will be used whether you like it or not. If you don’t configure your site to be cached correctly, it will be cached using whatever defaults the cache’s administrator decides upon.

How Web Caches Work

All caches have a set of rules that they use to determine when to serve a representation from the cache, if it’s available. Some of these rules are set in the protocols (HTTP 1.0 and 1.1), and some are set by the administrator of the cache (either the user of the browser cache, or the proxy administrator).

Generally speaking, these are the most common rules that are followed (don’t worry if you don’t understand the details; they will be explained below):
Together, freshness and validation are the most important ways that a cache works with content. A fresh representation will be available instantly from the cache, while a validated representation will avoid sending the entire representation over again if it hasn’t changed.

How (and how not) to Control Caches

There are several tools that Web designers and Webmasters can use to fine-tune how caches will treat their sites. It may require getting your hands a little dirty with your server’s configuration, but the results are worth it. For details on how to use these tools with your server, see the Implementation sections below.

HTML Meta Tags and HTTP Headers

HTML authors can put tags in a document’s <HEAD> section that describe its attributes. These meta tags are often used in the belief that they can mark a document as uncacheable, or expire it at a certain time.

Meta tags are easy to use, but aren’t very effective. That’s because they’re only honored by a few browser caches (which actually read the HTML), not proxy caches (which almost never read the HTML in the document). While it may be tempting to put a Pragma: no-cache meta tag into a Web page, it won’t necessarily cause it to be kept fresh.

On the other hand, true HTTP headers give you a lot of control over how both browser caches and proxies handle your representations. They can’t be seen in the HTML, and are usually automatically generated by the Web server. However, you can control them to some degree, depending on the server you use. In the following sections, you’ll see what HTTP headers are interesting, and how to apply them to your site.

HTTP headers are sent by the server before the HTML, and only seen by the browser and any intermediate caches. Typical HTTP 1.1 response headers might look like this:

HTTP/1.1 200 OK

The HTML would follow these headers, separated by a blank line. See the Implementation sections for information about how to set HTTP headers.
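To make the layout described above concrete, here is a minimal Python sketch that splits a raw response into its status line, headers and body at the first blank line. The function name `split_response` and every header value in the sample response are assumptions made up for illustration; they are not taken from a real server.

```python
def split_response(raw: str):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # The headers end at the first blank line; HTTP lines end in CRLF.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so dates keep their own colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hypothetical response; the header values are illustrative only.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Cache-Control: max-age=3600, must-revalidate\r\n"
    "Expires: Fri, 30 Oct 1998 14:19:41 GMT\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>...</html>"
)
status_line, headers, body = split_response(raw)
print(status_line)
print(headers["cache-control"])
```

A cache only needs the part before the blank line to decide how to treat a representation, which is why headers are more dependable than anything placed inside the HTML.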
Pragma HTTP Headers (and why they don’t work)

Many people believe that assigning a

Controlling Freshness with the Expires HTTP Header

The Most Web servers allow you to set
The only value valid in an Expires header is an HTTP date. For example:

Expires: Fri, 30 Oct 1998 14:19:41 GMT
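Producing a date in the format shown above is easy to get wrong by hand. The following is a sketch, assuming Python's standard library is available; the helper names `http_date` and `expires_header` are hypothetical, and the datetimes passed in are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def http_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    # Convert to UTC first; usegmt=True makes the zone read "GMT".
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def expires_header(now: datetime, lifetime: timedelta) -> str:
    """Build an Expires header that is `lifetime` after `now`."""
    return "Expires: " + http_date(now + lifetime)


# One day after 29 Oct 1998 14:19:41 GMT.
start = datetime(1998, 10, 29, 14, 19, 41, tzinfo=timezone.utc)
print(expires_header(start, timedelta(days=1)))
```

The conversion to UTC matters because an HTTP date is always expressed in GMT, whatever the server's local time zone is.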
Although the Another problem with

Cache-Control HTTP Headers

HTTP 1.1 introduced a new class of headers, Useful
For example:

Cache-Control: max-age=3600, must-revalidate
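As a rough illustration of how a cache might read a header like the one above, this Python sketch breaks the value into directives and extracts the freshness lifetime. The function names are hypothetical, and splitting on commas is a simplification: it would mishandle a quoted argument that itself contains a comma.

```python
def parse_cache_control(value: str) -> dict:
    """Parse a Cache-Control value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Directive names are case-insensitive; bare ones map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def freshness_lifetime(value: str):
    """Return max-age in seconds, or None if it is absent or malformed."""
    max_age = parse_cache_control(value).get("max-age")
    if max_age is not None and max_age.isdigit():
        return int(max_age)
    return None


example = "max-age=3600, must-revalidate"
print(parse_cache_control(example))
print(freshness_lifetime(example))
```

Here the representation would be considered fresh for 3600 seconds (one hour) after the request.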
If you plan to use the

Validators and Validation

In How Web Caches Work, we said that validation is used by servers and caches to communicate when a representation has changed. By using it, caches avoid having to download the entire representation when they already have a copy locally, but they’re not sure if it’s still fresh.

Validators are very important; if one isn’t present, and there isn’t any freshness information (

The most common validator is the time that the document last changed, as communicated in

HTTP 1.1 introduced a new kind of validator called the ETag. ETags are unique identifiers that are generated by the server and changed every time the representation does. Because the server controls how the ETag is generated, caches can be surer that if the ETag matches when they make a

Almost all caches use Last-Modified times in determining if a representation is fresh; ETag validation is also becoming prevalent.

Most modern Web servers will generate both

Tips for Building a Cache-Aware Site

Besides using freshness information and validation, there are a number of other things you can do to make your site more cache-friendly.
Writing Cache-Aware Scripts

By default, most scripts won’t return a validator (a Last-Modified or ETag HTTP header).

Generally speaking, if a script produces output that is reproducible with the same request at a later time (whether it be minutes or days later), it should be cacheable. If the content of the script changes only depending on what’s in the URL, it is cacheable; if the output depends on a cookie, authentication information or other external criteria, it probably isn’t.
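One way a script can supply its own validator is to derive an ETag from the output it is about to send, and answer with 304 Not Modified when the client already holds that version. The Python sketch below is illustrative only: the function names, the choice of SHA-256, and the max-age of 600 seconds are assumptions, not something prescribed by this tutorial.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the response body."""
    # Any value that changes whenever the body changes would do.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, body), validating against If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "max-age=600"}
    if if_none_match is not None:
        candidates = [c.strip() for c in if_none_match.split(",")]
        # Ignore the weak-validator prefix when comparing.
        candidates = [c[2:] if c.startswith("W/") else c for c in candidates]
        if "*" in candidates or etag in candidates:
            # The client's copy is current; send headers but no body.
            return 304, headers, b""
    return 200, headers, body


status, headers, payload = respond(b"<html>hello</html>")
revalidated = respond(b"<html>hello</html>", headers["ETag"])
print(status, revalidated[0])
```

Because the ETag is computed from the output itself, this only saves bandwidth, not the cost of generating the page; a script that can tell cheaply whether its inputs changed can skip the generation step as well.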
Some other tips;
See the Implementation Notes for more specific information.

Frequently Asked Questions

What are the most important things to make cacheable?

A good strategy is to identify the most popular, largest representations (especially images) and work with them first.

How can I make my pages as fast as possible with caches?

The most cacheable representation is one with a long freshness time set. Validation does help reduce the time that it takes to see a representation, but the cache still has to contact the origin server to see if it’s fresh. If the cache already knows it’s fresh, it will be served directly.

I understand that caching is good, but I need to keep statistics on how many people visit my page!

If you must know every time a page is accessed, select ONE small item on a page (or the page itself), and make it uncacheable by giving it suitable headers. For example, you could refer to a 1x1 transparent uncacheable image from each page. The

Be aware that even this will not give truly accurate statistics about your users, and is unfriendly to the Internet and your users; it generates unnecessary traffic, and forces people to wait for that uncached item to be downloaded. For more information about this, see On Interpreting Access Statistics in the references.

How can I see a representation’s HTTP headers?

Many Web browsers let you see the

To see the full headers of a representation, you can manually connect to the Web server using a Telnet client.

To do so, you may need to type the port (by default, 80) into a separate field, or you may need to connect to

Once you’ve opened a connection to the site, type a request for the representation. For instance, if you want to see the headers for

GET /foo.html HTTP/1.1 [return]

Press the Return key every time you see [return].

My pages are password-protected; how do proxy caches deal with them?

By default, pages protected with HTTP authentication are considered private; they will not be kept by shared caches. However, you can make authenticated pages public with a Cache-Control: public header; HTTP 1.1-compliant caches will then allow them to be cached.

If you’d like such pages to be cacheable, but still authenticated for every user, combine the public and no-cache directives:

Cache-Control: public, no-cache
Whether or not this is done, it’s best to minimize use of authentication; for example, if your images are not sensitive, put them in a separate directory and configure your server not to force authentication for it. That way, those images will be naturally cacheable.

Should I worry about security if people access my site through a cache?

SSL pages are not cached (or decrypted) by proxy caches, so you don’t have to worry about that. However, because caches store non-SSL requests and URLs fetched through them, you should be conscious about unsecured sites; an unscrupulous administrator could conceivably gather information about their users, especially in the URL.

In fact, any administrator on the network between your server and your clients could gather this type of information. One particular problem is when CGI scripts put usernames and passwords in the URL itself; this makes it trivial for others to find and use their login.

If you’re aware of the issues surrounding Web security in general, you shouldn’t have any surprises from proxy caches.

I’m looking for an integrated Web publishing solution. Which ones are cache-aware?

It varies. Generally speaking, the more complex a solution is, the more difficult it is to cache. The worst are ones which dynamically generate all content and don’t provide validators; they may not be cacheable at all. Speak with your vendor’s technical staff for more information, and see the Implementation notes below.

My images expire a month from now, but I need to change them in the caches now!

The Expires header can’t be circumvented; unless the cache (either browser or proxy) runs out of room and has to delete the representations, the cached copy will be used until then.

The most effective solution is to change any links to them; that way, completely new representations will be loaded fresh from the origin server. Remember that the page that refers to a representation will be cached as well. Because of this, it’s best to make static images and similar representations very cacheable, while keeping the HTML pages that refer to them on a tight leash.

If you want to reload a representation from a specific cache, you can either force a reload (in Firefox, holding down shift while pressing ‘reload’ will do this by issuing a

I run a Web Hosting service. How can I let my users publish cache-friendly pages?

If you’re using Apache, consider allowing them to use .htaccess files and providing appropriate documentation.

Otherwise, you can establish predetermined areas for various caching attributes in each virtual server. For instance, you could specify a directory /cache-1m that will be cached for one month after access, and a /no-cache area that will be served with headers instructing caches not to store representations from it.

Whatever you are able to do, it is best to work with your largest customers first on caching. Most of the savings (in bandwidth and in load on your servers) will be realized from high-volume sites.

I’ve marked my pages as cacheable, but my browser keeps requesting them on every request. How do I force the cache to keep representations of them?

Caches aren’t required to keep a representation and reuse it; they’re only required to not keep or use them under some conditions. All caches make decisions about which representations to keep based upon their size, type (e.g., image vs. html), or by how much space they have left to keep local copies. Yours may not be considered worth keeping around, compared to more popular or larger representations.

Some caches do allow their administrators to prioritize what kinds of representations are kept, and some allow representations to be “pinned” in cache, so that they’re always available.

Implementation Notes: Web Servers

Generally speaking, it’s best to use the latest version of whatever Web server you’ve chosen to deploy. Not only will they likely contain more cache-friendly features, new versions also usually have important security and performance improvements.

Apache HTTP Server

Apache uses optional modules to include headers, including both Expires and Cache-Control. Both modules are available in the 1.2 or greater distribution.
The modules need to be built into Apache; although they are included in the distribution, they are not turned on by default. To find out if the modules are enabled in your server, find the httpd binary and run httpd -l; this should print a list of the available modules. The modules we’re looking for are mod_expires and mod_headers.
Once you have an Apache with the appropriate modules, you can use mod_expires to specify when representations should expire, either in .htaccess files or in the server’s access.conf file. You can specify expiry from either access or modification time, and apply it to a file type or as a default. See the module documentation for more information, and speak with your local Apache guru if you have trouble.

To apply

Here’s an example .htaccess file that demonstrates the use of some headers.
### activate mod_expires
Apache 2.0’s configuration is very similar to that of 1.3; see the 2.0 mod_expires and mod_headers documentation for more information.

Microsoft IIS

Microsoft’s Internet Information Server makes it very easy to set headers in a somewhat flexible way. Note that this is only possible in version 4 of the server, which will run only on NT Server.

To specify headers for an area of a site, select it in the Administration Tools interface, and bring up its properties. After selecting the HTTP Headers tab, you should see two interesting areas: Enable Content Expiration and Custom HTTP headers. The first should be self-explanatory, and the second can be used to apply Cache-Control headers.

See the ASP section below for information about setting headers in Active Server Pages. It is also possible to set headers from ISAPI modules; refer to MSDN for details.

Netscape/iPlanet Enterprise Server

As of version 3.6, Enterprise Server does not provide any obvious way to set Expires headers. However, it has supported HTTP 1.1 features since version 3.0. This means that HTTP 1.1 caches (proxy and browser) will be able to take advantage of Cache-Control settings you make.

To use Cache-Control headers, choose Content Management | Cache Control Directives in the administration server. Then, using the Resource Picker, choose the directory where you want to set the headers. After setting the headers, click ‘OK’. For more information, see the NES manual.

Implementation Notes: Server-Side Scripting

Because the emphasis in server-side scripting is on dynamic content, it doesn’t make for very cacheable pages, even when the content could be cached. If your content changes often, but not on every page hit, consider setting a Cache-Control: max-age header; most users access pages again in a relatively short period of time. For instance, when users hit the ‘back’ button, if there isn’t any validator or freshness information available, they’ll have to wait until the page is re-downloaded from the server to see it.

CGI

CGI scripts are one of the most popular ways to generate content. You can easily append HTTP response headers by adding them before you send the body; most CGI implementations already require you to do this for the Content-Type header. For instance, in Perl:

#!/usr/bin/perl

Since it’s all text, you can easily generate

print "Cache-Control: max-age=600\n";
This will make the script cacheable for 10 minutes after the request, so that if the user hits the ‘back’ button, they won’t be resubmitting the request.

The CGI specification also makes request headers that the client sends available in the environment of the script; each header has ‘HTTP_’ prepended to its name. So, if a client makes an If-Modified-Since request, it will look like this:

HTTP_IF_MODIFIED_SINCE = Fri, 30 Oct 1998 14:19:41 GMT
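A script can use that environment variable to decide whether to send the body at all. The sketch below shows one possible check in Python, assuming the script knows when its content last changed; the function name `not_modified` is hypothetical, and `last_modified` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def not_modified(environ, last_modified: datetime) -> bool:
    """True if the client's copy is at least as new as last_modified."""
    raw = environ.get("HTTP_IF_MODIFIED_SINCE")
    if not raw:
        return False
    try:
        since = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Unparseable date: fall back to sending the full response.
        return False
    if since.tzinfo is None:
        # Treat a date without zone information as GMT.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    return last_modified.replace(microsecond=0) <= since


# In a real CGI script, environ would be os.environ.
environ = {"HTTP_IF_MODIFIED_SINCE": "Fri, 30 Oct 1998 14:19:41 GMT"}
changed_at = datetime(1998, 10, 30, 14, 19, 41, tzinfo=timezone.utc)
print(not_modified(environ, changed_at))
```

When the check succeeds, the script would answer with a 304 Not Modified status and no body; otherwise it sends the full response with a fresh Last-Modified header.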
See also the cgi_buffer library, which automatically handles ETag generation and validation, Content-Length generation and gzip Content-Encoding.

Server Side Includes

SSI (often used with the extension .shtml) is one of the first ways that Web publishers were able to get dynamic content into pages. By using special tags in the pages, a limited form of in-HTML scripting was available.

Most implementations of SSI do not set validators, and as such are not cacheable. However, Apache’s implementation does allow users to specify which SSI files can be cached, by setting the group execute permissions on the appropriate files, combined with the

PHP

PHP is a server-side scripting language that, when built into the server, can be used to embed scripts inside a page’s HTML, much like SSI, but with a far larger number of options. PHP can be used as a CGI script on any Web server (Unix or Windows), or as an Apache module.

By default, representations processed by PHP are not assigned validators, and are therefore uncacheable. However, developers can set HTTP headers by using the header() function.

For example, this will create a Cache-Control header, as well as an Expires header three days in the future:

<?php

Remember that the

As you can see, you’ll have to create the HTTP date for an

For more information, see the manual entry for header.

See also the cgi_buffer library, which automatically handles ETag generation and validation, Content-Length generation and gzip Content-Encoding.

Cold Fusion

Cold Fusion, by Macromedia, is a commercial server-side scripting engine, with support for several Web servers on Windows, Linux and several flavors of Unix.

Cold Fusion makes setting arbitrary HTTP headers relatively easy, with the CFHEADER tag. Unfortunately, their example for setting an Expires header is misleading:

<CFHEADER NAME="Expires" VALUE="#Now()#">

It doesn’t work like you might think, because the time (in this case, when the request is made) doesn’t get converted to an HTTP-valid date; instead, it just gets printed as a representation of Cold Fusion’s Date/Time object. Most clients will either ignore such a value, or convert it to a default, like January 1, 1970.

However, Cold Fusion does provide a date formatting function that will do the job: GetHttpTimeString. In combination with DateAdd, it’s easy to set Expires dates; here, we set a header to declare that representations of the page expire in one month:

<cfheader name="Expires" value="#GetHttpTimeString(DateAdd('m', 1, Now()))#">
You can also use the

Remember that Web server headers are passed through in some deployments of Cold Fusion (such as CGI); check yours to determine whether you can use this to your advantage, by setting headers on the server instead of in Cold Fusion.

ASP and ASP.NET

Active Server Pages, built into IIS and also available for other Web servers, also allows you to set HTTP headers. For instance, to set an expiry time, you can use the properties of the Response object:

<% Response.Expires=1440 %>

specifying the number of minutes from the request to expire the representation. Likewise, absolute expiry time can be set like this (make sure you format the HTTP date correctly):

<% Response.ExpiresAbsolute=#May 31,1996 13:30:15 GMT# %>

<% Response.CacheControl="public" %>

In ASP.NET,

Response.Cache.SetExpires ( DateTime.Now.AddMinutes ( 60 ) ) ;

See the MSDN documentation for more information.

References and Further Information

HTTP 1.1 Specification

The HTTP 1.1 spec has many extensions for making pages cacheable, and is the authoritative guide to implementing the protocol. See sections 13, 14.9, 14.21, and 14.25.

Web-Caching.com

An excellent introduction to caching concepts, with links to other online resources.

On Interpreting Access Statistics

Jeff Goldberg’s informative rant on why you shouldn’t rely on access statistics and hit counters.

Cacheability Engine

Examines Web pages to determine how they will interact with Web caches; the Engine is a good debugging tool, and a companion to this tutorial.

cgi_buffer Library

One-line include in Perl CGI, Python CGI and PHP scripts that automatically and correctly handles ETag generation and validation, Content-Length generation and gzip Content-Encoding. The Python version can also be used as a wrapper around arbitrary CGI scripts.

© 2008 NetVisits, Inc. All rights reserved.