Caching in a Nutshell

Since the beginning of the web, browsers have performed tricks to get as much performance out of the web as possible. One of the most effective tricks browsers have is caching web content. It works like this: the first time you browse to a given site, your browser downloads all the HTML, GIF images, JPEG images, etc. and displays them. But it also stores them for later, so the next time you visit that site, it can save the download time by just reloading them from your disk rather than reloading them from the internet.

In addition, some organizations run caching web proxies. These programs act as a middleman to the web. Each browser asks the proxy for all of its web pages instead of downloading them directly from the internet. Just like the browsers, the web proxy stores all the files downloaded from the web, and when a request comes in for something it has already seen, it sends it back to the browser without reloading it from the internet. By reducing the amount of duplicate information downloaded, the organization saves network bandwidth, and the web pages generally load faster for the users (it also provides the organization with a convenient method to monitor what its members are browsing, but that's a topic for another writeup).

The problem of dynamic content

For the above scheme to work correctly, unfortunately, the original web pages that have been copied into caches must never change. If they do, then browsers and proxies will end up serving content that is out of date.
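The mechanism described above, and its weakness, can be sketched in a few lines of Python. This is only an illustration: NaiveCache, the origin dictionary and the example.com URL are made up for the example, and real browsers and proxies are far more elaborate.

```python
from typing import Callable, Dict


class NaiveCache:
    """Keeps every response forever, like a cache with no freshness rules."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch              # function that downloads a URL
        self._store: Dict[str, str] = {}
        self.downloads = 0               # how often we really hit the network

    def get(self, url: str) -> str:
        if url not in self._store:       # first visit: download and keep a copy
            self._store[url] = self._fetch(url)
            self.downloads += 1
        return self._store[url]          # later visits: reuse the stored copy


# Hypothetical origin server, modelled as a dictionary of URL -> content.
origin = {"http://example.com/news": "Monday's headlines"}
cache = NaiveCache(lambda url: origin[url])

first = cache.get("http://example.com/news")
origin["http://example.com/news"] = "Tuesday's headlines"  # the page changes
second = cache.get("http://example.com/news")

print(cache.downloads)  # 1
print(second)           # Monday's headlines
```

The second request saves a download, but it also returns Monday's page after the original has changed, which is exactly the out-of-date problem.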

Web documents generally fall into one of four categories:

  1. those that are different each time they are loaded (example: a page generated dynamically by a web application);
  2. those that change very often on a regular basis (example: the main page of a news site);
  3. those that change occasionally on an irregular basis (example: a personal homepage or an instruction manual);
  4. and finally those that never change (example: a company logo image, an archived article).
Each individual piece of the page can be in its own category: for example, a search page that always changes could contain a logo image that never changes. Some of these categories are great candidates for caching, like the static image. Some are terrible, like the dynamically generated web page. Unfortunately, it's not always easy for the browser/proxy to know which is which.

Cache control

If the web server doesn't say anything about caching a particular page or image, the browser/proxy generally makes an educated guess based on a few rules. First, a page that uses HTTP authentication or SSL will not be cached, for security reasons. Also, if the server provides the date and time the page was last modified with the Last-Modified header (which most web servers automatically do), the browser/proxy will use a cached version if the page hasn't been modified for a while. Additionally, browsers are less likely to cache the results of a form submission or a URL that has parameters.
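As a rough sketch of the Last-Modified guess described above: the HTTP caching specification mentions, as a typical setting, treating a page as fresh for about 10% of the time it had gone unmodified when it was fetched. The function names and the 10% default below are assumptions for illustration; each browser and proxy applies its own rules.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def heuristic_lifetime(date_header, last_modified_header, fraction=0.1):
    """Guess how long a response may be reused when no caching headers were sent."""
    fetched = parsedate_to_datetime(date_header)
    modified = parsedate_to_datetime(last_modified_header)
    unchanged_for = fetched - modified
    if unchanged_for <= timedelta(0):
        return timedelta(0)          # modified "in the future": do not guess
    return unchanged_for * fraction  # e.g. 10% of the time it sat unchanged


def looks_fresh(now, date_header, last_modified_header):
    """True if the cached copy is still within its guessed lifetime."""
    fetched = parsedate_to_datetime(date_header)
    return now - fetched < heuristic_lifetime(date_header, last_modified_header)


# A page that had been unchanged for 10 days when it was fetched
# is reused for roughly one day under the 10% rule.
lifetime = heuristic_lifetime(
    "Fri, 11 Jan 2019 00:00:00 GMT",  # Date header: when it was fetched
    "Tue, 01 Jan 2019 00:00:00 GMT",  # Last-Modified header
)
print(lifetime)
```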

The HTTP specification allows the web server itself to tell the browser/proxy what it should really do. It can say that this page will only change at a given time, and so the cached version can be considered valid until then; by setting this time to a large enough interval, it can effectively say this page never changes. Or it can say this page shouldn't be cached at all.

The web server does this with some special, optional headers. The most important of these is Cache-Control. It can have the following values:

  • max-age=x - This means the browser or proxy can reuse the object in its cache for x seconds; after that, it must reload it to ensure it is fresh.
  • public - This means the page can be cached, even if it uses SSL or HTTP authentication; usually those pages are not cached for security reasons.
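  • no-cache - This means the browser or proxy must check with the server before reusing a cached copy, so a stale page is never shown. Despite the name, it does not forbid keeping a copy; the separate value no-store is the one that says "don't store this at all".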
  • must-revalidate - This means follow the rules strictly. The HTTP specification gives browsers some leeway in deciding when pages should be reloaded; this value tells them to follow the freshness rules to the letter.
There are two other headers that provide similar functionality: Expires and Pragma. Expires is similar to setting max-age with Cache-Control, except that you specify the date and time until which the page is valid rather than an interval. Pragma: no-cache is approximately equivalent to Cache-Control: no-cache. It was a convention used before HTTP/1.1 defined a proper way to do it, and it may still be useful for clients that only understand HTTP/1.0 or lower.
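To make the headers above concrete, here is a minimal sketch that builds them in Python and prints them the way a CGI script would. The function names are invented for this example, and pairing Cache-Control with Expires or Pragma is just one reasonable way to cover older clients.

```python
import time
from email.utils import formatdate
from typing import Dict, Optional


def cacheable_headers(max_age_seconds: int, now: Optional[float] = None) -> Dict[str, str]:
    """Headers saying: this object may be reused for max_age_seconds."""
    now = time.time() if now is None else now
    return {
        # The interval, for HTTP/1.1 clients.
        "Cache-Control": f"max-age={max_age_seconds}",
        # The same moment as an absolute date in HTTP format.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }


def uncacheable_headers() -> Dict[str, str]:
    """Headers telling caches not to reuse a stored copy without asking the server."""
    return {
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",  # for clients that only understand HTTP/1.0
    }


# A logo that may be reused for one day, printed as HTTP header lines.
for name, value in cacheable_headers(24 * 60 * 60).items():
    print(f"{name}: {value}")
```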

Why You Should Control the Cache

By investing a little time and thought, you can reduce your bandwidth usage, improve the perceived performance, and ensure your users see up-to-date content. The time, bandwidth and server load saved by not transferring redundant copies of static content may even improve performance of the dynamic elements elsewhere on your site. The question is not "Why should I do this?" but "Why haven't I already?"

A common worry of content providers is, "if I encourage caching, I won't have accurate statistics on how my website is accessed". This is mostly a non-issue, because you can still rely on the statistics from your dynamic content. You can also set up a small, non-cacheable image on each of your pages specifically for this purpose.

Strategies for webmasters and programmers

First, don't use HTML META tags with http-equiv to generate these headers. Caching proxies usually only look at the HTTP headers, not the HTML content, so such tags simply won't work with most proxies.

  1. Find out how to add these headers to your content:
    • Apache: Compile with --enable-module=headers and/or --enable-module=expires; add configuration directives to either access.conf or .htaccess.
    • MS IIS: select the web site in Administration Tools and bring up the properties. The two relevant options are Enable Content Expiration and Custom HTTP headers.
    • CGI scripts can print out the headers directly.
    • PHP can use the header() function to generate the headers.
    • ColdFusion can use the CFHEADER tag.
    • ASP can set the Response.Expires and Response.CacheControl properties.
    • Java Servlets can use the HttpServletResponse.addHeader and HttpServletResponse.addDateHeader methods.
  2. Find out a way to view the page headers that are coming back:
    • This allows you to ensure your changes really work.
    • My personal favorite is http://webtools.mozilla.org/web-sniffer/.
  3. Determine which content is truly static and apply headers to reduce reloading:
    • Generally, images are good candidates, especially if they are used on multiple pages of the site.
    • Make sure you refer to your static content consistently. Don't use different URLs to serve the same content; for example, don't embed a user id in the URL of content that is the same for every user.
    • Adopt a policy of giving updated content a new name, so that your mostly static content never needs to expire. For example, if your company updates its logo every year or so, leave the old logo alone and put the new logo on the server with a new name. Then update all the pages to point to the new logo. Or better yet, have the logo URL redirect to a real, versioned logo URL.
    • Example: the Google logo (http://www.google.com/images/logo.gif) expires on January 17, 2038.
  4. Determine which content is truly dynamic and apply headers to indicate as such:
    • Cached dynamic pages are the source of many debugging nightmares when developing web applications. This whole category of problems can be avoided by telling browsers and proxies not to reuse dynamic content without checking with the server.
  5. Determine which SSL or HTTP authenticated content is not sensitive and can be cached:
    • This can make a world of difference on secure sites; remember, browsers and proxies are not supposed to cache anything here, so the server is hit with a full reload each time it is accessed.
    • Cache-Control: public is your friend.
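Two of the steps above lend themselves to a short Python sketch: checking which caching headers actually come back (step 2) and giving updated static content a new name (step 3). Everything here is hypothetical and for illustration only: the function names, the choice of headers to report, and the use of an 8-character content hash as the "new name" (a version number or date would serve just as well).

```python
import hashlib
import urllib.request

# Header names worth checking, compared case-insensitively.
CACHE_HEADERS = ("cache-control", "expires", "pragma", "last-modified")


def caching_headers(headers):
    """Keep only the caching-related entries from (name, value) pairs."""
    return {name: value for name, value in headers if name.lower() in CACHE_HEADERS}


def fetch_caching_headers(url):
    """Ask the server for the headers only (HEAD) and report the caching ones."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return caching_headers(response.headers.items())


def versioned_name(filename, content):
    """Derive a new file name whenever the content changes, e.g. logo.<hash>.gif."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, extension = filename.rpartition(".")
    if not dot:                        # no extension: just append the hash
        return f"{filename}.{digest}"
    return f"{stem}.{digest}.{extension}"


sample = [("Content-Type", "image/gif"), ("Cache-Control", "max-age=86400")]
print(caching_headers(sample))
print(versioned_name("logo.gif", b"old logo bytes"))
print(versioned_name("logo.gif", b"new logo bytes"))
```

Calling fetch_caching_headers with the URL of one of your own pages shows whether your server configuration took effect. Because a versioned name changes whenever the content does, the file behind each name can safely be given a very long lifetime.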