 Published on Linux DevCenter (http://www.linuxdevcenter.com/)


Ads in Cache-Friendly Pages

by Jennifer Vesperman
03/21/2002

You want to make your Web pages cache-friendly, but you're worried you'll lose advertising revenue. Ads are a reality in our current model of the Web, and it's important to ensure that they work and to keep track of how often they are served. Fortunately, it's easy to make ads work without losing cache friendliness in our pages.

To explain how to make ads uncacheable while having a cacheable page, I need to explain entities and selectively applying Cache-control headers. For a detailed description of Cache-control headers, see Cache-Friendly Web Pages.

The principles described here work for all the major elements in a Web page. Once you understand them, you can selectively apply cache headers to every entity on your pages.

Entities and cache control

HTTP is a client-server protocol. The client initiates the contact, asking the server to provide the entity described in the URI (Uniform Resource Identifier). A URI can be a location (a URL) or a name (a URN). URLs are the most common, but HTTP can handle either.

The server returns a response. The response consists of some HTTP headers and an "entity." The entity includes entity headers and a body that is (we hope) the requested data. Expires is an entity header; Cache-control is formally a general header, but in a response both describe only the entity they are sent with.

Inside the entity, especially if the entity is HTML, there may be URIs referring to additional entities. HTML image tags are the most common of these, and Web browsers read image tags and then send additional GET requests for the new entities to Web servers.

Any time you include a URI (relative or full) in your HTML, you may be referring to another entity. The exception is a fragment identifier, such as #foo, that points into the entity that contains it. If stripping the fragment off leaves you pointing to the same page, it's the same entity. Other than that, every distinct URI refers to a distinct entity.
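The fragment rule can be checked mechanically. The following is a minimal sketch using Python's standard urllib.parse module; the function name same_entity and the example.com URLs are assumptions made for illustration, not part of any specification.

```python
from urllib.parse import urldefrag, urljoin

def same_entity(base_url, href):
    """Return True if href, resolved against base_url, names the same entity.

    Two URIs are treated as the same entity when they are identical
    once any #fragment has been stripped off.
    """
    resolved = urljoin(base_url, href)
    return urldefrag(resolved).url == urldefrag(base_url).url

# Hypothetical page URL, used only for this example.
page = "http://www.example.com/articles/ads.html"

print(same_entity(page, "#foo"))               # fragment only
print(same_entity(page, "ads.html#bar"))       # same page, other fragment
print(same_entity(page, "images/banner.gif"))  # a separate entity
```

The first two calls report the same entity, so the browser needs no new request for them; the third is a distinct entity with its own request and its own cache headers.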

The browser (or other client) will make a separate request to pull down each entity. In the case of a link, it waits for the user to initiate the request. In the case of images, it usually initiates the request itself. (Some browsers do not request images, or do so only on user-initiated request.) Other included objects may or may not be automatically downloaded; please see the HTML specification to determine what the browser is expected to do with them.

All images, including images that are ads, are individual entities. And individual entities have their own, individual Cache-control or Expires headers. So we can have ads in cache-friendly Web pages without actually caching the ads.

This fact also allows us to have images with very long expiry times in frequently changing (and rapidly expiring) Web pages, and to have other elements uncacheable.
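As an illustration of applying different cache headers to different entities, here is a small sketch of a per-path policy table. The paths, lifetimes, and the function name cache_headers are all hypothetical; a real site would choose values to suit its own content.

```python
# Hypothetical policy table: path prefixes and lifetimes are illustrative only.
# More specific prefixes come first because the first match wins.
CACHE_POLICY = [
    ("/ads/",    {"Cache-Control": "no-cache"}),         # revalidate on every use
    ("/images/", {"Cache-Control": "max-age=2592000"}),  # 30 days, in seconds
    ("/",        {"Cache-Control": "max-age=300"}),      # pages: 5 minutes
]

def cache_headers(path):
    """Return the cache headers for the entity at the given URL path."""
    for prefix, headers in CACHE_POLICY:
        if path.startswith(prefix):
            return dict(headers)  # copy, so callers cannot alter the table
    return {}

for path in ("/index.html", "/images/logo.gif", "/ads/banner.gif"):
    print(path, cache_headers(path))
```

Each entity gets its own answer, which is the point: the page, its long-lived images, and its ads need not share a caching policy.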

Setting ads to be uncacheable is one way of attempting to count the hits accurately. It's a rather unfriendly way of doing it, though; it forces the reloading of a (usually) static object, every time.

A cache-friendly way of serving ads

One way to make ads extremely cache-friendly and still be able to count the hits is to use cacheable images, but use "302: Found" or "307: Temporary Redirect" to reach the ad. The browser makes a GET request to the server, is redirected, and makes a second GET request for the ad image, which a proxy or browser cache can serve if it already holds a copy. The server counts the redirects to count the hits on the ads.

307: Temporary Redirect is a response status code that asks the browser to send another GET request for the actual entity, but to keep asking for the original URI in future (unlike "301: Moved Permanently"). The URI to redirect to should be given in the Location field of the response and also in a short note in the body of the 307.

302: Found is also a suitable status code for this purpose, but many browsers handle it ambiguously. A note in RFC 2616 explains why "303: See Other" and "307: Temporary Redirect" were added to remove that ambiguity.

The advantage of this method is that the images themselves are cached, but you can still count the hits on the ad--and most caches do not cache redirects, so your hit-count is more accurate than for ads without Cache-control headers. The specification explicitly states that 307 is not cacheable unless stated otherwise.
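To make the redirect-and-count idea concrete, here is a minimal sketch built on Python's standard http.server module. The AD_TARGETS mapping, the placeholder path /ads/slot1, the images.example.com URL, and the names redirect_for, AdRedirectHandler, and serve are assumptions for illustration; it is not a production ad server.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from the placeholder URI used in the page
# to the cacheable ad image that should actually be displayed.
AD_TARGETS = {"/ads/slot1": "http://images.example.com/banner1.gif"}
hits = Counter()  # in-memory hit counts, keyed by placeholder path

def redirect_for(path):
    """Count the hit and return (status, headers, body) for a placeholder path."""
    target = AD_TARGETS.get(path)
    if target is None:
        status, body = 404, b"Not Found"
        headers = {"Content-Type": "text/plain"}
    else:
        hits[path] += 1
        status = 307
        # A short hypertext note naming the new URI goes in the body.
        body = ('<a href="%s">%s</a>' % (target, target)).encode("ascii")
        headers = {"Location": target, "Content-Type": "text/html"}
    headers["Content-Length"] = str(len(body))
    return status, headers, body

class AdRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers, body = redirect_for(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Run the redirector until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), AdRedirectHandler).serve_forever()
```

Calling serve() starts the redirector. Only the tiny 307 responses pass through it, so the hit counter sees every request that reaches the server, while the image itself can be served from any cache that holds it.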

You and your customers save bandwidth on the images, and your hit-count is as accurate as for images with No-cache headers. Everyone wins!

There is a latency cost, as the browser sends the GET request for the image, receives the redirect, and sends another GET request for the actual image. In most cases, this will be slight, especially if the browser is behind a cache that already contains the ad image.

If the image is cached, total bandwidth is reduced. If the image is not cached, total bandwidth is increased by the size of the second GET request and the 307--both extremely small.
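The trade-off can be put into rough numbers. The sketch below compares the bytes the ad server sends for an uncacheable ad against the redirect approach; the function name and every figure in the example (a 15,000-byte image, 500 bytes for each redirect exchange, 1,000 requests) are assumptions chosen only to show the arithmetic.

```python
def origin_bytes(image_bytes, redirect_bytes, requests, cached_requests):
    """Return (uncacheable_total, redirected_total) in bytes.

    uncacheable_total: every request fetches the full image.
    redirected_total:  every request costs one redirect exchange, and
                       only requests that miss the cache fetch the image.
    """
    uncacheable_total = requests * image_bytes
    misses = requests - cached_requests
    redirected_total = requests * redirect_bytes + misses * image_bytes
    return uncacheable_total, redirected_total

# Assumed figures: 900 of 1,000 requests find the image in a cache.
print(origin_bytes(15000, 500, 1000, 900))

# Worst case: no request finds the image in a cache.
print(origin_bytes(15000, 500, 1000, 0))
```

With these assumed figures the first case drops from 15,000,000 to 2,000,000 bytes, and the worst case rises only by the cost of the redirects, from 15,000,000 to 15,500,000 bytes.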

Apache implementation

In the page that contains the ad, use an image tag that displays the correct height and width for the banner. The image tag should point to a URI that doesn't actually exist, but that is used to handle the redirect.

The redirect is managed by the mod_alias directive Redirect. Ensure that httpd.conf is set to load mod_alias. If mod_alias is not available in your build, you may need to recompile Apache; otherwise, restart Apache after editing this file.

#
# Dynamic Shared Object (DSO) Support
#

LoadModule alias_module /usr/lib/apache/1.3/mod_alias.so

The Redirect directive may be set in any of the document realms. See Cache-Friendly Web Pages for a description of Apache document realms and how to set directives within realms.

For our purpose, either a 307 or a 302 (temp) redirect will serve. Set the first argument to the fake URL path the page refers to, and the second to the advertisement you actually want to display. The first must be an absolute path on the current host. The second must be a full URL.

Redirect 307 /path/in/page http://host.and/path/to/advertisement

or

Redirect temp /path/in/page http://host.and/path/to/advertisement

The redirections will be logged in the access log. From there, count hits for each advertisement with a log-analysis script.
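A log-analysis script for this purpose can be quite short. The following sketch assumes the access log uses the Common Log Format, where the quoted request line is followed by the status code; the function name count_ad_hits and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Matches the quoted request line and the status code that follows it,
# as in: "GET /path/in/page HTTP/1.0" 307
REQUEST_RE = re.compile(r'"GET (\S+) [^"]*" (\d{3})\b')

def count_ad_hits(log_lines):
    """Count temporary redirects (302 or 307) per requested path."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) in ("302", "307"):
            hits[match.group(1)] += 1
    return hits

# Invented sample lines in Common Log Format.
sample = [
    '10.0.0.1 - - [21/Mar/2002:10:00:00 +0000] "GET /path/in/page HTTP/1.0" 307 312',
    '10.0.0.2 - - [21/Mar/2002:10:00:05 +0000] "GET /index.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [21/Mar/2002:10:00:09 +0000] "GET /path/in/page HTTP/1.0" 307 312',
]
for path, count in count_ad_hits(sample).items():
    print(path, count)
```

To run it against a real log, pass an open file object instead of the sample list; the path to the log depends on your Apache configuration.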

Perl implementation

If you serve banners via a CGI script, you can write the redirection into the script. Return the Status and Location headers rather than returning the actual image.

In the page, include the image as a link to the CGI script, something like <img src="http://mybannerserver/banner.cgi">. This will call banner.cgi, which issues a redirect to the image that the browser then calls. You can then read your ad hits from your access logs, or record them as part of the script.

A sample banner.cgi, stripped to the barest minimum, is:

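#!/usr/bin/env perl
use strict;
use warnings;

# Choose the banner to serve.
sub Pick_a_banner
{
	# Use whatever algorithm you prefer
	# to choose the banner URI
	return "http://myserver/defaultbanner.gif";
}

my $banner_uri = Pick_a_banner();

# Send the redirect: status, target, then a blank line to end the headers.
print "Status: 307 Temporary Redirect\n";
print "Location: $banner_uri\n";
print "\n";
exit 0;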

Caveats and gotchas

If your advertisements are served from a Web server you have no access to, you may not be able to make the ads themselves cache-friendly. But you can still make your pages cache-friendly without affecting the ads.

Final words

This is just one method of making ads cache-friendly. The explanation of entities and the examples should let you modify your own system to be as cache-friendly as possible.

Jennifer Vesperman is the author of Essential CVS. She writes for the O'Reilly Network, the Linux Documentation Project, and occasionally Linux.Com.



Copyright © 2009 O'Reilly Media, Inc.