Published on O'Reilly (http://oreilly.com/)


A New Visualization for Web Server Logs

by Raju Varghese
02/08/2007

There are well over a hundred web server log analyzers (see the Google Directory for Log Analysis) and web statistics tools, ranging from commercial offerings such as WebTrends to open source ones such as AWStats. These take web server logfiles and display numbers such as page views, visits, and visitors, as well as graphs over various time ranges. This article presents the same logfile data in a very different way: as a 3D plot. By the end of this article, I hope you will agree that the visualization described here is a novel and useful way to view the content of logfiles.

The logfiles of web servers record information on each HTTP request they receive, such as the time, the sender's IP address, the request URL, and the status code. The items in each request are fairly orthogonal to one another: the IP address of a client has no relation to the URL that it requests, nor is the status code related to the time of the request. If that is the case, what better way to display these n columns from the logfiles than an n-dimensional plot?

When an administrator observes anomalous behavior on a web server, she reaches for web statistics reports, as they are usually the only available record of past activity. These often prove fruitless, mainly because web statistics present a primarily marketing-oriented view of web server activity. The next step is to take the raw logfiles apart with ad hoc scripts. The sheer mass of data makes it difficult to reduce it to a few numbers that reveal the cause of the problem. A further complication is that you may not know what you are looking for, other than that it is abnormal behavior. The approach this article takes is to visualize the raw data so that the cause or causes make themselves visible. It comes from the real-life experience of a client, where crippling performance problems appeared out of nowhere.

The Plot

The scatter plot in Figure 1 shows more than half a million HTTP page requests (each request is a dot) in 3D space. The axes are:

X: the time of the request, covering one 24-hour day
Y: the IP address of the client, mapped to a numeric rank
Z: the requested URL, mapped to a popularity rank

3D scatter plot of a good day
Figure 1. Scatter plot showing HTTP requests

If the plotted parameters were truly orthogonal, you could expect a random distribution: a flat, featureless plot. The parameters, however, are not completely independent of one another. For example, IP ranges in Italy may prefer the Italian pages on the website. Therefore, instead of a random plot, there are clusters in the 3D space. That is not unreasonable: the home page is probably the most visited page on a website. Studies (especially Jakob Nielsen on website popularity and Jakob Nielsen on traffic log patterns) argue convincingly that page popularity closely follows Zipf's law: a power-law curve with a long tail. Hence the dense horizontal layer at the bottom of Figure 1. The vertical rectangular planes are search crawlers: they request pages over the whole content space from a small number of IP addresses, and they do so over the whole day. Clustering along each of the three dimensions is therefore common.
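To see what Zipf's law implies for the shape of the plot, consider an idealized model in which the r-th most popular page receives traffic proportional to 1/r. A minimal sketch (in Python; the page counts are invented for illustration, not taken from the logs):

```python
# Idealized Zipf popularity: the weight of the rank-r page is 1/r.
def zipf_share(top, total):
    """Fraction of all requests that go to the `top` most popular
    pages, out of `total` distinct pages."""
    weights = [1.0 / r for r in range(1, total + 1)]
    return sum(weights[:top]) / sum(weights)

# With 10,000 distinct pages, the 100 most popular ones already
# attract roughly half of all requests under this model.
share = zipf_share(100, 10000)
```

That top-heavy distribution is exactly what produces the dense layer of popular pages at the bottom of Figure 1.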

The Case Study

The website of a client grew inexplicably sluggish one day. Since the web server, CMS, and auxiliary servers had run well for the preceding months, the only rational explanation was an unusual request pattern. The web log-analysis reports showed nothing out of the ordinary. Command-line scripts (with awk, sort, grep, and friends) running over the logfiles also revealed no anomalies. I then used Gnuplot to graph the requests in 3D space. (See also an excellent Gnuplot introduction.) Some time later, the 3D plot made the culprit evident.

3D scatter plot of a bad day
Figure 2. Scatter plot of a bad day

The thick pillar in the plot stands out like a sore thumb. It is a dense set of requests in a short time (about 100 minutes on the X-axis, which represents 24 hours), from a single IP address (Y-axis), covering the whole content space (Z-axis). Why should that cause trouble? Large-scale CMS servers generate content on the fly from a database. Caches usually handle most requests, so only the small number of requests not currently in the cache require database activity. On this particular CMS, the caches keep content for 15 minutes. When the client requested all of the pages in a short time, the high number of cache misses placed a heavy load on the database, and performance deteriorated. Search crawlers such as Yahoo Slurp and Googlebot do much the same thing, but they spread the load over a much longer period.
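To put rough numbers on that difference: with a 15-minute cache, a client that walks the entire content space forces close to one miss per page either way; what differs is how fast those misses hit the database. A back-of-the-envelope sketch (Python; the page count is a made-up figure for illustration):

```python
# Back-of-the-envelope: a client fetching every page exactly once
# forces roughly one cache miss per page (each page is either cold or
# expired by the 15-minute TTL). The database load is set by how
# quickly those misses arrive.
PAGES = 10000            # hypothetical size of the content space

def miss_rate(pages, minutes):
    """Cache misses per minute for a crawl of `pages` pages in `minutes`."""
    return pages / minutes

burst_rate   = miss_rate(PAGES, 100)        # the rogue client: 100 minutes
crawler_rate = miss_rate(PAGES, 24 * 60)    # a polite crawler: a whole day
```

Under these assumptions the rogue client generates on the order of 100 misses per minute, against roughly 7 for a crawler spreading the same work over a day.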

The Process

Now that you have seen the output, here's how to generate it. The input is, of course, an access logfile that has lines of data, one per HTTP request. A typical line from an Apache server conforms to the NCSA combined access logfile standard. (See the Combined Log Format description at apache.org.) Note that I've wrapped the long line:

83.214.62.64 - - [15/Jan/2006:21:12:29 +0100] "GET
/index.php?level=2 HTTP/1.1" 200 5854 "http://www.sphere.li/index.php"
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Gecko/20040914"
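As a cross-check of the format, here is how those fields can be pulled out of such a line. This is my own Python approximation of the combined format, not part of the article's script:

```python
import re

# Rough pattern for an NCSA combined log line: IP, identd, user,
# [timestamp], "request", status, bytes, "referer", "user-agent".
LINE = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+'
)

def parse(line):
    """Return (ip, time, url, status) or None if the line doesn't match."""
    m = LINE.match(line)
    if m is None:
        return None
    ip, time, method, url, status = m.groups()
    return ip, time, url, int(status)

sample = ('83.214.62.64 - - [15/Jan/2006:21:12:29 +0100] '
          '"GET /index.php?level=2 HTTP/1.1" 200 5854 '
          '"http://www.sphere.li/index.php" "Mozilla/5.0 (X11; U; Linux '
          'i686; en-US; rv:1.7.3) Gecko/20040914"')
```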

The Perl script at the end of the article takes sequences of these lines and condenses them to just what Gnuplot needs. Run it with an access logfile and redirect it to an output file, such as gnuplot.input, from the command line:

$ perl prepare-for-gnuplot.pl access_log > gnuplot.input

The output will be one line for each line of the access logfile. For the previous line from the access log, the corresponding output is:

15/Jan/2006:21:12:29 906 41 200

The fields in gnuplot.input, the output file of the Perl script, are date/time, IP rank (906), URL rank (41), and status code (200).
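The two rank columns are simple lookup tables: each distinct IP address is numbered after sorting numerically, and each URL is numbered by request count, most popular first. The same idea in a compact Python sketch (the sample requests are invented):

```python
from collections import Counter
from ipaddress import IPv4Address

# Toy request stream: (ip, url) pairs standing in for parsed log lines.
requests = [
    ("83.214.62.64", "/index.php"),
    ("10.0.0.5",     "/index.php"),
    ("83.214.62.64", "/about.html"),
    ("10.0.0.5",     "/index.php"),
]

# IP rank: position after sorting addresses numerically (not as strings).
ips = sorted({ip for ip, _ in requests}, key=IPv4Address)
ip_rank = {ip: n for n, ip in enumerate(ips, start=1)}

# URL rank: 1 = most requested, so popular pages sit low on the Z-axis.
counts = Counter(url for _, url in requests)
url_rank = {url: n for n, (url, _) in
            enumerate(counts.most_common(), start=1)}
```

Ranking URLs by popularity is what concentrates the frequently requested pages near the bottom of the plot.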

To display the sequence of lines in Gnuplot, give it the commands:

$ gnuplot
set style data dots
set xdata time
set timefmt "%d/%b/%Y:%H:%M:%S"
set zlabel "Content"
set ylabel "IP address"
splot "gnuplot.input" using 1:2:3

Variations

If the plot is too dense--as was the case for me--thin it down by telling Gnuplot to only use every nth data point. For example, I thinned Figure 1 by plotting every tenth point with the Gnuplot splot command:

splot "gnuplot.input" using 1:2:3 every 10

Figure 3 shows the corresponding scatter plot.

Thinned 3D scatter plot of a good day
Figure 3. Thinned scatter plot

Gnuplot makes it easy to focus on a part of the plot by setting the axes ranges. Figure 4 shows a small part of the Y- and Z-axes. The almost continuous lines that run parallel to the time axis are monitoring probes that regularly request the same page. Four of them should be clearly visible. In addition, I changed the eye position.

Monitoring probes visible after reducing the Y and Z ranges.
Figure 4. Reduced Y and Z ranges showing monitoring probes

Because real people need sleep, it should be possible to make out the diurnal rhythms that rule our lives. This is evident in Figure 4. The requests are denser from 08:00 to about 17:00 and quite sparse in the early hours of the morning.

Changing the viewing angle can give you a new point of view. Gnuplot lets you do this in one of two ways: with the set view command, or interactively with a click and drag of the mouse.
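The exact commands depend on the data, but the kind of zoom and rotation used for Figure 4 amounts to something like the following (the ranges here are invented placeholders, not the values behind the figure):

```
# Focus on a narrow band of IP ranks (Y) and popular pages (Z),
# then move the eye; set view takes rotations about the x- and z-axes.
set yrange [200:260]
set zrange [0:40]
set view 75, 30
replot
```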

The Pièce de Résistance

Because a 3D plot is difficult to perceive in depth without stereoscopic glasses, I used a few more manipulations to "jitter" the image so that the depth in the picture becomes visible. The plot in Figure 5 is an example. It was easy to generate with a few more Gnuplot commands, followed by GIF animation with ImageMagick.

An animated scatter plot
Figure 5. An animated GIF of the scatter plot that hints at the 3D structure
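One way to produce such an animation (not necessarily the exact commands behind Figure 5) is to render a series of frames at slightly different viewing angles and join them with ImageMagick's convert:

```
# Sketch: render frames while nudging the azimuth a few degrees each
# time (Gnuplot's gif terminal assumed to be available).
set terminal gif
set output "frame01.gif"
set view 60, 28
splot "gnuplot.input" using 1:2:3
set output "frame02.gif"
set view 60, 32
replot
# ...repeat for a few more angles, then from the shell:
#   convert -delay 20 -loop 0 frame*.gif rocking.gif
```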

Further Work

With Gnuplot 4.2, which is still in beta, it is now possible to draw scatter plots in glorious color. Initial tests show that using color for the status code dimension makes the plots even more informative. Stay tuned.
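For instance, if the fourth column (the status code) drives the palette, error responses stand out from the 200s. Something along these lines should work in 4.2; treat it as an untested sketch:

```
# Color each point by column 4 (the HTTP status code).
set palette defined (200 "green", 404 "blue", 500 "red")
splot "gnuplot.input" using 1:2:3:4 with points palette
```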

Conclusion

Though the 3D plots present no hard numbers or trend lines, the scatter plot described and illustrated above may give a more intuitive view of web server requests. Especially when diagnosing problems, this alternative way of presenting logfile data can be more useful than the charts and reports of a standard log-analyzer tool.

Code Listings

The Perl script:

#!/usr/bin/perl
#
# prepare-for-gnuplot.pl: convert access log files to gnuplot input
# Raju Varghese. 2007-02-03

use strict;
use warnings;

my $tempFilename    = "/tmp/temp.dat";
my $ipListFilename  = "/tmp/iplist.dat";
my $urlListFilename = "/tmp/urllist.dat";

my (%ipList, %urlList);

sub ip2int {
        my ($ip) = @_;
        my @ipOctet = split (/\./, $ip);
        my $n = 0;
        foreach (@ipOctet) {
                $n = $n*256 + $_;
        }
        return $n;
}

# prepare temp file to store log lines temporarily
open (TEMP, ">$tempFilename") or die "Cannot open $tempFilename: $!";

# reads log lines from stdin or files specified on command line

while (<>) {
        chomp;
        # field 9 of the whitespace-split line is the HTTP status code
        my ($ip, undef, undef, $time, undef, undef, $url, undef, $sc) = split;
        $time =~ s/\[//;
        next if ($url =~ /\.(gif|jpg|png|js|css)$/);
        print TEMP "$time $ip $url $sc\n";
        $ipList{$ip}++;
        $urlList{$url}++;
}

# process IP addresses

my @sortedIpList = sort {ip2int($a) <=> ip2int($b)} keys %ipList;
my $n = 0;
open (IPLIST, ">$ipListFilename") or die "Cannot open $ipListFilename: $!";
foreach (@sortedIpList) {
        ++$n;
        print IPLIST "$n $ipList{$_} $_\n";
        $ipList{$_} = $n;
}
close (IPLIST);

# process URLs

my @sortedUrlList = sort { $urlList{$b} <=> $urlList{$a} } keys %urlList;
$n = 0;
open (URLLIST, ">$urlListFilename") or die "Cannot open $urlListFilename: $!";
foreach (@sortedUrlList) {
        ++$n;
        print URLLIST "$n $urlList{$_} $_\n";
        $urlList{$_} = $n;
}
close (URLLIST);

close (TEMP);
open (TEMP, $tempFilename) or die "Cannot reopen $tempFilename: $!";
while (<TEMP>) {
        chomp;
        my ($time, $ip, $url, $sc) = split;
        print "$time $ipList{$ip} $urlList{$url} $sc\n";
}
close (TEMP);

Raju Varghese has a bachelor's degree in Electrical Engineering from BITS, Pilani (India) and a master's in Computer Science from the University of Texas, San Antonio.



Copyright © 2009 O'Reilly Media, Inc.