Bayesian Filtering with bogofilter and Sylpheed Claws
by Oktay Altunergil, 01/30/2003
In August 2002, Paul Graham published a paper suggesting that Bayes' probability theorem (see Resources) could be applied to the spam emails we receive. The gist of Graham's paper is that each word you receive in your emails -- including those that make up the email header -- carries a spam value between 0 and 1. This number is calculated by studying a large number of emails that are known to be spam versus another set of emails that are known to be legitimate. If a particular word only appears in spam emails, there is a high probability that the next time you see this word in an email message, it will be part of a spam message. Similarly, if a word, such as your secret nickname that only a few people know or the From: address of a coworker, tends to appear only in good emails, that word will have a higher probability of being present in a non-spam email message. Of course, we should score all of the words in a message and get a combined "spam probability value" for the whole message, so that an email from a friend trying to let you know about "a great business opportunity !!" does not go into your trash bin, and a spam email about "how to copy your DVDs" doesn't go into your good email folder just because it addressed you by your first name.
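As a rough sketch of the arithmetic, here are the formulas from Graham's paper; bogofilter's own calculation may differ from this, so treat it as an illustration of the idea rather than a description of the program. The symbols are chosen here for illustration: b_w and g_w are the number of times word w appears in the spam and good collections, and N_bad and N_good are the number of messages in each. Graham doubles the good counts to bias against false positives, clamps each word's value to between 0.01 and 0.99, and combines only the fifteen words whose values lie furthest from 0.5.

```latex
% Per-word spam probability (Graham, "A Plan for Spam"), before clamping:
p(w) = \frac{b_w / N_{\mathrm{bad}}}{2\,g_w / N_{\mathrm{good}} + b_w / N_{\mathrm{bad}}}

% Combined probability for a message with word probabilities p_1, \dots, p_n:
P = \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}

% Worked example with two words, p_1 = 0.99 and p_2 = 0.20:
P = \frac{0.99 \times 0.20}{0.99 \times 0.20 + 0.01 \times 0.80}
  = \frac{0.198}{0.206} \approx 0.96
```

Note that the combination is not a plain arithmetic average: one strongly spammy word can outweigh a mildly innocent one, as the worked example shows.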
What makes Bayesian filtering special is that false positives -- legitimate emails marked as spam -- are very rare. As Graham points out, spammers can fool every system we put in place, but they still have to deliver their commercial message. This message is exactly what causes them to shoot themselves in the foot. It is trivial to recognize spam email if you take a quick look at the subject and the message body. This action can be emulated very successfully using a Bayesian filter that learns on your behalf, applying acquired knowledge to your future emails. If you notice the filter is making a mistake, you can teach it not to do the same thing again. After a very short while, the filter will be almost bulletproof.
Shortly after Graham's article, a number of people implemented spam filters that use Bayesian algorithms. For this article we will look at bogofilter, written by Eric S. Raymond. We have chosen bogofilter because of its speed, which comes from its being written in C and using BerkeleyDB as its storage facility, as opposed to a plain text file. As long as we're picking software based on speed, I decided it would only make sense to pick Sylpheed (of the "claws" variety) as our email client to demonstrate bogofilter. (See my previous article about Sylpheed and Sylpheed claws.)
Installing bogofilter
It's fairly simple to configure and install bogofilter. You can either download the latest source package or find a package for your operating system. The latest version at the time of writing, 0.9.0.5, is available as an RPM or FreeBSD package. The Gentoo distribution also has an ebuild for it in its Portage package collection.
If you will be installing it from the source package, all you have to
do is download it into a temporary directory, decompress it, run
./configure && make, and then run make install as root in
the uncompressed source directory. Coincidentally, these are the generic
instructions to configure, compile and install a source package on Unix
and Linux systems. If something goes wrong, I suggest asking for
assistance from somebody with adequate experience. Often everything will
go as planned and the installation procedure will create the program
binaries and put them in /usr/bin/. It will also create a
sample configuration file (with which you need not concern yourself) and
place it in the /etc directory.
By default, bogofilter keeps its data in two database files called
goodlist.db and spamlist.db. These files are
stored in a .bogofilter directory in the user's home
directory. You need not create the directory or the files explicitly since
they will be created by bogofilter while training it.
Training bogofilter
As mentioned above, bogofilter, like all other Bayesian filters, does
its magic based on the principles of probability. For this reason you need
an archive of spam and non-spam emails. The more emails you have gathered,
the finer tuned your filter will be. I normally just ignore spam emails
instead of deleting them, so for me it wasn't very difficult to find
hundreds of spam emails in my incoming email directory in Sylpheed. We
will create two mail directories in Sylpheed and call one of them
SPAM and the other NONSPAM. If you disinfect
your regular incoming email directory by removing each and every spam
message, you can do without a dedicated NONSPAM directory. If
you choose to do so, make sure you keep this incoming directory free of
spam in the future too.
Before starting to train bogofilter, make sure there are at least 100 emails in each folder. This should provide a reasonable quantity and variety. If you don't have enough spam messages (if you delete them as you receive them or if you don't receive any -- those were the days!), you can download a batch of spam messages from a Bayesian spam filtering web site. I recommend against doing this, since every individual receives a different variety of spam messages, and what looks like spam to somebody else might actually be something you regularly receive as good mail. (Many people confuse spam with emails they once asked to receive but don't want anymore.) You will find that the spam accumulated over a few days will be enough to tune your filter. Better yet, keep training it as you go along. The result is a highly customized personal filter that will allow bogofilter to think and act just like you would.
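A quick way to check the "at least 100 emails" rule of thumb is to count the message files in each folder. The following is a minimal sketch; the function names count_messages and enough_messages are made up for this illustration, and it assumes that each message is stored as one regular file in the folder.

```shell
#!/bin/sh
# Hypothetical helpers (not part of bogofilter or Sylpheed).

# count_messages DIR: print the number of regular files in DIR.
# Hidden files and subdirectories are not counted.
count_messages() {
    dir=$1
    count=0
    for f in "$dir"/*; do
        # skip subfolders, and the literal pattern if the folder is empty
        [ -f "$f" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}

# enough_messages DIR [MIN]: succeed if DIR holds at least MIN messages.
# MIN defaults to 100, the figure suggested in the text.
enough_messages() {
    dir=$1
    min=${2:-100}
    [ "$(count_messages "$dir")" -ge "$min" ]
}
```

For example, enough_messages ~/Mail/SPAM || echo "collect more spam first" prints the reminder only when the folder is still too small.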
We will start with training bogofilter to recognize spam words. In
order to do this we will start a shell and go into the SPAM
directory. By default Sylpheed keeps its emails in a Mail
directory in the user's home directory. This directory contains all spam
messages, each in its own file. Sylpheed uses an identifying number for
each filename. The directory resembles:
grog oktay # cd ~/Mail/SPAM
grog SPAM # ls
1 108 117 126 135 144 153 162 171 180 19 199 26
We will need to feed the whole message text, header and body, into the
bogofilter command and mark them as spam by using the
-s option. Since the number of messages is irrelevant to the
Bayesian algorithm, we can run the command in one of two ways.
The following command feeds all spam messages into bogofilter at
once. The -v option increases the verbosity of the command
and prints out some useful information.
grog SPAM # cat * | bogofilter -s -v
# 93861 words, 3 messages
We can also invoke bogofilter once per message and have it process the messages individually, as can be seen from the partial output below.
grog SPAM # for i in *; do echo Processing Mail ID \#$i; \
bogofilter -s -v < "$i"; done
Processing Mail ID #1
# 279 words, 1 message
Processing Mail ID #10
# 113 words, 1 message
Processing Mail ID #100
# 498 words, 1 message
Processing Mail ID #101
# 685 words, 1 message
Processing Mail ID #102
Whichever method you use, bogofilter will create the
.bogofilter directory as well as a spamlist.db
database file. Please do not access this or the goodlist.db
file directly as they are both in a binary format. Repeat the above steps
in the ~/Mail/NONSPAM directory to create the non-spam list
database. Since these are non-spam files, you will need to
replace the -s option with the -n option, so that
the command is now bogofilter -n -v.
If everything goes as planned, you will now have both the good
words list goodlist.db and the spam words list
spamlist.db. We're ready to filter out spam.
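The two training passes can be wrapped in one small function so that the same loop serves both folders. This is only a sketch: train_folder is a hypothetical name, the BOGOFILTER variable defaults to the /usr/bin/bogofilter path used in this article, and the only bogofilter options it relies on are the -s and -n options described above. It calls bogofilter once per file, which matches the loop shown earlier that reports one message per file.

```shell
#!/bin/sh
# Hypothetical helper: register every message file in a folder.
# BOGOFILTER can be overridden from the environment.
BOGOFILTER=${BOGOFILTER:-/usr/bin/bogofilter}

# train_folder DIR FLAG   (FLAG is -s for spam, -n for non-spam)
train_folder() {
    dir=$1
    flag=$2
    trained=0
    for msg in "$dir"/*; do
        # skip subfolders, and the literal pattern if the folder is empty
        [ -f "$msg" ] || continue
        # one invocation per file; the exit status is not checked here
        "$BOGOFILTER" "$flag" < "$msg" || true
        trained=$((trained + 1))
    done
    echo "$trained messages registered with $flag from $dir"
}
```

With this in place, train_folder ~/Mail/SPAM -s followed by train_folder ~/Mail/NONSPAM -n would build both word lists.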
Marrying bogofilter to Sylpheed
If you run bogofilter manually on a bunch of text (i.e., an email
message), it will return an exit status of 0 if the email is found to be
spam or 1 if it is found to be good. However, it would be inconvenient to run this
command manually for every email that we receive. Instead we will
configure Sylpheed to run the command on our behalf each time it receives
an email, before delivering the message to the appropriate
directory. Using Sylpheed-claws, this is done by selecting
Configuration from the menu and clicking on
Filtering. There are three fields to fill in. The first field is
the Condition. Here we execute bogofilter with the current
incoming email. Enter the following line:
execute "/usr/bin/bogofilter < %F"
The second field determines which action to take if the email is found to
be spam. I recommend leaving this at Move to move the spam email
to the SPAM folder. You could also Delete the email
or just mark it as spam and deliver as usual but I don't recommend either. If
you choose Move as the action, then you should also specify the
mail directory to which to move the messages. Using the Select...
button, choose the SPAM folder we created earlier. Finally,
activate the new filtering rule by clicking Register. Figure 1
shows what the filtering rule should look like.

Figure 1 -- the filtering configuration window
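To see what the filtering condition sees, you can run the same test by hand on a single message file. The sketch below assumes the exit statuses documented in the bogofilter manual page (0 for spam, 1 for non-spam) and treats anything else as an error; classify is a hypothetical name, and BOGOFILTER defaults to the path used in the rule above.

```shell
#!/bin/sh
# Hypothetical helper: show bogofilter's verdict for one message file.
BOGOFILTER=${BOGOFILTER:-/usr/bin/bogofilter}

# classify FILE: print "spam" or "good" based on the exit status.
classify() {
    status=0
    "$BOGOFILTER" < "$1" > /dev/null || status=$?
    case $status in
        0) echo spam ;;
        1) echo good ;;
        *) echo "unexpected exit status $status" >&2
           return 2 ;;
    esac
}
```

For example, classify ~/Mail/inbox/42 would print the verdict for that message; the folder and message number here are placeholders for any message file in your mail directory.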
Keeping bogofilter Sharp
The configuration we have implemented so far will probably catch more spam than you would expect. However, the key to success is keeping bogofilter on its toes at all times. Keep training the filter so that it can deal with new types of spam messages and keep identifying non-spam messages for years to come. It would be really convenient to have a "register as spam" button on all email clients. In the future they will probably have this. For now, we have to emulate this functionality ourselves. It's really pretty simple.
We will move spam messages that bogofilter misses to the
SPAM directory manually. After you do this, make bogofilter
process the message by running it with the -s option again.
It will be too much work to do this manually, so we will create a cron job
that automates this process. This way we can keep moving spam messages to
the SPAM folder as we receive them (effectively scheduling
them to be marked as spam) and rely on the cron job to take care of the
rest for us. You might also want to copy a bunch of good emails into the
NONSPAM directory every once in a while since non-spam words
need to be up to date as well. Here's what a typical script to train
bogofilter every day may look like:
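#!/bin/sh
# /home/oktay/bin/bogolearn.sh
# train bogofilter with new spam and non-spam
# user is assumed to be 'oktay'
BOGOFILTER="/usr/bin/bogofilter"
GOODDIR="/home/oktay/Mail/NONSPAM"
SPAMDIR="/home/oktay/Mail/SPAM"
# register everything in the SPAM folder as spam;
# stop if the folder is missing rather than train on the wrong files
cd "$SPAMDIR" || exit 1
cat * | "$BOGOFILTER" -s
# register everything in the NONSPAM folder as non-spam
cd "$GOODDIR" || exit 1
cat * | "$BOGOFILTER" -n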
The following crontab entry will make this script run every
morning at 3:30.
30 3 * * * /home/oktay/bin/bogolearn.sh
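One thing to be aware of is that a nightly run which feeds whole folders to bogofilter registers every message again, old ones included, so their words are counted once more each night. A possible variant, sketched below under stated assumptions, remembers a fingerprint of each message it has already registered and skips those on later runs. The name train_new and the LEDGER file location are inventions for this example, not bogofilter features; the fingerprint is simply the checksum and size printed by the standard cksum utility.

```shell
#!/bin/sh
# Hypothetical variant of bogolearn.sh: register each message only once.
BOGOFILTER=${BOGOFILTER:-/usr/bin/bogofilter}
# Assumed location for the record of messages already registered.
LEDGER=${LEDGER:-$HOME/.bogofilter/trained.list}

# train_new DIR FLAG   (FLAG is -s for spam, -n for non-spam)
train_new() {
    dir=$1
    flag=$2
    touch "$LEDGER" || return 1
    for msg in "$dir"/*; do
        # skip subfolders, and the literal pattern if the folder is empty
        [ -f "$msg" ] || continue
        # cksum prints "<checksum> <size>" when reading standard input
        key="$flag $(cksum < "$msg")"
        if grep -qxF -e "$key" "$LEDGER"; then
            continue    # already registered on an earlier run
        fi
        # the exit status is not checked here
        "$BOGOFILTER" "$flag" < "$msg" || true
        printf '%s\n' "$key" >> "$LEDGER"
    done
}
```

The cron script would then call train_new /home/oktay/Mail/SPAM -s and train_new /home/oktay/Mail/NONSPAM -n. Because the fingerprint depends on the message content rather than its file name, renumbering messages inside a folder does not cause them to be registered twice.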
This is all there is to it. You will see that your filter gets better and better every day. You might even start hoping that you will receive more spam just to see how cool bogofilter is.
Resources
- Paul Graham's article
- Bogofilter homepage
- Who is Thomas Bayes?
- Bayes' Probability Theorem
- Sylpheed homepage
- Sylpheed-claws homepage
Oktay Altunergil works for a national web hosting company as a developer concentrating on web applications on the Unix platform.