Spam Busters

by Richard Koman

Dear Friend,

Forgive my indignation if this message comes to you as a surprise and if it might offend you without your prior consent and writing through this channel.

I am DR. USMAN DAN FODIO, The Chairman, Contract Awarding Committee of the ECONOMIC COMMUNITY OF WEST AFRICAN STATES (ECOWAS) with Headquarters in Lome, Togo. I got your information in a business directory from the Togolaise Chamber of Commerce and Industries when I was searching for a reliable, honest, and trustworthy person to entrust this business with. I was simply inspired and motivated to pick your contact from the many names and lists in the directory.

After discussing my view and your profile with my colleagues, they were very much satisfied and decided to contact you immediately for this mutual business relationship. We wish to transfer the sum of USD 25,000,000.00 (Twenty-five million United States dollars only)into your personal or company's bank account.

Everybody recognizes this famous scam as spam. (For an astounding collection of the endless variations of this message, see the Cyber Criminals Most Wanted site.) It's a classic spam that's easily caught by most of the anti-spam filters out there. These days, spammers use all kinds of tricks to try to get their messages past the filters and into your mailbox. For example, would your spam filter snag this message--taken from Dr. John Graham-Cumming's Spammer's Compendium page--as spam?

<table cellpadding=0 cellspacing=0 border=0><tr>
<td><table cellspacing=0 cellpadding=0 border=0><tr><td>
<font face="Courier New, Courier, mono" size=2>
 <br>U<br> <br>O<br>a<br> <br>D<br>u<br>a
<br> <br>N<br> <br>B<br>d<br> <br>N<br> 
<br>C<br> <br>C<br>w<br> <br>1<br> <br> 
<br> <br>1<br> <br>C<br>S<br></font></td></tr></table></td>
<td><table cellspacing=0 cellpadding=0 border=0><tr><td><font
face="Courier New, Courier, mono" size=2>
   <br> N <br>   <br>bta
<br>nd <br>   <br>ipl<br>niv<br>nd <br>
   <br>o r<br>   <br>ach<br>ipl
<br>   <br>o o<br>   <br>onf<br>
   <br>ALL<br>ith<br>   <br> - 
<br>   <br>   <br>   <br>
 - <br>   <br>all<br>und<br></font></td></tr></table></td>
<td><table cellspacing=0 cellpadding=0 border=0><tr><td><font
face="Courier New, Courier, mono" size=2>
   <br>I V<br>   <br>in <br>the
<br>   <br>oma<br>ers<br>lif<br>   <br>equ
<br>   <br>elo<br>oma<br>   <br>ne <br>
   <br>ide<br>   <br> NO<br>in <br>
   <br>3 1<br>   <br>   
<br>   <br>2 1<br>   <br> 24<br>ays
<td><table cellspacing=0 cellpadding=0 border=0><tr><td><font face="Courier
New, Courier, mono" size=2>
  <br> E<br>  <br>a <br> a<br>  
<br>s <br>it<br>e <br>  <br>ir<br>  <br>rs<br>s
 <br>  <br>is<br>  <br>nt<br>  <br>W 
<br>da<br>  <br> 2<br>  <br>  <br> 
 <br> 2<br>  <br> h<br> a<br></font></td></tr></table></td>

Here's the rendered HTML:
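Rendered, the table columns sit side by side, so each line of the message is spread across one `<br>`-separated fragment per column. A filter that reads the raw markup column by column sees only scraps. As a rough illustration, the minimal Python sketch below rejoins such fragments row by row. The function name and the simplified input format (one string per column) are assumptions made for this example, not part of any tool mentioned in this article.

```python
import re

# Matches <br>, <BR>, <br/> and <br /> separators.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

def read_across_columns(columns):
    """Rejoin text that was split into vertical slices, one slice per column.

    `columns` is a list of strings, each holding one column's fragments
    separated by <br> tags (a simplified, hypothetical input format).
    """
    split_columns = [BR_TAG.split(column) for column in columns]
    # zip(*...) groups the Nth fragment of every column: one rendered row.
    return ["".join(row) for row in zip(*split_columns)]

# Two non-blank rows' worth of fragments copied from the four columns above.
sample = ["U<br>O", " N <br>bta", "I V<br>in ", " E<br>a "]
rows = read_across_columns(sample)
# rows == ["U N I V E", "Obtain a "]
```

A real message would first need its table cells extracted with an HTML parser; this sketch only shows the row-wise reassembly step.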

Everybody knows spam is out of control, and every day it seems to get worse. If it's an irritant to end users, it's a major drain on businesses and ISPs. Ferris Research estimates that spam costs American businesses $10 billion per year and that fully half of email traffic will be spam by 2008. Clearly the time is right to do something about it. And it's not just the cost of wasted bandwidth and disk space. Companies will likely face sexual discrimination suits based on employees receiving pornographic spam at work.

The rise of groups like the IRTF's Anti-Spam Research Group and Jamspam indicates that the industry seems to be moving towards dealing with the onslaught. But, says Jesse Dougherty, director of development at ActiveState, these groups are dealing with architectural changes and with accommodating the needs of legitimate bulk mailers. No one is dealing with the needs of the recipients of spam, specifically the enterprises that bear the costs and liabilities of receiving spam.

Thus, ActiveState recently brought together a blue-ribbon panel of spam busters to develop new ways to make anti-spam software more effective and appropriate at the enterprise level. I talked with most of the members of the Anti-Spam Task Force by phone recently. On the call were Dougherty; Dr. John Graham-Cumming, creator of the popular open source, Perl-based, Bayesian mail-filtering program POPFile; Tim Peters, creator of SpamBayes, a Python-based, open source Bayesian email classifier; and Jason Rennie of MIT's Artificial Intelligence lab and creator of the open source tool iFile, an automated email-classification system. Unable to join us was Gary Robinson, an innovator in collaborative filtering.

Richard Koman: What is the size of the problem and why create a task force at this time?

Jesse Dougherty: There are two other groups working on the spam problem right now. There's the IRTF, which is focusing on the problem of consent and some of the architectural changes required to support that, changes to the DNS structure and signatures, and so on. That's definitely an important part of the problem. Then there's Jamspam, which, by highlighting the false positive problem, is there to alert the anti-spam software community that legitimate bulk mailers would like them to stop blocking their mail. That's really the arena where the ISPs and the bulk mailers are getting together to figure out best practices. Left out of the picture is anyone representing the recipients' requirements: what I want to receive and how I enforce that, whether I'm an enterprise or an at-home recipient. We think that getting this group together, the people who are really the leaders in actually developing content tools, would be complementary to the other two groups.

John Graham-Cumming: I say now because spam is a household word. Every single person gets spam; it's become an epidemic. And so it's time for people to try to stop it. And ActiveState has taken the lead on the enterprise side.

Tim Peters: The increasing volume is extremely visible to anyone who uses a computer and has any sort of access over the web for people to get at their email address. I have a very high tolerance for spam and I get over 100 of them a day now and it's an increasing drag on my ability to enjoy my computer or to do my work. I'm personally motivated to find some way to stem this tide of crap.

Koman: Is there a difference between "spam" and unsolicited mail?

Dougherty: I think there's two different problems and I'd like to just define spam in some meaningful way. Spam is not a problem; it's the result of a problem. The problem is that people don't have tools to enforce their receipt policies, whether at home or at an enterprise. Spam is the messages that are getting in. It's the fact that you can't say that "I will enforce this level of consent required to send to me," or "I will only accept content of this type," or "I won't accept mail sent in bulk that is of this content." They can't enforce their specific definition of what is spam or unwanted mail.

Koman: So the definition of spam is?

Dougherty: I've been trying to push the three C's definition, which is consent, content, and circulation. You need to have those three pieces of information about a message before you can make a decision about whether you'll receive it. IRTF is contributing the consent part, and the circulation part you have to guess at, and then finally, the big part is content--what is this message? And that's what we can help provide.

Koman: OK, how are you guys tackling the content side?

Graham-Cumming: Our current thinking is that some forms of adaptive filtering--the most buzzword-compliant at the moment is Bayesian--will be an effective tool at dealing with the content side of things because it's very good at recognizing particular types of content in email and distinguishing it from other types, for example, distinguishing spam from "not-spam."

The real thing is that those tools are quite effective for the individual end user and you can download many free ones, but in the enterprise setting, say a 100,000-person organization, it's not viable for every single person to manage their own spam filter on their desktop. And so what we're doing on the task force is really thinking about how to apply those same techniques to the IT department, so the IT department can have effective tools to deal with spam across the enterprise.
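To make the idea of adaptive, Bayesian-style filtering concrete, here is a minimal Python sketch of a two-class token classifier with add-one smoothing. It is a textbook naive Bayes toy written for this article, not the algorithm used by POPFile, SpamBayes, iFile, or any ActiveState product. The class name, the whitespace tokenizer, and the training phrases are all assumptions.

```python
import math
from collections import Counter

class TinyBayes:
    """Toy naive Bayes spam scorer (illustrative only)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, label, text):
        # label is "spam" or "ham"; tokens are lowercase whitespace-split words.
        self.token_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocabulary = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        spam_total = sum(self.token_counts["spam"].values())
        ham_total = sum(self.token_counts["ham"].values())
        # Start from smoothed prior odds, then add each known token's evidence.
        log_odds = math.log((self.message_counts["spam"] + 1)
                            / (self.message_counts["ham"] + 1))
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # never-seen tokens carry no evidence either way
            p_spam = (self.token_counts["spam"][token] + 1) / (spam_total + len(vocabulary))
            p_ham = (self.token_counts["ham"][token] + 1) / (ham_total + len(vocabulary))
            log_odds += math.log(p_spam / p_ham)
        # Convert log odds back to a probability between 0 and 1.
        return 1 / (1 + math.exp(-log_odds))

classifier = TinyBayes()
classifier.train("spam", "transfer your funds now")
classifier.train("ham", "meeting notes attached")
print(classifier.spam_probability("transfer funds"))
```

Because the classifier only counts tokens it has been shown, retraining on new mail is how it adapts. That is the property the panelists are relying on.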

Jason Rennie: Probably the next step after choosing an effective filtering tool and trying to adapt it to the enterprise level is figuring out what aspects of email you need to look at in order to identify the spam. There are lots of things that spammers do that let you identify what they're doing. A lot of it has to do with trying to cover up what they're doing. They'll break up words with spaces to make it so that you can read a word but a filter will have trouble identifying it.
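The spaced-out-word trick Rennie describes can be partly undone before classification. The sketch below, in Python, collapses runs of three or more single letters separated by single spaces and reports how many runs it found, so the count itself can serve as a clue. The regular expression and function name are assumptions for illustration. The pattern is deliberately crude and would also collapse an innocent sequence such as "a b c".

```python
import re

# Three or more single letters separated by single spaces, e.g. "V i a g r a".
SPACED_WORD = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

def collapse_spaced_words(text):
    """Return (text with spaced-out words rejoined, number of runs found)."""
    return SPACED_WORD.subn(lambda match: match.group(0).replace(" ", ""), text)

print(collapse_spaced_words("buy V i a g r a today"))
# ('buy Viagra today', 1)
```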

Koman: Yes, I was just looking at John's Spammer's Compendium page. I was looking at this very complex example you have; it starts with base64 encoding and goes from there.

Rennie: Oh, that's a great one.

Graham-Cumming: And it ends up with a very polite message.

Koman: Yes, something about dogs slurping.

Peters: I'd like to add a little to this. I think the technology to identify spam has made enormous strides in the past year. For example, people who download the project I've been working on report they don't have a spam problem anymore, but there are two problems with it. One is that it takes several megabytes of storage space and if you are talking about that 100,000-user organization, we're talking about terabytes of storage. The other is that integration with mail clients is very quirky. There are a lot of different mail clients. They don't have a common programming interface; some, like Outlook Express, have no programming interface whatsoever. And in order to use this technology effectively now individual users have to be awfully savvy, they have to know what they're doing.

What ActiveState is trying to do is make this more of a server-side thing, where users aren't that aware of it and they don't have to be so involved in the technical details. I think the technology is there but it is not usable by the masses yet.

Dougherty: Yes, it has to be applied in such a way that it lowers the cost of management and the cost of ownership of a system like that for a large set of users, whether from an ISP or a large enterprise.

Peters: Yeah, the technology's there but it's economically infeasible for a large organization to use it.

Koman: Spam strikes many people as a self-mutating virus; that no matter what filters you set up, they're always working around them. Is that inaccurate?

Dougherty: There's a limited space of innovation that they have, in that it has to be delivered in a MIME message. And, you know, some of the tools they're using now are going to be taken away from them. For instance, Outlook 2003 is going to be pulling support for HTML email. People are moving away from supplying them with so many of these abilities. So it's not an infinite space they have to innovate in.

Peters: This might be a minority view here, but from what I've seen of spammers they're not all that technically good. They've got a handful of tricks you see over and over again; they're not hard to spot. Spam does evolve but usually in what strikes me as trivial ways.

Graham-Cumming: I think they're buying software from some people who are educated in the way in which email works and then they use that software repeatedly. So as soon as you see a new mutation, it's extremely easy to train a filter and just block all that mail.

Peters: And a filter can often learn that all on its own without human intervention. Some of the software is really surprising. We collect header clues, for example, and we discovered that the case of keywords in RFC 822 headers is highly significant. In MIME version, in particular, if it's "MiME," it's just an extremely strong spam clue, apparently due to the software the spammers are using.

They create clues by accident. They don't even know they're doing it. We've preserved case in the RFC 822 header words just because it turns out that the spammer software is bizarre. It often creates them in all uppercase and nobody else does. Very weird, and I'm still seeing it. They haven't caught on to that.
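One way to picture the header clue Peters describes is to turn each header name into a token without normalizing its case, so that an odd spelling becomes its own feature for the classifier to learn. The following Python sketch uses the standard library `email` package, which keeps header names as they appeared in the message. The token prefix and function name are invented for this example and are not taken from SpamBayes.

```python
from email import message_from_string

def header_name_tokens(raw_message):
    """Return one token per header, keeping the header name's original case."""
    message = message_from_string(raw_message)
    # Message.keys() yields header names exactly as written in the message.
    return ["header-name:" + name for name in message.keys()]

raw = "MiME-Version: 1.0\nSubject: hello\n\nbody text\n"
print(header_name_tokens(raw))
# ['header-name:MiME-Version', 'header-name:Subject']
```

A case-folding tokenizer would merge "MiME-Version" with "MIME-Version" and throw the clue away, which is why preserving case matters here.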

Dougherty: So let's not tell them!

Rennie: The Bayesian techniques are nice in that they can ... if you give them the capability to look at, say, weird casings in headers, then you don't actually need to change the machinery at the client's end. Your filter at the client's end can simply be seeing the stream of emails that come through and account for the fact that emails have these weird casings and change on the fly to catch that without major structural changes.

Koman: What about the question of overly strong filters, where desirable mail is getting filtered out?

Peters: There certainly can be. One of the persistent false positives in my database ... you're familiar with the Nigerian scam type of spam?

Koman: Yeah, I've gotten about 100 of those in the past several weeks.

Peters: Right, so sometimes somebody will just forward you a copy of that with a comment like, "Hey, you think this is spam?" And in the meantime it has 500 instances of the words "transfer your funds," and under my system that's almost certainly going to be classified as spam.

Dougherty: Things like that will become less and less of a problem if the consent and sender authentication stuff takes off.

Peters: That's a weakness of the kind of system I'm putting out. Every clue is treated with equal weight and there is no white-list, there is no blacklist. It would be better to combine it with other techniques.

Dougherty: The false-positive problem really comes down to, for the enterprise customer, users being able to choose who they're going to white-list based on the behavior of that white-listed sender. And that really comes down to who treats us well when we give them our addresses? So the enterprise needs to be given information about who the good guys are when it comes to legitimate marketing messages and who the bad guys are. And that's a different initiative.

Koman: Is that something you're working on?

Dougherty: Yes, we actually have an initiative to test the behavior of bulk mailers as they receive addresses to see what they do with them. And we'll be publishing a good guys list. Not a bad guys list because that would be too ...

Koman: Too long, yeah.

Peters: In the classifier I'm working on, there's three classifications. There's spam, there's not-spam, and there's "gosh I don't know what to do with this." The first time I do business online with a new company, the first one or two messages I get from them are often full of marketing material, and even though I wanted it, it's very likely to score as unsure.
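The three-way outcome Peters describes can be sketched as two cutoffs on a spam score between 0 and 1. The Python below is a minimal illustration. The cutoff values and the label names are assumptions chosen for the example, not settings taken from his classifier.

```python
def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a spam score in [0, 1] to 'spam', 'ham', or 'unsure'.

    The cutoffs are hypothetical; a wider gap between them sends more
    mail to 'unsure' and fewer messages to a wrong confident verdict.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

print(classify(0.95), classify(0.5), classify(0.05))
# spam unsure ham
```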

Rennie: The key to addressing false-positive issues is twofold. One that Jesse just talked about is formalizing what companies are willing to accept, and writing those rules into the filters rather than just having this black box that says spam or not-spam. Another thing that can help ... one issue with a lot of the techniques that are out there, the Bayesian filters, like Tim was saying, is they will weight everything identically, so if 90 percent of the message looks like spam, they will classify it as spam.

I've done a lot of work in the field of classification and there are more advanced classification techniques that will weight the features of a message in different ways, so if there's a lot of spam material in an email, but you realize it is coming from a sender that you talk to, or something like that, it will get weighted very heavily ...
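As a rough picture of what weighting features differently could mean, the Python sketch below scores a message as a weighted sum of its features, with a known sender given a large negative weight. It is an illustration of the general idea only, not Rennie's method. The feature names, the weights, and the threshold are all invented, and a real system would learn the weights from data.

```python
# Hypothetical weights: positive pushes toward spam, negative toward not-spam.
WEIGHTS = {
    "phrase:transfer your funds": 2.0,
    "phrase:bank account": 1.5,
    "markup:font-size-0": 3.0,
    "sender:known-correspondent": -6.0,
}

def weighted_score(features, weights=WEIGHTS):
    """Sum the weights of the features present; unknown features count as 0."""
    return sum(weights.get(feature, 0.0) for feature in features)

def looks_like_spam(features, threshold=0.0):
    return weighted_score(features) > threshold

scam = {"phrase:transfer your funds", "phrase:bank account"}
forwarded_by_friend = scam | {"sender:known-correspondent"}
print(looks_like_spam(scam), looks_like_spam(forwarded_by_friend))
# True False
```

Under these made-up weights, the forwarded scam from a regular correspondent is let through even though most of its content looks like spam.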

Koman: Can you talk about the idea of scanning the content of the message before it actually gets dumped into your inbox? How would that work?

Dougherty: Well, currently ActiveState PureMessage runs on the SMTP gateway, so as the actual SMTP transaction is occurring, we review and rebuild the messages, scan them, look for spammy features. At that point, administrators can set their policies so that if this is clearly spam or unacceptable email, they can just reject the SMTP discussion at that point.

Koman: One thing that comes to mind in this conversation is the Intel case regarding the ex-employee who's been sending messages to all of Intel.

Rennie: That's actually a very interesting case and a very positive thing for ActiveState's strategy, which is to give more power to the companies, to the people who actually run the networks. That court case said since this is your internal network you're allowed to say what gets in. Giving people control at the gateway is a good thing, is a realistic thing, and it's very important.

Koman: That case is actually about to be argued at the California Supreme Court, so the final decision there has yet to be issued. I wonder if you want to talk about the social aspect; is it always appropriate for the company to have control of the email that occurs there?

Dougherty: The company has an obligation to monitor the content of email the same way they have an obligation to monitor the content on a wall in an office, so that if someone were to come in and post pornography on the office walls, there would be an issue of liability; the same thing happens if employees start to receive offensive content via their email. There's very little distinction between the locations of that pornography from a legal point of view. They do have some responsibility to provide a safe workplace.

So that's one side of it. Now, if you look at the legal arguments in this case, they really have to do with, does this person have the consent to send to these people? Now it's arguable that he had it during the time that he worked there but it was essentially revoked when he was let go, and so at that point he no longer had the right to use the company network to communicate with employees. Now that will apparently be where they'll end up arguing a lot because some person off the street has the right to send email to them as well.

Koman: I don't really want to argue the merits of the case but it brings up certain issues around free speech and whether the company has the right to absolutely control what kind of speech travels over the network.

Dougherty: I think that when you couple it with the company's liability, the company has to own it.

Koman: Are you seeing anything new or interesting recently?

Graham-Cumming: The most recent one I saw is the black hole, which is on my list. It's pretty much a spacing-out-words trick. What spammers do is they put a space in it, then they specify that space is in FONT SIZE=0.

Peters: Oh, that's clever.

Graham-Cumming: It actually uses the &nbsp; HTML entity, so it's not even a space; it's a nonbreaking space.

Peters: I wouldn't notice that; our classifier strips those out before we even look at the message.

Graham-Cumming: There you go, that's a very effective way of dealing with it, just get rid of HTML completely.

Koman: And just stripping out the HTML completely, does that remove the issue?

Graham-Cumming: Yeah, you can strip out the HTML and that will give you the word they're trying to disguise. The one I saw, unsurprisingly, was the word "Viagra." The other thing you can do is spot the use of FONT SIZE=0 and say, that's an interesting feature to hang onto because who the hell is using FONT SIZE=0 except a spammer?

Dougherty: The second case, where you actually identify the spammer's technique, will tend to block that technique forever, as opposed to just trying to find what they're trying to say. So you'll catch a lot more spam blocking FONT SIZE=0 than blocking Viagra.
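Both approaches from this exchange, stripping the markup to recover the disguised word and keeping the FONT SIZE=0 technique itself as a feature, can be sketched with Python's standard `html.parser` module. The class name, function name, and feature label below are assumptions for illustration. The sketch only handles zero-size `<font>` tags, not CSS or other ways of hiding text.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text outside zero-size <font> tags and note when they occur."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.features = set()
        self._font_hidden = []  # one flag per open <font>: True if size is 0

    def handle_starttag(self, tag, attrs):
        if tag == "font":  # the parser lowercases tag and attribute names
            hidden = (dict(attrs).get("size") or "").strip() == "0"
            if hidden:
                self.features.add("font-size-0")
            self._font_hidden.append(hidden)

    def handle_endtag(self, tag):
        if tag == "font" and self._font_hidden:
            self._font_hidden.pop()

    def handle_data(self, data):
        # Drop anything nested inside a zero-size font, such as the &nbsp;.
        if not any(self._font_hidden):
            self.text_parts.append(data)

def extract(html):
    """Return (visible text, set of suspicious-markup features)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.text_parts), parser.features

print(extract("Vi<FONT SIZE=0>&nbsp;</FONT>agra"))
# ('Viagra', {'font-size-0'})
```

The recovered word feeds the content classifier as usual, while the `font-size-0` feature captures the technique regardless of which word it is hiding.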

Koman: So what does the future look like? What will the task force be delivering to the enterprise?

Dougherty: We'll be providing white papers about the behavior of spammers and spam-analysis techniques over the next quarter. ActiveState will be releasing a new anti-spam engine that's being designed with input from this group. You'll also see as-needed participation in the IRTF and Jamspam, if there are ways we can move that forward, again, representing the recipient in some meaningful way.

Richard Koman is a freelance writer and editor based in Sonoma County, California. He works on SiliconValleyWatcher and ZDNet blogs, and is a regular contributor to the O'Reilly Network.
