Linux.com

Feature: Enterprise Applications

Bayesian spam filtering for the masses

By Corrado Cau on October 28, 2003 (8:00:00 AM)


Spam, or unsolicited commercial e-mail, is now a sad part of everyday life online. Research companies estimate that more than 50% of worldwide e-mail traffic is spam. As a result, it's becoming increasingly difficult and time-consuming to sort legitimate e-mail from the deluge of commercial messages flooding our inboxes. But there are ways to fight back. In this series, we'll walk through choosing and setting up a highly effective package for screening out spam.

The first successful attempt to can spam was made by means of public shared blacklists, where people around the world contributed sample junk messages and Internet (IP) addresses. Mail transfer agents (MTAs) could check the lists and reject known messages and addresses. At the user level, people started using the keyword-based filters offered by e-mail clients. If a message contained an unwanted word, a filter action rule could send the entire message to the trash bin automatically.

Professional spammers began circumventing these defenses in two ways. They send their messages from freshly hacked, ever-changing computers around the world, so that their IP addresses aren't yet blacklisted, and they exploit modern e-mail standards to garble and hide the real message content. For instance, recent junk mail uses a mix of HTML code, base-64 MIME encoding, attachments, pictures, and other tricks to avoid early detection of the message contents. Since decoding all of this is computationally expensive, spammers hope that the real content of their messages won't be spotted and discarded at the MTA level, in transit.

At the same time, junk mail is getting either more verbose or minimalistic to excess, and it often masks and dilutes the real content by padding the message with random text or excerpts from books or magazines. These new tricks are putting rule-based filtering out of business, and even public blacklists are becoming less effective because their information becomes obsolete almost immediately.

Today we need human-like judgment to effectively discriminate junk mail from real messages. That's quite easy for human beings, but a nightmare for computers.

A solution: Bayesian filters

Enter Bayesian filters -- programs that learn by example to recognize legitimate mail for their owners. These statistical filters learn and remember when a specific word, group of words, or stream of bytes is associated with what the user perceives as either junk or good e-mail. I say 'what the user perceives as' because the matter is subjective and personal; the concept of junk mail changes with the social context, age, and inclinations of the recipient.

Practically speaking, Bayesian filters accumulate statistics every time a user indicates that a certain message belongs to either category, junk or legitimate mail, and progressively build and refine a dictionary of message contents for improving future classification. Once the dictionary is trained, every incoming e-mail message is decoded and parsed, its content matched against the dictionary, and a relative probability of it being junk is calculated.
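The train-then-score cycle described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the algorithm used by any of the tools discussed in this article: the class name TinyBayesFilter, the tokenizer, and the add-one smoothing are all choices made for the example.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy word-frequency filter; every name here is illustrative only."""

    def __init__(self):
        # The "dictionary": per category, how many messages contained each word.
        self.counts = {"junk": Counter(), "good": Counter()}
        self.messages = {"junk": 0, "good": 0}

    @staticmethod
    def tokenize(text):
        # Lowercased runs of letters, digits, apostrophes, and dollar signs.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        # Count each distinct word once per message.
        self.counts[label].update(set(self.tokenize(text)))
        self.messages[label] += 1

    def junk_probability(self, text):
        # Start from the prior odds, then add one log-likelihood ratio per word.
        log_odds = math.log((self.messages["junk"] + 1) / (self.messages["good"] + 1))
        for word in set(self.tokenize(text)):
            # Add-one smoothing keeps unseen words from producing zero or infinity.
            p_junk = (self.counts["junk"][word] + 1) / (self.messages["junk"] + 2)
            p_good = (self.counts["good"][word] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_junk / p_good)
        # Clamp so that exp() cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

After training on a few junk and a few legitimate messages, junk_probability() returns a value between 0 and 1; words seen mostly in junk push it toward 1, and words seen mostly in good mail push it toward 0.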

Users refine filters interactively: they verify a classification and send only the classification errors back to the Bayesian filter so it can refine its classification rules. Classification errors come in two kinds: a false negative is junk mail erroneously classified as legitimate, and a false positive is legitimate e-mail classified as junk. A false positive is potentially much more disruptive than a false negative because of the risk of inadvertently trashing an important message. After a short time of use, however, we can confidently expect that about 95% of our e-mails are classified correctly, and hence human intervention drops to virtually zero over time. With adequate training, and depending on the type of Bayesian-derived algorithm in use, some programs can achieve a 99.9% success rate. At that point, the only remaining problem is over-confidence in the filter's capabilities.

Still, spam does 'evolve.' The scenario is very similar to a real ecosystem, where continuous mutation and differentiation guarantee survival of the fittest. We'll routinely see new types of junk mail, which will make up that recurring 0.1% of errors, requiring us to retrain the filter on a message once or twice a month.
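The idea of sending only the errors back to the filter can be sketched as a small loop. The KeywordFilter class below is a deliberately crude, hypothetical stand-in for a real Bayesian filter, and train_on_errors is a name invented for this example; the point is only the feedback pattern and the two error types.

```python
class KeywordFilter:
    """Crude stand-in for a real filter: it just remembers words per category."""

    def __init__(self):
        self.words = {"junk": set(), "good": set()}

    def classify(self, text):
        tokens = set(text.lower().split())
        junk_hits = len(tokens & self.words["junk"])
        good_hits = len(tokens & self.words["good"])
        # Ties go to "good", the safer default.
        return "junk" if junk_hits > good_hits else "good"

    def learn(self, text, label):
        self.words[label].update(text.lower().split())


def train_on_errors(mail_filter, labelled_messages):
    """Feed back only the misclassified messages; return a tally of the errors."""
    errors = {"false_positive": 0, "false_negative": 0}
    for text, truth in labelled_messages:
        verdict = mail_filter.classify(text)
        if verdict == truth:
            continue  # correct verdicts need no user action
        if verdict == "junk":
            errors["false_positive"] += 1  # legitimate mail flagged as junk
        else:
            errors["false_negative"] += 1  # junk that reached the inbox
        mail_filter.learn(text, truth)
    return errors
```

Running train_on_errors repeatedly over the same mail should show the error tally shrinking as the filter learns, which is the behavior the paragraph above describes.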

Tools of the trade

To fight spam, I went looking for an open source Bayesian filtering tool. The number of such tools is amazing -- or frightening, depending on the way you look at it. I was looking for a Bayesian filter with the following characteristics:

  • Open-source, GPL or BSD-style
  • Multi-platform
  • Stable, release-grade code
  • Self-contained (not depending on external tools, databases, etc.)
  • High accuracy (multi-word classification)
  • High speed (C or C++, no interpreted languages)
  • Simplicity and versatility of use

After much investigation, I narrowed down the choice to two candidates: Annoyance Filter and CRM114. Both reached 99.9% accuracy in classifying a huge spam corpus of some hundred megabytes, collected from my personal e-mail archive and from many online sources. Annoyance Filter was slightly faster than CRM114: it can process more than 300 KB/sec on my oldish Celeron/600 notebook, and this value skyrockets to almost 1.5 MB/sec on a 2 GHz Athlon PC. In real-life conditions, without any special optimizations, this means processing between 60 and 300 messages per second -- more than adequate for personal use, and probably also for normal corporate use. Of course, performance depends heavily on the size and encoding type (plain text, HTML, MIME, etc.) of the mail messages, and on the number of words you want to treat as significant.
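As a rough check on these figures, the byte rates and the message rates line up if an average message is about 5 KB. That size is an assumption inferred from the numbers above, not a measured value, and the function name is made up for this sketch.

```python
def messages_per_second(throughput_kb_per_sec, avg_message_kb=5.0):
    """Convert filter throughput into a message rate.

    avg_message_kb is an assumed average message size (5 KB by default).
    """
    return throughput_kb_per_sec / avg_message_kb


celeron_rate = messages_per_second(300)    # about 300 KB/sec on the Celeron/600
athlon_rate = messages_per_second(1500)    # almost 1.5 MB/sec on the 2 GHz Athlon
print(celeron_rate, athlon_rate)
```

With larger messages (for example, mail carrying attachments) the message rate drops proportionally, which is why the real-world figure depends so much on the mail mix.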

Both products use a modified Bayesian algorithm, even though the implementation details differ quite substantially. CRM114 uses a hash-codes-only dictionary of five-word groups, while Annoyance Filter offers users the choice of single to N-word groups, and a conventional full-text dictionary (a fast binary dictionary is also available). Annoyance Filter also sports an integrated POP3 proxy, good for limited-capability e-mail clients.

What made me decide was the stability of the code. Annoyance Filter received two minor bug-fix releases (fully backward-compatible) in the last year, so the code is pretty stable. CRM114 has just reached release-candidate status. The code is quite solid and has been usable for many months, but some major changes took place in the past few months, sometimes requiring -- or at least suggesting -- a complete rebuild of the dictionary from scratch.
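To make the difference between a full-text dictionary and a hash-codes-only dictionary concrete, here is a minimal sketch of splitting a message into N-word groups and storing either the text or only a hash of each group. It does not reproduce the actual feature extraction of CRM114 or Annoyance Filter; the function names and the use of CRC-32 as the hash are assumptions made for the example.

```python
import zlib


def word_groups(text, n):
    """Return every run of n consecutive words as a single string."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hashed_groups(text, n):
    """Same groups, but keep only a 32-bit hash code of each one."""
    return [zlib.crc32(group.encode("utf-8")) for group in word_groups(text, n)]


sample = "buy cheap pills online now"
print(word_groups(sample, 2))    # readable entries for a full-text dictionary
print(hashed_groups(sample, 5))  # a single five-word group, stored as a number
```

A full-text dictionary can be inspected and edited by hand, while a hash-only one is compact and fast to look up but cannot be read back into words.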

Tomorrow we'll talk about installing and training Annoyance Filter.

Corrado Cau has worked in the IT field for 15 years and spent most of his career as a system and network administrator on many platforms.


Comments

on Bayesian spam filtering for the masses


POPFile

Posted by: Anonymous Coward on October 28, 2003 10:30 PM
You forgot to mention POPFile. It uses a very capable Bayesian algorithm and also offers a POP3 proxy.

Its filters must be trained for a few days, but then it reaches nearly a 99% rate of junk mail classification.

It is GPL and, because it is based on Perl scripts, it is multi-platform too (the Windows version even has a nice installer with all its dependencies).

It's worth a look: http://popfile.sf.net

#

Re:POPFile

Posted by: Anonymous Coward on October 29, 2003 12:37 AM
What part of NO INTERPRETED CODE did you miss? Perl is interpreted code, ya bonehead!

(but thanks for the suggestion, I'm sure. Oh, what's that? Oh yes, just floodgates keeping back the tide of other similar pet project suggestions. Open that on up.)

#

Re:POPFile

Posted by: Anonymous Coward on October 29, 2003 06:49 AM
Well that's good that the author specified NO INTERPRETED CODE, because Perl is not strictly interpreted. There is a compilation phase and a run-time phase, just like a standard C program. The only difference is the compilation phase happens automatically during program execution.

Or did you mean that there can be no program that interprets its instructions, even after a compilation phase? In that case, pretty much any language would be excluded. Perhaps you should check your facts before making such statements.

#

Re:POPFile

Posted by: Anonymous Coward on October 29, 2003 06:23 AM
No, I didn't neglect to check POPFile; but it's in Perl, and I was looking for fast C/C++ code... as another gentleman already remarked :-)

Corrado

#

Quick Spam Filter

Posted by: Joe Klemmer on October 29, 2003 04:13 AM
I have been using qsf to augment SpamAssassin and it's been fairly good (this system is running a very old setup and newer versions of SpamAssassin won't run on it). I get very rare false positives, maybe one or two a month (it's easy to tell qsf that these are good), and maybe 12 to 15 false negatives (i.e. spam that gets to my inbox) a day. This is not a problem for me, as I don't mind a handful of spam getting through when it catches 250 a day (that's on top of the 350-400 a day that SpamAssassin gets). If I am ever able to upgrade this box to a more current version, I think that SpamAssassin will be enough, but I will probably keep qsf as a backup.

#

"interpreted" languages

Posted by: Anonymous Coward on October 29, 2003 06:45 AM
Your decision to exclude a whole raft of potentially viable solutions based on something as arbitrary as whether it is "interpreted" or "compiled" completely baffles me. Using performance as the reason is a red herring. I have no problem with your decision to exclude programs that do not perform well, but I think you might be surprised at how capably a properly written "interpreted" program can handle something like this. Maybe you should at least try some before issuing a blanket judgement.

#

Re:"interpreted" languages

Posted by: Anonymous Coward on October 29, 2003 09:21 AM
It's not a matter of belief, it's just a matter of (tested) performance.

While evaluating potential candidates, I simply found that Annoyance Filter (and to a lesser extent, CRM114) performed much better than other tools, some of which, by coincidence, are based on Perl and Python.

Whether it's due to the language or to the programmer, frankly I don't care.

I didn't mean to start a holy war, it was as simple as selecting the right tool for the job.

Corrado

#

Re:"interpreted" languages

Posted by: Anonymous Coward on October 30, 2003 02:25 AM
I doubt anyone is still reading this thread, but I still think I need to clarify my point. I wasn't trying to attack you, or condemn your choice of program to use; I just wanted to point out a common misconception. A program written in Perl or Python will not *necessarily* be slower than its equivalent in C or C++. Even you note this fact in this reply. Why you mentioned interpreted languages to begin with is what confused me.

I think it's perfectly fine to choose a program based on performance. I think your inclusion of that in your selection criteria is perfectly fine, but the special note about "no interpreted languages" was unnecessary, and only results in spreading the misconception further (and in most cases was probably redundant anyway).

#

Filters with a Mail Server

Posted by: Anonymous Coward on October 29, 2003 08:16 PM
Now all that needs to be done is add the filter to a mail server and reject the mail before it reaches a mail box.

#

Re:Filters with a Mail Server

Posted by: Anonymous Coward on October 29, 2003 08:53 PM
That's what I do, in fact. Maybe in a future article...

But actually you should never trash mail at the MTA level, before the final recipient can judge for him/herself the nature of the content.

The risk of losing a legitimate e-mail is too big for that (0.1% means that 1 message in 1,000 is possibly a classification error).

The safest option is to flag the mail as spam, so that the recipient can have it autosorted, but still transfer it to the recipient's mailbox.

Agreed, this doesn't solve the problem of the unwanted traffic, though.

Corrado

#

Re:Filters with a Mail Server

Posted by: rjd on November 05, 2003 11:00 AM
Yes, I get in excess of 1,200 pieces of spam a day and have to set my filters very high. Occasionally I learn of false positives, but I can hardly afford to go looking for them on a regular basis. I want to find something as good as Spammix for Eudora, but I want the rejection to happen at the time of connection to the mail server (i.e. at SMTP arrival time). Why? Two good reasons: to let the spammers learn that my regular address is not very useful to them, and to let any legitimate senders know their message got blocked. The latter is the best I can do to help with false positives.

#

filtering for wanted emails

Posted by: cthulhu tonic on October 29, 2003 08:16 PM
/ ...content matched against the dictionary, and a relative probability of it being junk is calculated. /

I presume these Bayesian filters calculate the probability of the e-mail in question being valid too.

I don't get spam so I would not know, but it seems considerably easier to identify emails you want to be getting. Hence you could do this before attempting to filter questionable emails.

#

Re:filtering for wanted emails

Posted by: Anonymous Coward on October 29, 2003 09:02 PM
It's the same; the two faces of the same coin, actually.

When you calculate the 'spam-score', you get a range between 0 and 1; whatever scores over 0.90 is by default (allegedly) spam, and the rest is mail.

Then you just decide if you want to take explicit action for the one or the other.

Corrado

PS: you *don't* get spam??? a miracle!

#

Bayesian Filtering

Posted by: Anonymous Coward on October 30, 2003 01:40 AM
We've been using Bayesian filtering for several months at the email service I use (www.cotse.net), and it didn't take much training for the filter to get uncanny results. I keep training it, but I was amazed at how well it worked.

#
