October 28, 2003

Bayesian spam filtering for the masses

Author: Corrado Cau

Spam, or unsolicited commercial e-mail, is now a sad part of everyday life online. Research companies estimate that more than 50% of worldwide e-mail traffic is spam. As a result, it's becoming steadily more difficult and time-consuming to sort legitimate e-mail from the deluge of commercial messages flooding our inboxes. But there are ways to fight back. In this series, we'll walk through choosing and setting up a highly effective package for screening out spam.

The first successful attempt to can spam relied on public shared blacklists, to which people around the world contributed sample junk messages and Internet (IP) addresses. Mail transfer agents (MTAs) could check the lists and reject known messages and addresses. At the user level, people started using the keyword-based filters offered by e-mail clients. If a message contained an unwanted word, a filter rule could send the entire message to the trash automatically.

Professional spammers began circumventing these defenses in two ways. They send their messages from freshly hacked, ever-changing computers around the world, so that their IP addresses aren't blacklisted, and they take advantage of modern e-mail standards to garble and hide the real message content. For instance, recent junk mails use a mix of HTML code, Base64 MIME encoding, attachments, pictures, and other tricks to avoid early detection of the message contents. Since decoding these formats is computationally expensive, spammers hope that the real content of their messages won't be spotted and discarded at the MTA level, while in transit.
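To make the encoding trick concrete, here is a minimal Python sketch of why a keyword filter that scans only the raw message text can miss a Base64-encoded body. The blocked-word list, the sample message, and the function names are all hypothetical, invented for illustration; the decoding relies on Python's standard email package.

```python
import base64
from email import message_from_string

BLOCKED_WORDS = {"viagra", "mortgage"}  # hypothetical keyword list


def naive_keyword_filter(raw_message: str) -> bool:
    """Return True if a blocked word appears in the raw, undecoded text."""
    lowered = raw_message.lower()
    return any(word in lowered for word in BLOCKED_WORDS)


def decoding_keyword_filter(raw_message: str) -> bool:
    """Undo the MIME transfer encoding of each part, then look for blocked words."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no text of their own
        payload = part.get_payload(decode=True)  # bytes with Base64 undone
        if payload is None:
            continue
        text = payload.decode("utf-8", errors="replace").lower()
        if any(word in text for word in BLOCKED_WORDS):
            return True
    return False


# A made-up message whose body is hidden behind Base64.
encoded_body = base64.b64encode(b"Cheap mortgage rates, act now!").decode("ascii")
raw = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded_body + "\n"
)

missed = not naive_keyword_filter(raw)   # raw scan sees only Base64 characters
caught = decoding_keyword_filter(raw)    # decoded scan sees the real words
```

The second function costs more per message because every part has to be decoded before it can be inspected, which is exactly the expense spammers are counting on.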

At the same time, junk mails are becoming either more verbose or minimalistic to excess, and they often mask and dilute the real content by padding the message with random text or excerpts from books or magazines. These new tricks are putting rule-based filtering out of business, and even public lists are becoming less effective: their information becomes obsolete in no time.

Today we need human-like judgment to effectively discriminate junk mail from real messages. That's quite easy for human beings, but a nightmare for computers.

A solution: Bayesian filters

Introducing Bayesian filters -- programs that can learn by example to recognize legitimate mail for their owners. These statistical filters learn and remember when a specific word, group of words, or stream of bytes is associated with what the user perceives as either junk or good e-mail. I say 'what the user perceives as' because the matter is subjective and personal; the concept of junk mail changes with the social context, age, and inclinations of the recipient.

Practically speaking, Bayesian filters accumulate statistics every time a user indicates that a certain message belongs to either category, junk or legitimate mail, and progressively build and refine a dictionary of message contents to improve future classification. Once the dictionary is trained, every incoming e-mail message is decoded and parsed, its content is matched against the dictionary, and the probability that it is junk is calculated.
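The train-then-score cycle described above can be sketched in a few lines of Python. This is a simplified textbook naive Bayes classifier with add-one smoothing, not the modified algorithm used by any particular product; the class name, tokenizer, and sample messages are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy word-level Bayesian filter (illustrative only)."""

    def __init__(self):
        # The "dictionary": in how many junk/good messages each word appeared.
        self.word_counts = {"junk": Counter(), "good": Counter()}
        self.message_counts = {"junk": 0, "good": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def train(self, text, label):
        """Record one user-classified message; label is 'junk' or 'good'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(set(self.tokenize(text)))

    def junk_probability(self, text):
        """Combine per-word evidence into a probability that text is junk."""
        n_junk = self.message_counts["junk"]
        n_good = self.message_counts["good"]
        # Start from the smoothed prior odds, then add each word's log ratio.
        log_odds = math.log((n_junk + 1) / (n_good + 1))
        for word in set(self.tokenize(text)):
            p_word_junk = (self.word_counts["junk"][word] + 1) / (n_junk + 2)
            p_word_good = (self.word_counts["good"][word] + 1) / (n_good + 2)
            log_odds += math.log(p_word_junk / p_word_good)
        # Clamp so exp() cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


spam_filter = TinyBayesFilter()
spam_filter.train("Cheap pills, buy now", "junk")
spam_filter.train("Buy cheap watches now", "junk")
spam_filter.train("Meeting agenda for Monday", "good")
spam_filter.train("Lunch on Monday?", "good")

junk_score = spam_filter.junk_probability("buy cheap pills")
good_score = spam_filter.junk_probability("agenda for Monday lunch")
```

With only four training messages the scores are crude, but the first query already lands well above 0.9 and the second well below 0.1. A real filter works the same way in principle, with a far larger dictionary and a more careful treatment of rare words.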

Users refine filters interactively: they verify a classification and send only the classification errors back to the Bayesian filter so it can refine its classification rules. Classification errors come in two kinds: a false negative is junk mail erroneously classified as legitimate, and a false positive is a legitimate e-mail classified as junk. A false positive is potentially much more disruptive than a false negative because of the risk of inadvertently trashing an important message. After a short time of use, however, we can confidently expect that about 95% of our e-mails are classified correctly, and the need for human intervention drops to virtually zero over time. With adequate training, and depending on the type of Bayesian-derived algorithm in use, some programs can achieve a 99.9% success rate. At that point, the only remaining problem is over-confidence in the filter's capabilities.
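For readers who want to track these numbers on their own mailbox, the following Python sketch tallies false positives, false negatives, and overall accuracy from a list of (actual, predicted) labels. The function name and the sample data are hypothetical; the sample is constructed so that one error in a thousand messages gives the 99.9% figure mentioned above.

```python
def error_summary(results):
    """Tally errors from (actual, predicted) pairs labeled 'junk' or 'good'."""
    total = false_positives = false_negatives = 0
    for actual, predicted in results:
        total += 1
        if actual == "good" and predicted == "junk":
            false_positives += 1   # legitimate mail sent to the trash
        elif actual == "junk" and predicted == "good":
            false_negatives += 1   # spam that slipped into the inbox
    accuracy = (total - false_positives - false_negatives) / total if total else 0.0
    return {
        "total": total,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "accuracy": accuracy,
    }


# Made-up sample: 1,000 messages with a single legitimate message misfiled.
sample = (
    [("junk", "junk")] * 600
    + [("good", "good")] * 399
    + [("good", "junk")]
)
summary = error_summary(sample)
```

Keeping the two error counts separate matters because, as noted above, a single false positive can cost far more than a single false negative.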

Still, spam does 'evolve.' The scenario is very similar to a real ecosystem, where continuous mutation and differentiation guarantee survival of the fittest. We'll routinely see new types of junk mail, which will make up that recurring 0.1% of errors and require us to retrain the filter on a misclassified message once or twice a month.

Tools of the trade

To fight spam, I went looking for an open source Bayesian filtering tool. The number of such tools is amazing -- or frightening, depending on the way you look at it. I was looking for a Bayesian filter with the following characteristics:

  • Open-source, GPL or BSD-style
  • Multi-platform
  • Stable, release-grade code
  • Self-contained (not depending on external tools, databases, etc.)
  • High accuracy (multi-word classification)
  • High speed (C or C++, no interpreted languages)
  • Simplicity and versatility of use

After much investigation, I narrowed down the choice to two candidates: Annoyance Filter and CRM114. Both reached 99.9% accuracy in classifying a huge spam corpus of some hundred megabytes collected from my personal e-mail archive and from many online sources. Annoyance Filter was slightly faster than CRM114; it can process data in excess of 300 KB/sec on my oldish Celeron/600 notebook, and this value skyrockets to almost 1.5 MB/sec on a 2 GHz Athlon PC. In real-life conditions, without any special optimizations, this means processing between 60 and 300 messages per second -- more than adequate for personal use, and probably also for normal corporate use. Of course, performance depends heavily on the size and encoding type (plain-text, HTML, MIME, etc.) of the mail messages, and on the number of words you want to take as significant.
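The step from KB/sec to messages per second can be checked with a little arithmetic. The quoted figures are consistent with an average message size of about 5 KB; that size is an inference from the numbers above rather than something measured, and the sketch treats 1.5 MB as 1,500 KB.

```python
AVG_MESSAGE_KB = 5.0  # assumption inferred from the quoted figures, not measured


def messages_per_second(throughput_kb_per_sec, avg_message_kb=AVG_MESSAGE_KB):
    """Convert raw filter throughput into an approximate message rate."""
    return throughput_kb_per_sec / avg_message_kb


notebook_rate = messages_per_second(300)    # Celeron/600 notebook -> 60.0
desktop_rate = messages_per_second(1500)    # 2 GHz Athlon PC -> 300.0
```

A mailbox full of large HTML or MIME messages would pull the average size up and the message rate down in proportion.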

Both products use a modified Bayesian algorithm, even though the implementation details differ quite substantially. CRM114 uses a hash-code-only dictionary of five-word groups, while Annoyance Filter offers users the choice of single to N-word groups, and a conventional full-text dictionary (a fast binary dictionary is also available). Annoyance Filter also sports an integrated POP3 proxy, good for limited-capability e-mail clients.
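The difference between word groups stored as text and word groups stored only as hash codes can be illustrated with a short Python sketch. It shows the general idea only: the windowing, the use of SHA-1, and the 32-bit truncation are assumptions chosen for illustration and are not the actual schemes used by CRM114 or Annoyance Filter.

```python
import hashlib
from collections import Counter


def word_groups(text, n):
    """Return every run of n consecutive words in text (sliding window)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def group_hash(group):
    """Reduce a word group to a 32-bit number; the text cannot be recovered."""
    digest = hashlib.sha1(group.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


text = "click here to claim your free prize today"

# Full-text dictionary keyed by readable two-word groups.
full_text_dictionary = Counter(word_groups(text, 2))

# Hash-only dictionary keyed by numbers derived from five-word groups.
hash_only_dictionary = Counter(group_hash(g) for g in word_groups(text, 5))
```

A full-text dictionary can be inspected and edited by hand, while a hash-only one is compact and fast to look up but opaque. Longer groups capture more context per entry at the price of a larger dictionary and more training data.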

What made me decide was the stability of the code. Annoyance Filter received two minor bug-fix releases (fully backward-compatible) in the last year, so the code is pretty stable. CRM114 has just reached release-candidate status. The code is quite solid and has been usable for many months, but some major changes took place in the past few months, sometimes requiring -- or at least suggesting -- a complete rebuild of the dictionary from scratch.

Tomorrow we'll talk about installing and training Annoyance Filter.

Corrado Cau has worked in the IT field for 15 years and spent most of his career as a system and network administrator on many platforms.

