October 29, 2003

Training Annoyance Filter to combat spam

Author: Corrado Cau

Last time we looked at the problem of spam, and at Bayesian filtering software as a possible solution. Having settled on Annoyance Filter as a product to use to battle spam, we now need to install the software and learn how to use it, before proceeding to integrate it with our e-mail client.

Annoyance Filter is distributed in source-code format for compiling on all of the major platforms. The distribution tarball also contains ready-to-run binaries for Windows.

We'll focus on a generic recent Linux distribution for this article, still keeping in mind that any other platform sporting the bash shell and the GNU C++ compiler will do.

Please note that these quick installation instructions are not meant to replace Annoyance Filter's real, in-depth documentation; they are only minimal guidelines for the impatient.

First download annoyance-filter-1.0b.tar.gz and copy it to a temporary directory. Type the command tar xzvf annoyance-filter-1.0b.tar.gz and you'll end up with a subdirectory called -- guess it -- annoyance-filter-1.0b. Change to that directory and issue the usual commands:

./configure
make
make check (optional, but it verifies the proper functionality of Annoyance Filter)
make install

If you want to add a bit of performance when you compile the source for an i686 (generic Pentium Pro or better) architecture, edit the file Makefile after running ./configure and add the switch -march=i686 to the line starting with CFLAGS; afterwards, it should look like:

CFLAGS = -Wall -g -march=i686 -O2 (or you may want to use athlon or pentium4 here)
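If you'd rather script that edit than make it by hand, here is a minimal Python sketch of the same change. It assumes, as described above, that the generated Makefile has a line starting with CFLAGS. The function name add_march_flag is made up for this illustration, and the switch is appended at the end of the line rather than inserted in the middle, which should make no difference to the compiler.

```python
import re

def add_march_flag(makefile_text, arch="i686"):
    """Append -march=<arch> to every CFLAGS line that lacks a -march switch."""
    out = []
    for line in makefile_text.splitlines():
        # Only touch lines that assign CFLAGS, e.g. "CFLAGS = -Wall -g -O2".
        if re.match(r"CFLAGS\s*=", line) and "-march=" not in line:
            line = line.rstrip() + " -march=" + arch
        out.append(line)
    return "\n".join(out) + "\n"

if __name__ == "__main__":
    sample = "CC = gcc\nCFLAGS = -Wall -g -O2\n"
    print(add_march_flag(sample, "pentium4"), end="")
```

Running the function twice on the same text leaves it unchanged the second time, so it is safe to re-run after another ./configure.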

You should now have an executable ready to use in /home/Your-Username/.annoyance-filter.

The author of Annoyance Filter suggests using the program as a personal application and not a system-wide utility, but in my opinion symlinking it to /usr/local/bin makes life easier. To do that, switch to the superuser and create a symlink named af in /usr/local/bin, pointing to the annoyance-filter executable:

ln -s /home/Your-Username/.annoyance-filter/annoyance-filter /usr/local/bin/af

Creating the dictionary

Now we want to create a new dictionary. Annoyance Filter uses a main dictionary, portable among different architectures and fully interoperable, and a lightweight version for fast operations.

We need a sample e-mail message for creating the dictionary, and many more for training it. Open your e-mail client -- I used KMail. Begin by right-clicking on a legitimate message and selecting 'Save as,' and name the file MyMail-1.txt. Now run the command:

af -v --phrasemin 1 --phrasemax 2 --mail MyMail-1.txt --prune --write Dict.bin --fwrite FastDict.bin

and you'll find the new dictionaries in the current directory. I suggest moving them to the .annoyance-filter directory in your home directory and creating two symlinks in /usr/local/bin pointing to them.

Annoyance Filter uses only one dictionary for storing statistics about both junk and legitimate e-mail; every token (a word or group of words) in the dictionary gets a score denoting the probability of it being junk mail, and these probabilities are used when classifying new e-mail messages.
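To make the idea of a per-token score concrete, here is a small Python sketch of a Graham-style estimate. It is an assumption for illustration only, not Annoyance Filter's actual formula (the manual documents that); the function name and its parameters are hypothetical. The clamp to the 0.01 to 0.99 range simply keeps any single token from counting as absolute proof.

```python
def junk_probability(junk_hits, mail_hits, junk_total, mail_total):
    """Graham-style estimate of how strongly a token indicates junk.

    junk_hits / mail_hits:   occurrences of the token in junk / legitimate mail.
    junk_total / mail_total: number of junk / legitimate messages trained so far.
    """
    junk_rate = junk_hits / max(junk_total, 1)
    mail_rate = mail_hits / max(mail_total, 1)
    if junk_rate + mail_rate == 0:
        return None  # token never seen: no opinion either way
    p = junk_rate / (junk_rate + mail_rate)
    # Clamp so that no single token is ever treated as certain.
    return min(0.99, max(0.01, p))

if __name__ == "__main__":
    # A token seen 40 times in 100 junk messages and never in 100 good ones.
    print(junk_probability(40, 0, 100, 100))
```

A token seen equally often in both kinds of mail scores 0.5 and tells the filter nothing, which is why the most useful tokens are the ones far from the middle.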

Dict.bin (as I chose to call it) is the main dictionary, while FastDict.bin is the faster binary representation of Dict.bin. It can be re-created from Dict.bin when needed with the command:

af -v --read Dict.bin --prune --fwrite FastDict.bin
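One simple way to decide when "needed" applies is to compare file modification times. The Python sketch below is only a heuristic under that assumption, and the function name fast_dict_is_stale is hypothetical: it reports the fast dictionary as stale when it is missing or older than the main one.

```python
import os

def fast_dict_is_stale(dict_path, fast_path):
    """True if the fast dictionary is missing or older than the main one."""
    if not os.path.exists(fast_path):
        return True
    # Compare last-modification times of the two files.
    return os.path.getmtime(fast_path) < os.path.getmtime(dict_path)
```

If it returns True, rebuild the fast dictionary with the af command shown above.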

So far the dictionary knows only one message, and it needs to gather much more information before getting a grip on what its owner (dis)likes. Go back to KMail, select all of the sent messages (they're usually kept in the Sent-Mail folder) by pressing the letter K, then right-click and save them all at once in a BSD-style mailfolder, or mbox; Annoyance Filter has a special option (--bsdfolder) for coping with mailfolders, so that it can learn hundreds of messages in one shot. The messages you send are normally representative of the types of e-mails you'd like to receive, so it's a good thing to train Annoyance Filter using them as examples of legitimate e-mail.

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --bsdfolder --mail FileName.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin

Note: the above command must be typed in one line; it's split in two for clarity.
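Before feeding a mailfolder to Annoyance Filter, it can be worth checking that the export really is an mbox file holding the number of messages you expect. The following sketch uses Python's standard mailbox module, which is independent of Annoyance Filter; the function name count_mbox_messages is made up for this example.

```python
import mailbox

def count_mbox_messages(path):
    """Return the number of messages found in a BSD-style (mbox) mailfolder."""
    # create=False makes a missing file an error instead of creating an empty one.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

A count of zero or one for a folder that should hold hundreds of messages usually means the client saved something other than an mbox file.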

Now Annoyance Filter has an idea of what legitimate messages should look like, but it still needs to learn about junk mail. If you've collected some spam already, great; otherwise, you should start doing so right now. You'll quickly build up a collection of samples of the kind of spam you receive.

If you're in a rush and have no 'personal spam' available, you can resort to some public spam archives (SpamAssassin.org and Annexia.org come to mind) and collect a few dozen junk e-mails. Of course this gives you only generic training, since the mix of spam and legitimate messages is peculiar to each individual.

You can train Annoyance Filter with them one by one with the command:

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --junk SingleSpam.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin

or collectively, by means of a BSD-style mailfolder, with the command:

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --bsdfolder --junk SpamFolder.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin

In the above setup we're building a dictionary containing single words and groups of two words as tokens. The --prune option discards very infrequent tokens, so that dictionary space isn't wasted on a plethora of useless tokens. In Annoyance Filter, you can choose an arbitrary number of words per token, from single words to groups of, say, 10 words; in practice, the more words per token, the more performance degrades, unfortunately without necessarily giving better results.
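To see why longer phrases inflate the dictionary, here is a toy Python sketch of how runs of consecutive words become tokens. It only illustrates the phrase window; Annoyance Filter's real tokenizer also deals with headers, encodings and punctuation, and the function name phrases is hypothetical.

```python
def phrases(words, phrasemin=1, phrasemax=2):
    """Return every run of phrasemin..phrasemax consecutive words as a token."""
    tokens = []
    for size in range(phrasemin, phrasemax + 1):
        # Slide a window of `size` words across the message.
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens

if __name__ == "__main__":
    words = "click the link below".split()
    print(phrases(words, 1, 2))
    print(len(phrases(words, 1, 5)))
```

Four words give seven tokens with a maximum phrase length of two; on a real message of several hundred words, each extra word of phrase length adds nearly as many new tokens again.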

I've experimented with tokens ranging in size from one to five words, and found the best compromise between performance and accuracy by using one- and two-word tokens at the same time. If you don't trust my experiments, or just want to have some fun (?!) on your own, try raising the --phrasemin and --phrasemax parameters up one notch (or more), rebuilding the dictionary, and judging for yourself.

When training or classifying e-mail messages you must always pass them to Annoyance Filter along with all of the e-mail headers. Thus, you cannot select the text of an e-mail and cut-and-paste it to a file; if you did, the mail headers and the original encoding would be lost, and they play an important role in the learning and classification processes. The correct way of exporting a message from most graphical e-mail clients is to right-click on it and choose Save As.
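A quick way to catch a file that was pasted rather than exported is to check whether it starts with real headers. The sketch below uses Python's standard email package and a deliberately crude rule of thumb (both a From and a Date header must be present); the function name has_mail_headers is made up for this illustration.

```python
from email import message_from_string

def has_mail_headers(raw_text):
    """Heuristic: does this text start with real e-mail headers?"""
    msg = message_from_string(raw_text)
    # A message saved with "Save As" normally carries at least From and Date;
    # text pasted from the message body parses with no such headers.
    return msg["From"] is not None and msg["Date"] is not None

if __name__ == "__main__":
    pasted = "Hello,\nhere is the offer you asked about.\n"
    print(has_mail_headers(pasted))
```

It cannot tell you whether the headers are complete, only that the file is not bare body text.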

If the process of building a dictionary from scratch sounds too time-consuming for you, I've put together a utilities tarball you can download that includes a pre-built dictionary. It covers spam only, so you will still need to add training for your own legitimate mail.

Training one by one

Now it's time to start classifying some samples from our e-mail archive: that lets us judge whether the dictionary's level of training is already satisfactory or whether it needs some more learning.

Extract a few messages from KMail and try classifying them, one by one, with the command:

af -v --fread /usr/local/bin/FastDict.bin --phrasemin 1 --phrasemax 2 --class sample1.txt

For this trial we're using the fast dictionary, instead of the full-blown one; the fast dictionary is in fact conceived to be just read, and it's optimized for access speed.

The output of the classification command will be something like this (we used the -v verbose flag):

 Loaded fast dictionary from /usr/local/bin/FastDict.bin.
 Phrase minimum length set to 1 word.
 Phrase maximum length set to 2 words.
 Rank  Probability  Token
  1      0.99 your needs
  2      0.99 the clock
  3      0.99 receive future
  4      0.99 plain content-transfer-encoding
  5      0.99 net id
  6      0.01 mon oct
  7      0.99 link below
  8      0.99 gmt x-mailer
  9      0.99 get out
 10      0.99 future offers
 11      0.99 dyndns org
 12      0.99 dyndns
 13      0.99 debt
 14      0.99 d c
 15      0.99 build mime-version
ProbP = 0.0086875, ProbQ = 9.9e-29
Message junk probability: 1
JUNK

We're being shown the 15 most significant tokens found in the message. A probability of 0.99 is the maximum junk probability a token can have, while the ranking is determined by the number of times a token has been found in junk mail in the past, and learned accordingly.
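The ProbP and ProbQ figures in that output are consistent with the usual Graham-style way of combining token scores: multiply the token probabilities together, multiply their complements together, and take the ratio. The Python sketch below reproduces the numbers under that assumption; it is not taken from Annoyance Filter's source, and the variable names are made up.

```python
import math

# Token probabilities copied from the sample output above:
# fourteen tokens at 0.99 and one ("mon oct") at 0.01.
token_probs = [0.99] * 14 + [0.01]

prob_p = math.prod(token_probs)                   # product of junk probabilities
prob_q = math.prod(1.0 - p for p in token_probs)  # product of their complements
junk_probability = prob_p / (prob_p + prob_q)

print(f"ProbP = {prob_p:.5g}, ProbQ = {prob_q:.2g}")
print(f"Message junk probability: {junk_probability:.6g}")
```

Even with one strongly "good" token in the mix, ProbQ is so much smaller than ProbP that the combined probability rounds to 1.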

In this case the e-mail was recognised as junk, and indeed junk it was; otherwise the final verdict would have been MAIL instead.

Imagine that for some reason -- insufficient training? -- the above message got erroneously classified as MAIL; in that case, we would need to (re)train Annoyance Filter with that message, specifically directing it to 'learn it as junk.' This is easily accomplished by entering:

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --junk sample1.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin

If you need to teach Annoyance Filter a legitimate mail instead, use --mail in lieu of --junk.
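Since the mail and junk variants differ only in one switch, the command line is easy to generate. Here is a minimal Python sketch that builds the argument list using only the options shown in this article; the function name training_command and its parameters are hypothetical, and the default paths are the ones used throughout this walkthrough.

```python
def training_command(kind, sample, dict_path="/usr/local/bin/Dict.bin",
                     fast_path="/usr/local/bin/FastDict.bin", folder=False):
    """Build the af argument list for training one file as 'mail' or 'junk'."""
    if kind not in ("mail", "junk"):
        raise ValueError("kind must be 'mail' or 'junk'")
    cmd = ["af", "-v", "--read", dict_path, "--phrasemin", "1", "--phrasemax", "2"]
    if folder:
        cmd.append("--bsdfolder")  # the sample is a BSD-style mailfolder
    cmd += ["--" + kind, sample, "--prune", "--write", dict_path, "--fwrite", fast_path]
    return cmd

if __name__ == "__main__":
    print(" ".join(training_command("junk", "sample1.txt")))
```

The returned list can be handed to subprocess.run(cmd, check=True) to actually run the training.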

Afterwards we should check that the new training was enough by running the classification once more. If the mail is now correctly showing as JUNK, fine; otherwise, repeat the above step as many times as necessary to achieve correct classification. Sometimes it's necessary to repeat the same training two or three times, especially when the dictionary is already heavily trained, or when the message is a borderline case -- that is, a message that could be either good or junk mail, and that you want classified differently from the way the software chooses.

Always remember the TOE principle -- Train Only Errors. If it ain't broken, don't fix it. Moreover, do not overdo training. The more you bias the dictionary toward junk (or good) e-mails by unnecessary learning, the more likely the software will misclassify messages in that direction. It's obvious, if you think about it: if all you've got is a hammer, everything you see looks like a nail.
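The classify, retrain and re-check routine, combined with Train Only Errors, can be sketched as a short loop. This is only an illustration in Python: the classify and train arguments are placeholders you would have to supply yourself (for instance, wrappers around the af commands shown earlier), and the function name train_only_errors is hypothetical.

```python
def train_only_errors(samples, classify, train, max_rounds=3):
    """Retrain only misclassified messages; return the paths still wrong.

    samples:  list of (path, expected) pairs, expected being "JUNK" or "MAIL".
    classify: callable(path) returning the current verdict for that file.
    train:    callable(path, expected) that teaches the dictionary one message.
    """
    still_wrong = []
    for path, expected in samples:
        rounds = 0
        # Correctly classified messages are left alone: Train Only Errors.
        while classify(path) != expected and rounds < max_rounds:
            train(path, expected)
            rounds += 1
        if classify(path) != expected:
            still_wrong.append(path)
    return still_wrong
```

The max_rounds cap reflects the "two or three times" rule of thumb; a message that is still wrong after that is better inspected by hand than hammered into the dictionary.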

Even advanced users should Read The Fine Manual. It's enough to read pages 5 through 9 to learn about more advanced options like:

--pdiag (possibly in association with --ptrace)
--biasmail and --newword
--sigwords
--threshjunk and --threshmail

Those options should be enough to keep you playing for a lifetime, but they are actually useful for heavily altering the default behavior of Annoyance Filter (which quite honestly fits most of us), and for gathering low-level diagnostics when things don't go the way they're supposed to.

A final note -- remember to back up your Dict.bin dictionary from time to time. Why? Bayesian filters are hell on earth for regression testing. When adding even one token to the dictionary, you could be surprised by the way it affects the general statistics. Sometimes this means your formerly excellent scores can go substantially down the drain.
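A backup can be as simple as a timestamped copy taken before each training session. Here is a minimal Python sketch using only the standard library; the function name backup_dictionary and the naming scheme are assumptions for this example, not anything Annoyance Filter provides.

```python
import shutil
import time
from pathlib import Path

def backup_dictionary(dict_path, backup_dir):
    """Copy the dictionary into backup_dir under a timestamped name."""
    src = Path(dict_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves the file's timestamps
    return dest
```

If a round of training makes your scores worse, copy the most recent backup over Dict.bin and rebuild the fast dictionary from it.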

You can also export the dictionary to a comma-delimited (CSV) text file, for fiddling around with the data -- not recommended -- or for merging the data with other dictionaries from (trusted!) friends and colleagues. Use the commands:

Export CSV data: af -v --read /usr/local/bin/Dict.bin --csvwrite DictList.csv
Import CSV data: af -v --csvread OtherDict.csv --write NewDict.bin
Import and Merge CSV data: af -v --read /usr/local/bin/Dict.bin --csvread OtherDict.csv --write NewDict.bin

Once you have Annoyance Filter working to your satisfaction you'll want to integrate it with your e-mail client. We'll talk about integrating Annoyance Filter with KMail in the final part of this series.

Corrado Cau has worked in the IT field for 15 years and spent most of his career as a system and network administrator on many platforms.
