Linux.com


Training Annoyance Filter to combat spam

By Corrado Cau on October 29, 2003 (8:00:00 AM)


Last time we looked at the problem of spam, and at Bayesian filtering software as a possible solution. Having settled on Annoyance Filter as a product to use to battle spam, we now need to install the software and learn how to use it, before proceeding to integrate it with our e-mail client.

Annoyance Filter is distributed in source-code format for compiling on all of the major platforms. The distribution tarball also contains ready-to-run binaries for Windows.

We'll focus on a generic recent Linux distribution for this article, still keeping in mind that any other platform sporting the bash shell and the GNU C++ compiler will do.

Please note that these quick installation instructions are not meant to replace the real in-depth documentation of Annoyance Filter; they are only minimal guidelines for the impatient.

First download annoyance-filter-1.0b.tar.gz and copy it to a temporary directory. Type the command tar xzvf annoyance-filter-1.0b.tar.gz and you'll end up with a subdirectory called -- you guessed it -- annoyance-filter-1.0b. Change to that directory and issue the usual commands:

./configure
make
make check (optional, but it verifies the proper functionality of Annoyance Filter)
make install

If you want to add a bit of performance when you compile the source for an i686 (generic Pentium Pro or better) architecture, edit the file Makefile after running ./configure and add the switch -march=i686 to the line starting with CFLAGS; afterwards, it should look like:

CFLAGS = -Wall -g -march=i686 -O2 (or you may want to use athlon or pentium4 here)
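
If you'd rather not edit the Makefile by hand, the same change can be scripted. What follows is only a minimal sketch and not part of the Annoyance Filter distribution: add_march is a hypothetical helper name, and it assumes the generated Makefile has a line starting with "CFLAGS =".

```shell
#!/bin/sh
# add_march FILE [ARCH]: append -march=ARCH to the CFLAGS line of FILE,
# unless that line already carries a -march switch.
# (Hypothetical helper; ARCH defaults to i686.)
add_march() {
    makefile=$1
    arch=${2:-i686}
    # Leave the file alone if CFLAGS already has a -march switch.
    if grep -q '^CFLAGS.*-march=' "$makefile"; then
        return 0
    fi
    # Write to a temporary file first, so a failed sed cannot truncate the Makefile.
    sed "/^CFLAGS[[:space:]]*=/ s/\$/ -march=$arch/" "$makefile" > "$makefile.tmp" &&
        mv "$makefile.tmp" "$makefile"
}

# Usage, after running ./configure:
#   add_march Makefile i686
```

The switch ends up at the end of the line rather than before -O2; the compiler doesn't care about the order of these two switches.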

You should now have an executable ready to use in /home/Your-Username/.annoyance-filter.

The author of Annoyance Filter suggests using the program as a personal application and not a system-wide utility; in my opinion symlinking it to /usr/local/bin makes life easier. To do that, change to the superuser and create a symlink named af in /usr/local/bin, pointing to the annoyance-filter executable:

ln -s /home/Your-Username/.annoyance-filter/annoyance-filter /usr/local/bin/af

Creating the dictionary

Now we want to create a new dictionary. Annoyance Filter uses a main dictionary, portable among different architectures and fully interoperable, and a lightweight version for fast operations.

We need a sample e-mail message for creating the dictionary, and many more for training it. Open your e-mail client -- I used KMail. Begin by right-clicking on a legitimate message and selecting 'Save as,' and give it the name MyMail-1.txt. Now run the command:

af -v --phrasemin 1 --phrasemax 2 --mail MyMail-1.txt --prune --write Dict.bin --fwrite FastDict.bin

and you'll find the new dictionaries in the current directory. I suggest moving them to the .annoyance-filter directory in your home directory and creating two symlinks in /usr/local/bin pointing to them.

Annoyance Filter uses only one dictionary for storing statistics about both junk and legitimate e-mail; every token (a word or group of words) in the dictionary gets a score denoting the probability of it being junk mail, and these probabilities are used when classifying new e-mail messages.

Dict.bin (as I chose to call it) is the main dictionary, while FastDict.bin is the faster binary representation of Dict.bin. It can be re-created from Dict.bin when needed with the command:

af -v --read Dict.bin --prune --fwrite FastDict.bin

So far the dictionary knows only one message, and it needs to gather much more information before getting a grip on what its owner (dis)likes. Go back to KMail, select all of the sent messages (they're usually kept in the Sent-Mail folder) by pressing the letter K, then right-click and save them all at once in a BSD-style mailfolder, or mbox; Annoyance Filter has a special option (--bsdfolder) for coping with mailfolders, so that it can learn hundreds of messages in one shot. The messages you send are normally representative of the types of e-mails you'd like to receive, so it's a good thing to train Annoyance Filter using them as examples of legitimate e-mail.

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --bsdfolder --mail FileName.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin

Note: the above command must be typed on one line; it's split in two for clarity.
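
Before feeding a saved mailfolder to Annoyance Filter, it can be worth checking that the file really holds as many messages as you expect. A minimal sketch, assuming the file is a standard mbox in which every message begins with a line starting with "From " (the word followed by a space); mbox_count is a hypothetical helper name:

```shell
#!/bin/sh
# mbox_count FILE: print the number of messages in an mbox file.
# Each message in an mbox starts with a separator line beginning "From "
# (not the "From:" header), so counting those lines counts the messages.
mbox_count() {
    grep -c '^From ' "$1"
}

# Usage:
#   mbox_count FileName.txt
```

If the count is far from what KMail showed, the export probably went wrong and is not worth training on. A body line that starts with "From " and was not escaped by the mail client would inflate the count.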

Now Annoyance Filter has an idea of what legitimate messages should look like, but it still needs to learn about junk mail. If you've collected some spam already, great; otherwise, you should start doing this right now. You'll quickly have a collection of samples for your type of spam.

If you're in a rush and have no 'personal spam' available, you can resort to some public spam archives (SpamAssassin.org and Annexia.org come to mind) and collect a few dozen junk e-mails. Of course this gives you only generic training, since the mix of spam and legitimate messages is peculiar to each individual.

You can train Annoyance Filter with them one by one with the command:

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --junk SingleSpam.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin

or collectively, by means of a BSD-style mailfolder, with the command:

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --bsdfolder --junk SpamFolder.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin
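
The training commands above differ only in --mail versus --junk and in the presence of --bsdfolder, so they lend themselves to a small wrapper. This is a sketch under stated assumptions: af_learn, AF_DICT and AF_FAST are names invented here, and the options are exactly the ones used in this article.

```shell
#!/bin/sh
# Dictionary locations used in this article; override them through the
# environment if yours live elsewhere. (AF_DICT and AF_FAST are made-up names.)
AF_DICT=${AF_DICT:-/usr/local/bin/Dict.bin}
AF_FAST=${AF_FAST:-/usr/local/bin/FastDict.bin}

# af_learn KIND FILE [folder]
#   KIND    mail (legitimate) or junk
#   FILE    a saved message, or an mbox file when "folder" is given
af_learn() {
    kind=$1
    file=$2
    case $kind in
        mail|junk) ;;
        *) echo "usage: af_learn mail|junk FILE [folder]" >&2; return 2 ;;
    esac
    if [ "${3:-}" = folder ]; then
        # Whole mailfolder in one shot.
        af -v --read "$AF_DICT" --phrasemin 1 --phrasemax 2 \
           --bsdfolder "--$kind" "$file" --prune \
           --write "$AF_DICT" --fwrite "$AF_FAST"
    else
        # Single message.
        af -v --read "$AF_DICT" --phrasemin 1 --phrasemax 2 \
           "--$kind" "$file" --prune \
           --write "$AF_DICT" --fwrite "$AF_FAST"
    fi
}

# Usage:
#   af_learn mail FileName.txt folder
#   af_learn junk SingleSpam.txt
#   af_learn junk SpamFolder.txt folder
```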

In the above setup we're building a dictionary containing single words and groups of two words as tokens. The --prune option discards very infrequent tokens, so that dictionary space isn't wasted on a plethora of useless tokens. In Annoyance Filter, you can choose an arbitrary number of words per token, from single words to groups of, say, 10 words; in practice, the more words per token, the more performance degrades, unfortunately without necessarily giving better results.

I've experimented with tokens ranging in size from one to five words, and found the best compromise between performance and accuracy by using one- and two-word tokens at the same time. If you don't trust my experiments, or just want to have some fun (?!) on your own, try raising the --phrasemin and --phrasemax parameters up one notch (or more), rebuilding the dictionary, and judging for yourself.

When training or classifying e-mail messages you must always pass them to Annoyance Filter along with all of the e-mail headers. Thus, you cannot select the text of an e-mail and cut-and-paste it to a file; if you did, the mail headers and the original encoding would be lost, and they play an important role in the learning and classification processes. The correct way of exporting a message from most graphical e-mail clients is to right-click on it and choose Save As.
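
A quick sanity check can catch a file that was pasted without its headers before it reaches the dictionary. The sketch below is an assumption-laden heuristic, not something Annoyance Filter provides: has_headers is a hypothetical name, and it only checks that the block before the first blank line contains both a From: and a Date: field, which every properly saved message should have.

```shell
#!/bin/sh
# has_headers FILE: succeed if the header block of FILE (everything before
# the first blank line) contains both a From: and a Date: field.
has_headers() {
    awk '
        /^$/                   { exit }       # a blank line ends the header block
        tolower($0) ~ /^from:/ { from = 1 }
        tolower($0) ~ /^date:/ { date = 1 }
        END                    { exit((from && date) ? 0 : 1) }
    ' "$1"
}

# Usage:
#   has_headers sample1.txt || echo "sample1.txt: headers missing, export it again"
```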

If the process of building a dictionary from scratch sounds too time-consuming for you, I've put together a utilities tarball you can download that includes a pre-built dictionary. It is trained on spam only, so you will still need to add training for your own legitimate mail.

Training one by one

Now it's time to start classifying some samples from our e-mail archive: that lets us judge whether the dictionary's level of training is already satisfactory or if it needs some more learning.

Extract a few messages from KMail and try classifying them, one by one, with the command:

af -v --fread /usr/local/bin/FastDict.bin --phrasemin 1 --phrasemax 2 --class sample1.txt

For this trial we're using the fast dictionary, instead of the full-blown one; the fast dictionary is in fact conceived to be just read, and it's optimized for access speed.

The output of the classification command will be something like this (we used the -v verbose flag):

 Loaded fast dictionary from /usr/local/bin/FastDict.bin.
 Phrase minimum length set to 1 word.
 Phrase maximum length set to 2 words.
 Rank  Probability  Token
  1      0.99 your needs
  2      0.99 the clock
  3      0.99 receive future
  4      0.99 plain content-transfer-encoding
  5      0.99 net id
  6      0.01 mon oct
  7      0.99 link below
  8      0.99 gmt x-mailer
  9      0.99 get out
 10      0.99 future offers
 11      0.99 dyndns org
 12      0.99 dyndns
 13      0.99 debt
 14      0.99 d c
 15      0.99 build mime-version
ProbP = 0.0086875, ProbQ = 9.9e-29
Message junk probability: 1
JUNK

We're being shown the 15 most significant tokens found in the message; a probability of 0.99 is the maximum junk probability, while the ranking is determined by the number of times a token has been found inside junk mails in the past, and learned accordingly.
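
The ProbP and ProbQ figures in the listing can be checked by hand. They are consistent with the usual naive Bayes combination: ProbP is the product of the 15 token probabilities, ProbQ is the product of their complements, and the message probability is ProbP / (ProbP + ProbQ). That formula is inferred here from the numbers, not taken from the Annoyance Filter documentation, and combine is a hypothetical helper name.

```shell
#!/bin/sh
# combine: read one token probability per line on standard input and print
# ProbP (product of p), ProbQ (product of 1 - p) and P / (P + Q).
combine() {
    awk '
        BEGIN { p = 1; q = 1 }
        NF    { p *= $1; q *= 1 - $1 }
        END   { printf "ProbP = %.7f, ProbQ = %.1e, junk = %g\n", p, q, p / (p + q) }
    '
}

# The 15 probabilities from the listing: fourteen at 0.99 and one at 0.01.
{
    i=0
    while [ "$i" -lt 14 ]; do
        echo 0.99
        i=$((i + 1))
    done
    echo 0.01
} | combine
```

Run as is, this reproduces the figures of the listing: ProbP = 0.0086875 and ProbQ = 9.9e-29. ProbQ is so tiny compared with ProbP that the ratio rounds to 1.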

In this case the e-mail was recognised as junk, and indeed junk it was; otherwise the final verdict would have been MAIL instead.
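
When you only care about the verdict, the rest of the verbose output can be dropped. A minimal sketch, assuming the verdict (JUNK or MAIL) is always the last line printed, as in the listing above; af_verdict and AF_FAST are made-up names.

```shell
#!/bin/sh
# Fast dictionary location used in this article (AF_FAST is a made-up name).
AF_FAST=${AF_FAST:-/usr/local/bin/FastDict.bin}

# af_verdict FILE: classify FILE with the command from this article and
# print only the last line of its output, which is assumed to be the verdict.
af_verdict() {
    af -v --fread "$AF_FAST" --phrasemin 1 --phrasemax 2 --class "$1" | tail -n 1
}

# Usage:
#   for f in sample*.txt; do
#       echo "$f: $(af_verdict "$f")"
#   done
```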

Imagine that for some reason -- insufficient training? -- the above message got erroneously classified as MAIL; in that case, we would need to (re)train Annoyance Filter with that message, specifically directing it to 'learn it as junk.' This is easily accomplished by entering:

af -v --read /usr/local/bin/Dict.bin --phrasemin 1 --phrasemax 2 --junk sample1.txt --prune
--write /usr/local/bin/Dict.bin --fwrite /usr/local/bin/FastDict.bin

If you need to teach Annoyance Filter a legitimate mail instead, use --mail in lieu of --junk.

Afterwards we should check that the new training was enough by checking the classification once more. If the mail is now correctly showing as JUNK, fine; otherwise, repeat the above step as many times as necessary for achieving correct classification. Sometimes it's necessary to repeat the same training two or three times, especially when the dictionary is already heavily trained, or when the message is a borderline case -- that is, a message that could be either good or junk mail, that you want classified in a different way than the software chooses.

Always remember the TOE principle -- Train Only Errors. If it ain't broken, don't fix it. Moreover, do not overdo training. The more you bias the dictionary toward junk (or good) e-mails by unnecessary learning, the more likely the software will misclassify messages in that direction. It's obvious, if you think of it: if all you've got is a hammer, everything you see resembles a nail.
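
The classify, retrain and check-again cycle can be wrapped in a loop that respects the TOE principle: it trains only while the verdict is wrong, and gives up after a few passes. This is a sketch, not a tool shipped with Annoyance Filter; af_fix, AF_DICT and AF_FAST are hypothetical names, and it assumes the verdict is the last line of the classification output.

```shell
#!/bin/sh
# Dictionary locations used in this article (variable names are made up).
AF_DICT=${AF_DICT:-/usr/local/bin/Dict.bin}
AF_FAST=${AF_FAST:-/usr/local/bin/FastDict.bin}

# af_fix WANT FILE [MAX]
#   WANT  the verdict the message should get: JUNK or MAIL
#   FILE  the saved message, headers included
#   MAX   upper bound on training passes (default 3)
# Returns 0 once FILE classifies as WANT, 1 if MAX passes were not enough.
af_fix() {
    want=$1
    file=$2
    max=${3:-3}
    case $want in
        JUNK) flag=--junk ;;
        MAIL) flag=--mail ;;
        *) echo "usage: af_fix JUNK|MAIL FILE [MAX]" >&2; return 2 ;;
    esac
    pass=0
    while :; do
        got=$(af -v --fread "$AF_FAST" --phrasemin 1 --phrasemax 2 --class "$file" | tail -n 1)
        # Train Only Errors: stop as soon as the verdict is right.
        [ "$got" = "$want" ] && return 0
        [ "$pass" -ge "$max" ] && return 1
        af -v --read "$AF_DICT" --phrasemin 1 --phrasemax 2 "$flag" "$file" --prune \
           --write "$AF_DICT" --fwrite "$AF_FAST" > /dev/null
        pass=$((pass + 1))
    done
}

# Usage:
#   af_fix JUNK sample1.txt || echo "still misclassified after 3 passes"
```

Keeping MAX small is deliberate: a message that still resists after a few passes is probably a borderline case, and hammering it in would bias the dictionary.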

Even advanced users should Read The Fine Manual. It's enough to read pages 5 through 9 to learn about more advanced options like:

--pdiag (possibly in association with --ptrace)
--biasmail and --newword
--sigwords
--threshjunk and --threshmail

Those options should be enough to keep you playing for a lifetime, but they are actually useful for heavily altering the default behavior of Annoyance Filter (which quite honestly fits most of us), and for gathering low-level diagnostics when things don't go the way they're supposed to.

A final note -- remember to back up your Dict.bin dictionary from time to time. Why? Bayesian filters are hell on earth for regression testing. When adding even one token to the dictionary, you could be surprised by the way it affects the general statistics. Sometimes this means your formerly excellent scores can go substantially down the drain.
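
A timestamped copy taken before each training session is enough to roll back a change that hurt your scores. A minimal sketch; backup_dict is a hypothetical helper name, and only the main dictionary is copied, since the fast one can be re-created from it.

```shell
#!/bin/sh
# backup_dict [DICT]: copy the main dictionary next to itself under a
# timestamped name (Dict.bin.YYYYMMDD-HHMMSS) and print the name of the copy.
backup_dict() {
    dict=${1:-/usr/local/bin/Dict.bin}
    copy="$dict.$(date +%Y%m%d-%H%M%S)"
    # -p keeps the original modification time and permissions on the copy.
    cp -p "$dict" "$copy" && echo "$copy"
}

# Usage:
#   backup_dict                      # backs up /usr/local/bin/Dict.bin
#   backup_dict ~/.annoyance-filter/Dict.bin
```

To roll back, copy the saved file over Dict.bin and rebuild FastDict.bin from it with the --read and --fwrite command shown earlier.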

You can also export the dictionary to a comma-delimited (CSV) text file, for fiddling around with the data -- not recommended -- or for merging the data with other dictionaries from (trusted!) friends and colleagues. Use the commands:

Export CSV data: af -v --read /usr/local/bin/Dict.bin --csvwrite DictList.csv
Import CSV data: af -v --csvread OtherDict.csv --write NewDict.bin
Import and Merge CSV data: af -v --read /usr/local/bin/Dict.bin --csvread OtherDict.csv --write NewDict.bin

Once you have Annoyance Filter working to your satisfaction you'll want to integrate it with your e-mail client. We'll talk about integrating Annoyance Filter with KMail in the final part of this series.

Corrado Cau has worked in the IT field for 15 years and spent most of his career as a system and network administrator on many platforms.


Comments


Too much hassle!

Posted by: Anonymous Coward on October 29, 2003 11:15 PM
In the comments section of your previous article, you dismissed script-based (Perl, Python, etc.) solutions mostly because of performance issues compared with compiled C/C++ code.

However, it seems too much hassle to manage the dictionary for the junk filter with Annoyance Filter. Almost every other anti-SPAM GPL solution out there offers some way to analyze incoming e-mails by itself and in some cases, like POPFile, offers a nice and intuitive interface so the user can deal with it.

Even the procmail rules for SPAMAssassin (which can work at the MTA and/or the MUA level) don't seem so hard (perhaps "involved" would be a better word here) to manage in comparison with this.

While I'll concede that performance issues should be taken into account, today's machines are powerful enough to run those script-based anti-SPAM solutions without affecting the responsiveness of the other applications and the OS overall.

I do think that such software should not be too intrusive. Better yet, the user shouldn't be too involved with the process of managing the classification rules. These things count a lot in my book when I have to manage the rules on a timely basis.

#

Re:Too much hassle!

Posted by: Corrado on October 30, 2003 12:15 AM
Agree.

My personal favourite is the way Mozilla Thunderbird works: you just have a toggle button in the user interface, for learning new messages and/or amending the classification.

Unfortunately the integrated Bayesian filter isn't very good, in my experience.

All I can say is:

- I've supplied a bunch of utilities together with the article (a link to it will be out any moment).
With those utils, the hassle decreases by 80%.

- I've already had some success in integrating my scripts with the user interface of Sylpheed-Claws; unfortunately KMail doesn't offer (yet) such a flexibility.

- Once the dictionary is trained, you only need to update it from time to time. In my case, maybe once or twice a month.

All in all, don't forget that Annoyance Filter at the moment is just the engine: as soon as a brave soul develops (or adapts) a user-friendly interface a la POPfile, that's it.

Corrado

#

Why do this?

Posted by: Anonymous Coward on October 31, 2003 02:45 AM
Why should I pull down and build this source? I might do it if I could get say 2x or 3x filter speed with kmail, but I don't see the claim that this filter is going to run faster on my old laptop than spamassassin.

The author does a fine job on documenting the work with this code. I am looking forward to the next part about adding kmail integration.

If this code matures and loses some of the hassle, as spamassassin (and other filters) have already done, and runs faster than the current well known filters, it would sure make life nicer for users dealing with older hardware like this old laptop I'm using right now! :)

Thanks!

Wishing you well.

#

Linux Desktop??

Posted by: Anonymous Coward on October 31, 2003 02:46 AM
Guess this is yet another example of why Linux is having a tough time getting on a lot of desktops. Annoyance-filter is WAY TOO MUCH trouble to get going, even for an above-average user! I would suggest Mozilla Mail or Mozilla Thunderbird. Neither of these mail apps requires this inordinate amount of configuration and spam recognition training.

#

You're on crack right?

Posted by: Anonymous Coward on November 02, 2003 04:42 AM
Is this article a joke? I've been a commandline junkie for over ten years and I'd *NEVER* go to all of that trouble to manage spam. I think you need to take a look at dspam. All you have to do is bounce your message to a spamtrap address and the rest of the work is done for you.

#
