
Statistical Spam Filter Review

Since October 2002 I have used a probability-based statistical spam filter script written in ksh and awk, inspired by Paul Graham. It is a personal email filter I wrote for Unix/Linux-type hosts, not a mail server filter. The link above includes the script and my explanation of how it works. The spam filter is not Bayesian, but it seems to have results similar to other probability-based spam filters I have seen reviewed, along with similar strengths and weaknesses.

A statistical approach by itself may misclassify some good mail; for my filter, so far, that has been HTML-encoded mail. Combined with a "learning" whitelist of good addresses, however, the problem is limited to good email whose sender is unknown to the whitelist, which for my mail mix is a very small subset. The whitelist consists of addresses that I generate from my body of "good" email when I train the spam filter.
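
A minimal awk sketch of the idea (not my actual script; the file layout and the address pattern here are simplified assumptions) is just to harvest sender addresses out of the good-mail corpus:

    # whitelist harvest sketch: pull sender addresses out of the
    # good-mail files (run as: nawk -f whitelist.awk goodmail/*).
    # The address pattern is a simplification of real From: headers.
    /^From:/ {
        if (match($0, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/)) {
            addr = tolower(substr($0, RSTART, RLENGTH))
            if (!(addr in seen)) {  # print each address only once
                seen[addr] = 1
                print addr
            }
        }
    }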

My whitelist approach has its own risks. When the whitelist passes an email on, it is "trusted" and no further filtering happens. That could pass a worm or virus sent from the compromised host of a whitelisted address, or mail with a forged whitelisted address. I have tried to remove the most commonly forged addresses, such as my own, and while I have noticed a few virus attacks of this kind, they have been very rare. The whitelist also improves the overall performance of the filter, since it is several orders of magnitude faster than the rest of the filtering script. The filter runs the whitelist first, then the probability filter. With the current set of training email it takes about 0.8 seconds to filter an email using NAWK on an old Sun Ultra-80 when the address is not recognized by the whitelist; a 750 MHz AMD Linux box running GAWK is much faster.
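
The run order looks roughly like this sketch (the "whitelist.txt" file name and the exit-code convention are assumptions, not my real script): awk loads the small whitelist, and only unknown senders fall through to the expensive probability pass.

    # run-order sketch: whitelist check first, probability filter
    # only for unknown senders. "whitelist.txt" (name assumed) is
    # one lowercase address per line.
    BEGIN {
        while ((getline a < "whitelist.txt") > 0)
            white[a] = 1
        close("whitelist.txt")
    }
    /^From:/ && match($0, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/) {
        if (tolower(substr($0, RSTART, RLENGTH)) in white)
            found = 1
    }
    # 0 = trusted, deliver as-is; 1 = run the slow probability pass.
    # A wrapping ksh script would branch on $? here.
    END { exit found ? 0 : 1 }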

In my filter script most of the computing time is spent loading an associative array with tens of thousands of tokens and their spaminess values, and this happens for each email received. The bigger this associative array, the slower the filter; it is currently at 40,000+ tokens. Lookups in an associative array are very fast, and the average number of tokens looked up per email does not change much; it is the loading of the array into memory that is the performance problem. Changing the language and using DBM-type files instead of building a large associative array in memory might speed things up, since GAWK has no DBM interface. Running the filter as a daemon somewhere in the processing of mail would also limit the cost to starting the daemon and building the token-value associative array once, but that requires a redesign and possibly privileges a regular user would not have.
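
The hot spot is a BEGIN loop like the sketch below (the "tokens.db" file name and the "token value" line format are assumptions), re-run for every single message:

    # startup-cost sketch: rebuild the whole token table in memory
    # before a single message can be scored.
    BEGIN {
        while ((getline line < "tokens.db") > 0) {
            split(line, f, " ")
            prob[f[1]] = f[2]   # 40,000+ assignments per email
        }
        close("tokens.db")
    }
    # later lookups like (t in prob) are cheap; the loop above
    # dominates the ~0.8s per message on the Ultra-80.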

To keep performance and disk space stable I limit the training set of emails. The current mix has 1,700 spam emails and 1,600 good emails. Every so often I prune back the spam mail, and more rarely the good mail, so the training set does not reflect my real-world 80-90% spam rate but is closer to a 50-50 mix, which I am sure biases my good-token spam scores. I have not tested what this does to the filter's effectiveness, and I may look at it later. 7-10 MB is all the space I want to devote to the filter and its files of email and spam.

A test on one thousand emails not seen during training caught 93+% of the spam; this filter works well enough for me. These results are low compared to some other probability-based filters, but higher than most rule-based filters. I do not claim to be the best, just OK. The test script is bundled with the filter script referenced above. Real-world results may be better (or worse), since the 1,000 emails were run through as a batch: the filter was not retrained as they were processed, so no additional "learning" happened, whereas I generally train the filter every few days and the 1,000 emails were about a month's worth of mail. In the test set about 800 emails were spam and about 200 were good; no good mail was misclassified in this test, but that was just the luck of the draw in the sample.

LATER NOTE: I did look at the ratio of spam to good email in my training set, and I found much better results when I changed it to a ratio closer to the actual mix of incoming mail. I was able to cut my training set to about 1,000 spam emails and 200-300 good emails, and it has been a month since a spam got through.

The incidents of misclassifying good email as spam over the last year or so have all been the same type: HTML-encoded email. They are rare, about 1 in 1,000 good emails, but painful. One example was an HTML-encoded rebate email from Dell. What a pain: the HTML encoding gave it a super-high spam rating, while the same message decoded to ASCII text had a very low spam rating.

The spam email that does get through the filter mostly falls into the following categories:

- Base64-encoded MIME.

I do not decode these, and there are very few tokens the filter finds "interesting" in the header. The long tokens in the base64 body are not common enough to be recognized by the filter. Decoding and then filtering the email would remove this problem; however, this type of spam has not been very common in my mail over the last few months.

- Nigerian scams and other very long-winded spams. These seem the most common to escape the filter, though fewer passed as the training set of email grew larger. These spams use lots of tokens, burying the spam under a blizzard of words, some of which counter the weight of the spam words. This may be an effect of the way I calculate "spaminess" as an average of token probabilities, not a typical Bayes-type calculation; see the sketch after this list, and my link above to the script for the actual method.

- HTML comment-loaded text and fake HTML tags. These bust the spammy tokens into pieces and load the filter up with a lot of tokens. Sometimes they have a base64-like effect, creating long tokens that are not common enough to be recognized by the filter. These have become much less common lately in my mail mix.
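
Here is a minimal sketch of the averaging effect (the 0.9 cutoff and the neutral 0.5 default are illustrative assumptions; see the linked script for the real calculation):

    # scoring sketch: spaminess as a plain mean of known-token
    # probabilities, so a blizzard of neutral words pulls a few
    # very spammy tokens back toward the middle.
    {
        for (i = 1; i <= NF; i++) {
            t = tolower($i)
            if (t in prob) {    # prob[] loaded as in the sketch above
                sum += prob[t]
                cnt++
            }
        }
    }
    END {
        score = (cnt > 0) ? sum / cnt : 0.5   # no known tokens: neutral
        verdict = (score > 0.9) ? "SPAM" : "GOOD"
        print verdict, score
    }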

Further work:

Decoding encoded emails (base64 MIME and HTML) would catch many of the problems this filter has with MIME- and HTML-encoded spam, including the misclassification of good email. The Nigerian-scam type of spam would still be a problem; it is probably an effect of the method I use to calculate my spam values (see my link above for the lame method I use).
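
A sketch of the decode-first idea (it needs GAWK's |& coprocess, which NAWK lacks; "openssl base64 -d" as the decoder and the crude body detection are assumptions, and real MIME boundary parsing is glossed over):

    # decode-first sketch: collect the base64 body, run it through
    # an external decoder, and tokenize the decoded text instead of
    # the opaque base64 runs. GAWK-only (|& coprocess).
    /^Content-Transfer-Encoding: base64/ { inb64 = 1; next }
    inb64 && /^[A-Za-z0-9+\/=]+$/ { b64 = b64 $0 "\n" }
    END {
        if (b64 != "") {
            cmd = "openssl base64 -d"
            printf "%s", b64 |& cmd
            close(cmd, "to")            # done writing, now read back
            while ((cmd |& getline line) > 0)
                print line              # decoded text, ready to score
            close(cmd)
        }
    }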

Multiple-token classification is another way to improve the filter's accuracy. My current filter looks at single tokens, not token-stream subsets or token proximity. Token subsets such as phrases are already analyzed in probability-based classification systems; CRM114 uses phrase-analysis techniques and has links explaining some of these methods in greater detail.
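
For example, a bigram pass in awk would emit overlapping word pairs as features (a sketch of the idea only, not CRM114's method):

    # bigram sketch: overlapping word pairs as features, so adjacent
    # words become a single token and proximity starts to matter.
    {
        for (i = 1; i < NF; i++)
            count[tolower($i) " " tolower($(i+1))]++
    }
    END {
        for (p in count)
            print count[p], p   # pair counts, ready for training
    }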

Would I use unfiltered mail again? No. Even my unsophisticated, crude statistical filter script works better than unfiltered email or rule-based schemes. My results come from the characteristics of my filter and the mail I receive, but they are not far out of line with other statistical filters I have seen reviewed.