Spam filter - Everything2.com

(thing)

by Belgarion

Wed Nov 15 2000 at 12:43:40

So far, I could come up with only a few filter rules for spam mail:

Contains the words "reply" and "to unsubscribe"
I am not in the TO: or CC:
Contains the line "this is not spam"
Contains "!!!" or "$$$" in the subject line

Additions are welcome of course, I need new ideas.

I like it!

(thing)

by yerricde

Wed Nov 15 2000 at 13:10:36

Grep your inbox for this case-insensitive regular expression, provided to the public by an anonymous coward on Slashdot:

remove\.html|for full details|huge savings|act now| mlm[^a-z]|works!|here.s how:|here.s how [^I]|remove[0-9-]*@|order now|now revealed|chain letter|100% guaranteed|money back|like to order|easy to order| to order.*free|age.* to order|year.* to order|for detailed information|free copy|special offer|cost\?|To be removed|absolutely free|no risk involved|absolutely no risk|with no risk|entrepreneur|added bonus|extra income|don.t delay|[^e]send check|money-order|dear sir|1-[89]00|Visa|under 18[,. ]

Move every message that matches it to a folder called bulk_mail or something. This should make your daily trip to SpamCop.net a bit faster.

I like it!

(thing)

by mblase

Mon Apr 16 2001 at 20:33:49

I use a webmail account to handle all of my personal e-mail, and over the years the address has gotten around to quite a few spam lists. Fortunately, my account has filters included for sorting e-mail into folders; unfortunately, it's not very sophisticated--I can only filter on "To:", "Cc:", "Subject:" headers, plus the message body, and I have a short limit on the maximum number of filters.

With practice and experience, however, I've found that a few tricks can trap most of the junk e-mail I get on a daily basis.

Filter headers for your own e-mail address. If neither the "To:" nor the "Cc:" field contains my e-mail address, I send the message to the trash. Many "cheap" spam senders put large numbers of addresses in the "Bcc:" field to distribute as quickly as widely as possible. I do subscribe to a handful of e-mail newsletters that use Bcc: as well, but by monitoring my trashed e-mails for a couple of weeks I was able to identify all of these and re-filter them into my inbox.

Unsubscribe actively. Of the e-mails that still get through, many of them are sent by "legitimate" mass e-mail senders. Not all, not even most, but many. When these e-mails have a Web page I can use to unsubscribe from their mailings, I do so. When they only provide an e-mail address to unsubscribe, however, this is invariably either fake or set up to confirm your e-mail address as a valid recipient; when the "unsubscribe" e-mail address is at another webmail provider (yahoo.com or hotmail.com, typically), it's always fake and just there to fool me into thinking it's a "legitimate" mass mailing. But in general, actively unsubscribing from unwanted spam has decreased my daily flow. (Incidentally, this works against telemarketers just as well.)

Filter non-English (or non-your-language-here) characters. Those two tricks worked so well, I didn't need to change it for years--until somehow I got on a few dozen Chinese, Korean and Japanese spam lists. My email is in the "To:" header, but I can't read the message to find the "unsubscribe" link (if there is one). Then I noticed something: while the body of the message was rendered into non-English characters by my browser, the "Subject:" field was not. I looked at a couple day's worth of Asian spam, noticed that the 'Á' and 'À' characters were present in nearly all of them. I added a filter to send all messages with those characters in any header to the trash, and voila--immediate 95% reduction in non-English spam.

None of these tricks has ever successfully eliminated all of my daily spam, but I'm surprised every time I look in my Trash folder at how much it does catch--in a typical day, I might see five e-mails trapped by the system's built-in spam blockers, three get through to my inbox, and fifteen or twenty sent directly to the Trash. Not as good as SpamCop, perhaps, but I'm generally pleased.

As a final note, it's generally agreed that Hotmail and AOL have poor to useless spam filters set up on their own servers, while Yahoo! Mail is much more effective. Still, Yahoo! can't catch everything, and I use the above filters to help trap the junk where Yahoo! can't.

I like it!

(idea)

by elem_125

Wed Mar 12 2003 at 1:41:57

Using Bayesian Filters to catch spam

The contents of this write-up is based largely on the work of Paul Graham who is one of the pioneers of the idea of using Bayesian filters to separate spam from non-spam e-mails. The two essays of his that I am basing most of this on are these:

"A Plan for Spam" - http://www.paulgraham.com/spam.html
"Better Bayesian Filtering" - http://paulgraham.com/better.html

Note however that I am not just re-iterating the above articles but, hopefully, helping explain them more fully and in a way that is more easily accessible (for example the articles use code examples in LISP, a language clearly not made for the mortal mind) as well as adding my own commentary about spam filtering and the issues surrounding spam.

To start off we first should define what we chose to call spam. I am going to choose to call as spam any e-mails that I am sent, that have been sent automaticly, that I do not want. I make the distinction of it having been sent automaticly since I get e-mail from individual people that I do not want, yet it is not spam. This definition also, potentially, includes e-mails from so-called 'opt-in' lists. In many cases I am now subscribed to those lists simply because I forgot to check or uncheck a box, not because I actually wanted those e-mails. So I'd rather that they disappeared from my sight also. I know that this does not follow a technical or (potential) legal definition of spam, but I feel confident that it follows a fairly good 'practical definition' of spam.

Currently our most effective anti-spam solutions are based on filters that recognize features within the e-mail. One of the most well known of these filtering programs is SpamAssassin. Basicly these programs work by looking for certain known strings within an e-mail. For example this is the report that came with a recent piece of spam that I received:

SPAM: -------------------- Start SpamAssassin results ------------------ 
SPAM: The WELL is running spam tagging software called SpamAssassin. 
SPAM: This software has flagged this message as probable spam. For more 
SPAM: information on SpamAssassin on The WELL, and how to configure it 
SPAM: to work for you, please go to http://www.well.com/help/spam/ 
SPAM: 
SPAM: Content analysis details: (8.50 hits, 6 required) 
SPAM: CUM_SHOT (2.3 points) BODY: Possible porn - Cum Shot 
SPAM: BEST_PORN (0.5 points) BODY: Possible porn - Best, Largest Porn Collections 
SPAM: EXTRA_CASH (0.4 points) BODY: Offers Extra Cash 
SPAM: HOT_NASTY (0.4 points) BODY: Possible porn - Hot, Nasty, Wild, Young 
SPAM: CLICK_BELOW (0.3 points) BODY: Asks you to click below 
SPAM: SPAM_PHRASE_03_05 (1.1 points) BODY: Spam phrases score is 03 to 05 (medium)
SPAM: score: 3
SPAM: BIG_FONT (0.3 points) BODY: FONT Size +2 and up or 3 and up 
SPAM: HTML_FONT_COLOR_RED (0.3 points) BODY: HTML font color is red 
SPAM: LINK_TO_NO_SCHEME (1.3 points) BODY: Contains link without http:// prefix 
SPAM: CLICK_HERE_LINK (0.3 points) BODY: Tells you to click on a URL 
SPAM: MSG_ID_ADDED_BY_MTA_3 (0.9 points) 'Message-Id' was added by a relay (3) 
SPAM: CTYPE_JUST_HTML (0.4 points) HTML-only mail, with no text version 
SPAM: 
SPAM: 
SPAM: Please remove this report if reporting this message to SpamCop. 
SPAM: -------------------- End of SpamAssassin results -----------------

The e-mail in question was trying to sell me hard-core porn videos.

So we can see that SpamAssassin is finding certain strings in the text of the e-mail or within the HTML of the the e-mail. It then assigns a point value to each of the phrases that has triggered it and then marked it as spam since the the total point value is greater than a threshold that I have specified.

But there is a large problem with this. What do the points really mean? What causes "Cum Shot" to get a score that is six times larger than "Hot, Nasty" does? The problem is that no one really knows what these numbers mean, not even the developers who have assigned the values. This e-mail got a score of 8.5 and so it failed the spam test and got marked as such. But lets compare this to the scores of two other e-mails. Firstly the Everything Daily Report. The EDR gets a spam score of 1.4, well below my threshold of 6. Apparently it failed the following tests: "NO_REAL_NAME, SPAM_PHRASE_00_01, SUBJECT_FREQ". Secondly, a hand-typed e-mail from my girlfriend: this e-mail gets a score of 3.6, in turn it failed the following tests: "FROM_ENDS_IN_NUMS, MIME_LONG_LINE_QP, NO_REAL_NAME, SPAM_PHRASE_03_05".

Hold on... lets look at that again... EDR - 1.4, personally written e-mail from girlfriend - 3.6??? How does a hand written e-mail get a score over 2.5 times higher than an automated, strictly formated e-mail sent by E2? This indicates a major weakness in automated filtering.

The other two major problems with programs like SpamAssassin are false positives and false negatives. Of the two false positives are a far greater problem. If you have effective spam filters then you are unlikely to check your spam for any real e-mails that may have fallen through the cracks. But in the real world that could have been an important e-mail that you just missed. So it is vital that any filter doesn't have any false positives. So far with SpamAssassin I've had about 5 or 10 false positives in the last year and several thousand e-mails. So its not doing too badly there. False negatives are less important, but still annoying. Currently for every 10 pieces of spam that I get about 2 will slip through the filters and end up in my Inbox. This is not an acceptable percentage.

There is also one problem that applies to all 'Feature Recognizing' filters. Once you know what the filters are looking for then its very easy to get e-mail past them. The filters are not dynamic and don't learn. So you just end up with an endless cycle of increasingly intelligent spam against more and more broad-sweeping filters. The situation is fairly similar to that of virus writers against anti-virus software. Ultimately the anti-spam filters will always be lagging behind the spam.

So whats the solution to this problem?

Well its looking as if statistical filtering could provide the ideal solution to these problems. Basicly statistical filtering works by assigning to each e-mail the probability that it is spam. It works by analyzing the words in the e-mail and looking up the probability of each word being in a spam e-mail. We then use Bayes's Rule to work out the overall probability of the e-mail being spam.

So let us outline the system that Graham gives in his two articles. Firstly we need a large corpus of spam and of non-spam. We then feed these into a program that counts the number of instances of every word (where a word can also be a number such as "$9.99" or an IP address, or a URL and also e-mail headers are included in the data that is parsed) and stores them in to two tables, one for spam and one for non-spam. We then generate a third table, giving the probability that each word appears in a spam e-mail. To generate the third table we use the following formula (translated from LISP, all errors are mine). Note that it is more 'valuable' for a word to appear in the non-spam corpus than the spam corpus, so we double the number of occurrences of any word in the non-spam corpus.

let nbad equal the number of spam e-mails in the corpus
let ngood equal the number of non-spam e-mails in the corpus

for all words w in {list_of_spam_words, list_of_non-spam_words}
    let b equal the number of occurrences of w in spam e-mails
    let g equal the number of occurrences of w in non-spam e-mails

                                 (  b  )
                                 (-----)
                                 ( nbad)
    let p_is_spam equal  -----------------------
                          ( 2g  )       (  b  )
                          (-----)   +   (-----)
                          (ngood)       ( nbad)    
next w

Note that the maximum value of b/nbad and g/ngood is 1. Also note that: 0.001 < p_is_spam < 0.999 or in other words, p_is_spamcan be at most 0.999 and at least 0.001. The reason for adding more weight to non-spam words is to reduce the instances of false positives. We can easily see that if a words tends to regularly appear in both your spam and non-spam e-mails (if, for example your SO enjoys e-mailing you kinky stories) then we want to make sure that an e-mail with that word in it still has a good chance of being classed as non-spam and just just thrown straight into the spam box.

So now that we have our nice table of probabilities how do we use it?

Simple, for every e-mail that we now get we run it through a simple program. We first take the 15 words that are furthest away from a neutral 0.5 chance of appearing in spam, so for example 0.2 is just as far away as 0.8 - basicly we take the 15 highest values for |p_is_spam(w) - 0.5| - where p_is_spam is just looked up from the table we generated earlier. We then apply Bayes's Rule to these numbers to come up with a probability that the entire e-mail is spam. Since we have taken 15 words, we will label the probability of each word being spam from a to o. So we now apply the following formula:


             abc...o
----------------------------------- = probability that the e-mail is spam
 abc...o + (1-a)(1-b)(1-c)...(1-o)

If this final value is over 0.9 then we categorize the e-mail as being spam.

There one problem remaining though. How do we classify words that we haven't come across before? Since most spam tends towards the same vocabulary Graham suggests, in his articles, using a value 0.4. This value is sufficiently neutral to not skew the filter too much, yet it takes advantage of the regular vocabulary in spam to give the word 'benefit of the doubt' and push it slightly towards the non-spam side.

One of the most important parts of the process is in creating the original two word lists with your spam and non-spam corpi. It is very important, especially for the non-spam corpus, to use your own e-mails. This is so that words that commonly appear in your own e-mails can be picked up and correctly classified. For example someone who does a lot of coding will most likely have words like "C" or "char" appear often in your e-mail. So you want to make sure that they will be good indicators of non-spam.

How effective is this all?

The only numbers I have to go on for the effectiveness of this technique are the results that Graham has published in his articles. With his first filters (Graham altered his treatment of certain words afterwards) he claims 99.5% of spam filtered out, with 0.03% false positives, although all the false positives were 'spamesque'. I hope to be able to verifiy these numbers soon by writing my own filtering software

Graham notes that once he altered his parsing routines to be more discriminating then the performance of the filters goes up (although the performance difference from 99.5% to 99.75% seems fairly negligible to me). The improvements he made were to make the the parsing routines more fine-grained. For example to differentiate between a word appearing in the body of an e-mail and the subject line of an e-mail. He also allowed for unknown words to degrade untill they match a know word. This means that a word such as "Sexy!!!!!!" could degrade down to a known word such as "Sexy!".

So far this form of probabilistic filtering is showing great promise. Yet there are still improvements that can be made. The main problem is e-mails that are written in a non-spam style. Any e-mail that is written in a more natural style (with out huge HTML tags, no exclamation marks etc.) is far more likely to get through this filters. Graham calls these the "spam of the future". An example of this sort of e-mail is:

From: 	 Cindy 
To: 	 #####@well.com
Subject: why havn't you been online lately?
Date: 	 Sun, 9 Mar 2003 22:26:02 GMT	

Hi Gorgeous.
Here are the pics I promised
http://www.Cindy.com/breasts

If my father could see me now he would kill me :)
Anyway's I hope the pics turn you on :) 

Here are some of my home movies
http://www.Cindy.com/dirtymovies

Yours Truly
Cindy Stellick

Although it is easy to see by sight that this is spam, the style in which it is written is quite natural and uses words that are more likely to appear in our non-spam corpus of e-mails.

Thankfully these sort of e-mails are harder for spammers to write, and will become increasingly hard to write as statistical filtering takes off. For them to succeed they need to have enough words from our non-spam corpus to offset any 'spam words' that might be included as well. This means that the e-mails have to be more personalized towards each person that they are sent to.

To conclude, I think that this sort of mail filtering will be the start of a new war on spam, one where we will be a step ahead for the first time. If enough people use effective filtering software, or if the ISP's start to use it then the response rate to spam will hopefully drop so low that it will no longer be of economic benefit for spamers to hawk their wares in this way.

At last we shall be free...

I like it!

3 C!s

SpamAssassin	procmail recipe to drop mail that's not for you	SpamCop	avoiding spam email
procmail recipe to drop (forged) Hotmail spam	Distributed Checksum Clearinghouse	W32/SirCam@MM	grep
Bayesian Network	disemvoweling	aladdin pants	GMail contact list security hole
Spamusement	SPEWS	mail headers	twit filter
M2	Bayesian	Complicated	KMail
DUL	Scunthorpe