The Evolution Spam Filter

From Computer Tyme Support Wiki

(Difference between revisions)
Jump to: navigation, search
(The New Concept)
(The New Concept)
Line 37: Line 37:
the other filters are all about matching. The Evolution Filter is about NOT MATCHING. It's all about what the other side never says. While other filters match what's inside the box, the Evolution Filter matches what's outside the box. Instead of matching finite sets of known information, we are matching to the infinite set of unknown information. And we all know that an infinite set is always bigger than a finite set. How much bigger? Infinitely bigger! That's why it works so well.
the other filters are all about matching. The Evolution Filter is about NOT MATCHING. It's all about what the other side never says. While other filters match what's inside the box, the Evolution Filter matches what's outside the box. Instead of matching finite sets of known information, we are matching to the infinite set of unknown information. And we all know that an infinite set is always bigger than a finite set. How much bigger? Infinitely bigger! That's why it works so well.
 +
 +
=== Example of how NOT matching works ===
 +
 +
Let’s take 2 subject lines and see how this works.
 +
 +
“Meet hot Russian Brides Online!”
 +
“I read an article about Russian Brides in a magazine”
 +
 +
A traditional spam filter using Bayesian or hard coded rules about “Russian Brides” might determine that only 1 out of 500 emails mentioning the phrase “Russian Brides” is a good email. Thus the second line would have points assessed against it in the classification process using these traditional methods.
 +
 +
Using the Evolution Filter the phrase “Russian Brides” is in both sets and therefore has no influence on the results. But the first subject matches these phrases in the Spam Only set.
 +
 +
“Meet hot”
 +
“Meet hot Russian”
 +
“Meet hot Russian Brides”
 +
“hot Russian Brides Online!”
 +
“Russian Brides Online!”
 +
“Brides Online!”
 +
“Online!”
 +
 +
The second subject matches these phrases on the ham only set that are never used on the spam set.
 +
 +
“I read an article”
 +
“read an article”
 +
“read an article about”
 +
“about Russian”
 +
“an article about”
 +
“in a magazine”
 +
“Brides in a”
 +
 +
So even though the phrase “Russian Brides” has no influence each subject hits either ham or spam many times where the same phrase was never used in the subject line in the opposite set. And the number of hits is significant enough just from these subjects to cause the fingerprints to be learned, and that’s just looking at the Subject attribute. When this is combined with testing all attributes the messages usually come out strongly on one side or the other.
 +
 +
In rule based systems one would not normally build a white list rule to to allocate points based on seeing the phrase “read an article about”. That’s where the Evolution Filter is different. It didn’t need to have that rule because since it is comparing to the infinite set of what is not matched on the other side, it dynamically create billions of rules automatically.

Revision as of 21:51, 28 January 2016

Contents

The Evolution Filter

The last big advancement in spam filtering was done by Paul Graham - A Plan for Spam. He was the first to apply Bayesian Filtering to blocking email back in 2002. Since then not a lot has happened to make spam filtering significantly better, till now. The Evolution Filter is a new plan for spam.

Overview of How it Works

Have you ever looked at a list of email messages and get the feeling that you can classify 70% of them just seeing the sender name and the subject line? Have you ever wondered how it is so easy to recognize ham and spam yet computer can't seem to figure out the obvious. When you look at the list - how is it that you can tell? What's going through your mind?

The way you recognize spam and ham is that when you see a subject line that looks similar to good email and dissimilar to spam then it's good email. And if it looks similar to spam because it says stuff that good email never says, it's spam.

For example, the subject line is "let's get some dinner". We know the message is good because spammers never say that. So if the subject is something you've seen before in good email and something that spammers never say - it's good email.

In fact - that's how the Evolution filter works. If, for example, the subject has words and phrases that match good email but spammers never say then the message is good. If the subject has words and phrases that match messages that spammers have used, but never seen in good email, then it's spam.

But, you might ask, where do I get a list of words and phrases that spammers never say? It's easier than you think. What I do is create a set of every word and phrase spammers do say and test to see if it's NOT on the list. In other words, I store all the words and phrases that are said in ham, and all the words and phrases that are said in spam. If the test message matches ham and doesn't match spam, it's ham. And if it matches spam and doesn't match ham, it's spam.

So - what do I mean by words and phrases? I take the subject and I break it down.

"the quick brown fox jumps over the lazy dog"

becomes

"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox" "brown fox" "quick brown fox" 
"the quick brown    fox" "jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps" "over" "jumps over"
"fox jumps over" "brown fox jumps over" "the" "over the" "jumps over the" "fox jumps over the"
"lazy" "the lazy" "over the lazy" "jumps over the lazy" "dog" "lazy dog" "the lazy dog" "over the lazy dog"

I'm using a database called Redis. Redis lives in ram and it's very fast. It is extremely good at set comparisons. So I can create a spam corpus and a ham corpus that contain million of words and phrases and I can break the subject down into hundreds of fingerprints and compare it to both sets and see what matches and not matches to get a result. The furmula is as follows.

card(Test_message intersect Spam diff Ham) - card(Test_message intersect Ham diff Spam)

This is a simplification of the concept. In real life I'm not just testing the subject. I'm testing the name part of the from address, the attachment file names and extensions, php scripts, parts of the message body, the header structure, and behavior of the sender. All these produce scores that are combined into a result that is far more accurate than any method of sorting email ever developed.

The New Concept

The way the Evolution Spam Filter works is literally thinking outside the box. Every other filter is about matching something. One box has ham, one box has spam. Bayesian filters match the message with the two boxes to see which one is most similar. Or the filter is matching rules. Does it contain "Russian Brides"? If so - add penalty points.

the other filters are all about matching. The Evolution Filter is about NOT MATCHING. It's all about what the other side never says. While other filters match what's inside the box, the Evolution Filter matches what's outside the box. Instead of matching finite sets of known information, we are matching to the infinite set of unknown information. And we all know that an infinite set is always bigger than a finite set. How much bigger? Infinitely bigger! That's why it works so well.

Example of how NOT matching works

Let’s take 2 subject lines and see how this works.

“Meet hot Russian Brides Online!”
“I read an article about Russian Brides in a magazine”

A traditional spam filter using Bayesian or hard coded rules about “Russian Brides” might determine that only 1 out of 500 emails mentioning the phrase “Russian Brides” is a good email. Thus the second line would have points assessed against it in the classification process using these traditional methods.

Using the Evolution Filter the phrase “Russian Brides” is in both sets and therefore has no influence on the results. But the first subject matches these phrases in the Spam Only set.

“Meet hot”
“Meet hot Russian”
“Meet hot Russian Brides”
“hot Russian Brides Online!”
“Russian Brides Online!”
“Brides Online!”
“Online!”

The second subject matches these phrases on the ham only set that are never used on the spam set.

“I read an article”
“read an article”
“read an article about”
“about Russian”
“an article about”
“in a magazine”
“Brides in a”

So even though the phrase “Russian Brides” has no influence each subject hits either ham or spam many times where the same phrase was never used in the subject line in the opposite set. And the number of hits is significant enough just from these subjects to cause the fingerprints to be learned, and that’s just looking at the Subject attribute. When this is combined with testing all attributes the messages usually come out strongly on one side or the other.

In rule based systems one would not normally build a white list rule to to allocate points based on seeing the phrase “read an article about”. That’s where the Evolution Filter is different. It didn’t need to have that rule because since it is comparing to the infinite set of what is not matched on the other side, it dynamically create billions of rules automatically.

Personal tools