The Evolution Spam Filter

From Computer Tyme Support Wiki


Latest revision as of 23:17, 28 August 2016

NO Patent

Although this was originally written up as a patent application, I have decided NOT to pursue the patent. I am releasing this under the GPL2 license. So - go forth and steal this idea. Use it to fight spam. But if you improve it you have to share it. If I'm giving it away - you're giving it away.

The Evolution Filter

The last big advancement in spam filtering came from Paul Graham's A Plan for Spam (http://www.paulgraham.com/spam.html). He popularized applying Bayesian filtering (https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering) to blocking spam back in 2002. Since then not a lot has happened to make spam filtering significantly better, till now. The Evolution Filter is a new plan for spam.

I chose the name "Evolution Filter" because it's a very simple and fast recursive learner that mimics evolution in software. While most neural networks are complex AI, this one is dirt simple but extremely fast. It seems that simple and fast is the key. In 6 months it might become self aware. :)

The Evolution Filter is a Trademark of Junk Email Filter (http://www.junkemailfilter.com).

Overview of How it Works

Have you ever looked at a list of email messages and gotten the feeling that you can classify 70% of them just by seeing the sender name and the subject line? Have you ever wondered how it is so easy to recognize ham and spam, yet computers can't seem to figure out the obvious? When you look at the list - how is it that you can tell? What's going through your mind?

The way you recognize spam and ham is that when you see a subject line that looks similar to good email and dissimilar to spam then it's good email. And if it looks similar to spam because it says stuff that good email never says, it's spam.

For example, the subject line is "let's get some dinner". We know the message is good because spammers never say that. So if the subject is something you've seen before in good email and something that spammers never say - it's good email.

If the Subject matches Ham and does not match Spam - it's Ham.
If the Subject matches Spam and doesn't match Ham, it's Spam.

In fact - that's how the Evolution Filter works. If, for example, the subject has words and phrases that match good email but that spammers never say, then the message is good. If the subject has words and phrases that match messages spammers have sent, but that are never seen in good email, then it's spam.

Instead of matching against finite sets this filter works on NOT matching.
Not matching is like matching against an infinite set.

But, you might ask, where do I get a list of words and phrases that spammers never say? It's easier than you think. What I do is create a set of every word and phrase spammers do say and test to see if it's NOT in the list. In other words, I store all the words and phrases that are said in ham, and all the words and phrases that are said in spam. If the test message matches ham and doesn't match spam, it's ham. And if it matches spam and doesn't match ham, it's spam.

So - what do I mean by words and phrases? I take the subject and I break it down using sequential tokenization.

"the quick brown fox jumps over the lazy dog"

becomes ...

"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox" "brown fox" "quick brown fox" 
"the quick brown fox" "jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps" "over" "jumps over"
"fox jumps over" "brown fox jumps over" "the" "over the" "jumps over the" "fox jumps over the"
"lazy" "the lazy" "over the lazy" "jumps over the lazy" "dog" "lazy dog" "the lazy dog" "over the lazy dog"

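A minimal Python sketch of the sequential tokenization shown above. The function name `sequential_tokens` and the `max_len=4` cap are assumptions inferred from the listing, where no phrase is longer than four words; the filter's actual tokenizer is not shown on this page.

```python
def sequential_tokens(subject, max_len=4):
    """Return every run of 1..max_len consecutive words, grouped by the word each run ends on."""
    words = subject.split()
    tokens = []
    for end in range(1, len(words) + 1):
        # Phrases ending at this word, shortest first.
        for n in range(1, min(max_len, end) + 1):
            tokens.append(" ".join(words[end - n:end]))
    return tokens


tokens = sequential_tokens("the quick brown fox jumps over the lazy dog")
print(len(tokens))       # 30 phrases, in the same order as the listing
print(len(set(tokens)))  # 29 once stored in a set, because "the" occurs twice
```

Because the corpora are sets, repeated phrases collapse to a single fingerprint when stored.
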
Or - I might use combination tokenization:

"A B C D E" 

becomes ...

“A” “AB” “B” “C” “AC” “ABC” “BC” “D” “AD” “ABD” “BD” “CD”
“ACD” “ABCD” “BCD” “E” “AE” “BE” “CE” “ACE” “BCE” “DE” “ADE”
“ABDE” “BDE” “CDE” “ACDE” “ABCDE” “BCDE”

See Figure 3 (http://www.junkemailfilter.com/patent/patent3.pdf)
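Combination tokenization can be sketched in Python with `itertools.combinations`, which keeps the words in their original order while allowing gaps. This is an illustration under assumptions, not the filter's own code: `combination_tokens` and its `sep` parameter are hypothetical names, and the output order differs from the listing above. A full enumeration of five words gives 31 combinations; the listing above shows 29 ("ABE" and "ABCE" do not appear), and this page does not say whether that is intentional.

```python
from itertools import combinations


def combination_tokens(text, sep=""):
    """Return every order-preserving combination of the words in text."""
    words = text.split()
    tokens = []
    for size in range(1, len(words) + 1):
        for combo in combinations(words, size):
            tokens.append(sep.join(combo))
    return tokens


tokens = combination_tokens("A B C D E")
print(len(tokens))  # 31, i.e. 2**5 - 1
```

The number of combinations doubles with every extra word, so a real implementation would presumably need a cap on input length or combination size. For real words, `sep=" "` keeps the fingerprints readable.
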

I'm using a database called Redis. Redis lives in RAM and it's very fast. It is extremely good at set comparisons. So I can create a spam corpus and a ham corpus that contain millions of words and phrases, break the subject down into hundreds of fingerprints, compare them to both sets, and see what matches and what doesn't to get a result. The formula is as follows.

card(Test_message intersect Spam diff Ham) - card(Test_message intersect Ham diff Spam)
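The formula can be tried out with plain Python sets before involving Redis at all. The sketch below is only an illustration: the function name `evolution_score` and the tiny corpora are made up for the example. Redis offers the same set algebra through commands such as SINTER, SDIFF and SCARD, but how the production filter wires those together is not shown here.

```python
def evolution_score(test, spam, ham):
    """card(test & spam - ham) - card(test & ham - spam); positive leans spam, negative leans ham."""
    spam_only_hits = (test & spam) - ham
    ham_only_hits = (test & ham) - spam
    return len(spam_only_hits) - len(ham_only_hits)


# Toy corpora, invented for illustration. "dinner" is in both, so it is neutral.
spam = {"online pharmacy", "act now", "dinner"}
ham = {"let's get some dinner", "get some", "dinner"}

test = {"get some", "dinner", "let's get some dinner"}
score = evolution_score(test, spam, ham)
print(score)  # -2: two ham-only hits, no spam-only hits
```
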

This is a simplification of the concept. In real life I'm not just testing the subject. I'm testing the name part of the from address, the attachment file names and extensions, PHP scripts, parts of the message body, the header structure, and the behavior of the sender. All these produce scores that are combined into a result that is far more accurate than any method of sorting email ever developed.

The New Concept

The way the Evolution Spam Filter works is literally thinking outside the box. Every other filter is about matching something. One box has ham, one box has spam. Bayesian filters match the message with the two boxes to see which one is most similar. Or the filter is matching rules. Does it contain "Russian Brides"? If so - add penalty points.

The other filters are all about matching. The Evolution Filter is about NOT MATCHING. It's all about what the other side never says. While other filters match what's inside the box, the Evolution Filter matches what's outside the box. Instead of matching finite sets of known information, we are matching to the infinite set of unknown information. And we all know that an infinite set is always bigger than a finite set. How much bigger? Infinitely bigger! That's why it works so well.

Example of how NOT matching works

Let’s take 2 subject lines and see how this works.

“Meet hot Russian Brides Online!”
“I read an article about Russian Brides in a magazine”

A traditional spam filter using Bayesian scoring or hard-coded rules about “Russian Brides” might determine that only 1 out of 500 emails mentioning the phrase “Russian Brides” is a good email. Thus the second line would have points assessed against it in the classification process using these traditional methods.

Using the Evolution Filter the phrase “Russian Brides” is in both sets and therefore has no influence on the results. But the first subject matches these phrases in the Spam Only set.

“Meet hot”
“Meet hot Russian”
“Meet hot Russian Brides”
“hot Russian Brides Online!”
“Russian Brides Online!”
“Brides Online!”
“Online!”

The second subject matches these phrases in the Ham Only set that are never used in the spam set.

“I read an article”
“read an article”
“read an article about”
“about Russian”
“an article about”
“in a magazine”
“Brides in a”

So even though the phrase “Russian Brides” has no influence, each subject hits either ham or spam many times where the same phrase was never used in a subject line in the opposite set. And the number of hits is significant enough just from these subjects to cause the fingerprints to be learned, and that’s just looking at the Subject attribute. When this is combined with testing all attributes the messages usually come out strongly on one side or the other.
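Here is a small Python sketch of that effect, under loud assumptions: each corpus is built from just one of the two subject lines, the tokenizer caps phrases at four words, and the names `subject_fingerprints` and `evolution_score` are hypothetical. It shows “Russian Brides” cancelling out while the surrounding phrases decide the result.

```python
def subject_fingerprints(subject, max_len=4):
    """All runs of 1..max_len consecutive words, as a set (case and punctuation kept as-is)."""
    words = subject.split()
    return {
        " ".join(words[end - n:end])
        for end in range(1, len(words) + 1)
        for n in range(1, min(max_len, end) + 1)
    }


def evolution_score(test, spam, ham):
    """Spam-only hits minus ham-only hits; positive leans spam, negative leans ham."""
    return len((test & spam) - ham) - len((test & ham) - spam)


# Toy corpora built from one message each; a real corpus holds millions of phrases.
spam = subject_fingerprints("Meet hot Russian Brides Online!")
ham = subject_fingerprints("I read an article about Russian Brides in a magazine")

# The shared phrase sits in both sets, so it contributes nothing either way.
print("Russian Brides" in spam and "Russian Brides" in ham)  # True

spam_score = evolution_score(subject_fingerprints("Russian Brides Online!"), spam, ham)
ham_score = evolution_score(subject_fingerprints("an article about Russian Brides"), spam, ham)
print(spam_score)  # 3: "Online!", "Brides Online!", "Russian Brides Online!"
print(ham_score)   # -11: every phrase except the three shared ones is ham-only
```
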

In rule-based systems one would not normally build a whitelist rule to allocate points based on seeing the phrase “read an article about”. That’s where the Evolution Filter is different. It doesn’t need that rule: because it compares against the infinite set of what is not matched on the other side, it dynamically creates billions of rules automatically.

Learning by Association

For example, suppose I filter email for a machine dealer who sells “Machine A”. An email comes in from a trusted source and “Machine A” is learned as ham. And in millions of spams no one has ever mentioned “Machine A”. Then one second later someone else we filter for mentions “Machine A”. That message is classified as ham based on that one match. And if there are several matches on the ham side that are not matched on the spam side then the message fingerprints can be added to the ham sets.

Because the comparison method is based on matching one set and not matching the other the learning feedback system is a lot faster and has different characteristics than a traditional Bayesian filter.

In my previous example, “Machine A” has been learned as ham and never seen in spam. Someone sends an email inquiring about “Machine A” and “Machine B”. Because “Machine B” was never ever used in a spam then “Machine B” also becomes a blessed phrase. Anyone who uses “Machine B” in their email is passed as ham. (Unless spammers start spamming about “Machine B” which would revert it to neutral.)

Once a few phrases in one email message are matched to a new email then all the fingerprints of the new email are learned as ham. And the new fingerprints that are not already in the ham sets and are not in the spam sets become effectively new rules for identifying ham. The system learns how you talk, what you are interested in, and people in your life that are interested in the same things have their email passed and learned. Then their friends interact with them and the learning continues.

Consider this example, using the email subject only. Brackets enclose a newly learned phrase. Parentheses enclose the phrase matched from a previously learned email.

“Do you want to get some [lunch today]?” 
“Going to (lunch today) and [to see a movie] afterwards” 
“I want (to see a movie) about time travel, [are you interested]?”
“(Are you interested) in [getting together] after work at my place?”
“If more people were (getting together) to make a better world we would have [less poverty.]”
“[Better education] leads to (less poverty.)”

Even though the subject wanders, the matches keep feeding the learner because fingerprints are learned on one side that do not match fingerprints on the other side. In reality the subjects in the above example might produce 5 to 10 fingerprints each that were never used in spam.
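The chain of subjects above can be replayed with a toy learner in Python. Everything here is an assumption made for the sketch: the `Learner` class, the threshold of 3 hits, the four-word phrase cap, and the lowercasing and punctuation stripping (needed so that “[are you interested]?” matches “(Are you interested)”). The real filter combines many attributes, not just the subject.

```python
import re


def fingerprints(subject, max_len=4):
    """Lowercase, drop punctuation, and return all 1..max_len word phrases as a set."""
    words = re.findall(r"[a-z0-9']+", subject.lower())
    return {
        " ".join(words[end - n:end])
        for end in range(1, len(words) + 1)
        for n in range(1, min(max_len, end) + 1)
    }


class Learner:
    def __init__(self, threshold=3):
        self.ham = set()
        self.spam = set()
        self.threshold = threshold  # hypothetical: net hits required before learning

    def classify_and_learn(self, subject):
        fps = fingerprints(subject)
        ham_hits = len((fps & self.ham) - self.spam)
        spam_hits = len((fps & self.spam) - self.ham)
        if ham_hits - spam_hits >= self.threshold:
            self.ham |= fps   # every fingerprint of the message becomes a ham rule
            return "ham"
        if spam_hits - ham_hits >= self.threshold:
            self.spam |= fps
            return "spam"
        return "unknown"      # not confident enough to classify or learn


learner = Learner()
# Seed: the first subject arrives from a trusted source and is learned as ham.
learner.ham |= fingerprints("Do you want to get some lunch today?")

chain = [
    "Going to lunch today and to see a movie afterwards",
    "I want to see a movie about time travel, are you interested?",
    "Are you interested in getting together after work at my place?",
    "If more people were getting together to make a better world we would have less poverty.",
    "Better education leads to less poverty.",
]
results = [learner.classify_and_learn(subject) for subject in chain]
print(results)
print(learner.classify_and_learn("Cheap pills online"))
```

Each message passes only because the previous one was learned, which is the recursive part. A subject with no overlap on either side stays unclassified rather than being guessed at.
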

People who communicate by email usually have some sort of relationship and they talk about things they have in common. It’s the things that they have in common that causes not only the email to pass, but the differences in the messages to be learned as well.

Similarly on the spam side, there are only so many ways you can misspell Viagra, and the first time it catches a message with it deliberately misspelled then that spelling is learned and every spam that misspells Viagra the same way is caught. Traditional rule-based systems encourage spammers to misspell words so they don’t match the rule. With the Evolution classifier the misspelling is what gives them up because people who send good email never misspell, for example, Viagra as Viiaggra.

Spammers also want you to do something. It’s a business model, and there are only so many scams out there and only so many ways to describe them. So as the system learns the phrases that only spammers use, it becomes easy to detect new scams based on older but similar scams.

As the recursive learning continues it separates out what are essentially 2 different cultures and languages: things that only people who send good email talk about and things that only spammers talk about. As these sets grow the accuracy increases and fewer and fewer messages go unrecognized.

Example Data

To illustrate this concept of NOT matching I present the following lists. It would take 9 million subject rules to duplicate this. The Evolution Filter writes and scores its own rules.

Here is a list of 100,000 words and phrases used in the subject line of HAM and never seen in the subject line of SPAM.

Here is a list of 100,000 words and phrases used in the subject line of SPAM and never seen in the subject line of HAM.

Foreign Languages

One of the advantages of the Evolution Filter is that it makes its own rules. It doesn't need to know anything about the language a message is written in. As long as it has enough samples it can figure out on its own if something is spam or ham. It can be French, Spanish, German, Russian, or even Klingon spam. It can actively detect good email and spam just through the learning process.

Examples of Klingon spam:

Discount Ferengi Tooth Sharpeners
Get your deceased relative into Stovokor Now!
Orion slave girls waiting to meet you!
Are you a descendant of the House of Kahless?
Enhance your Manhood! Romulan Pharmacy Online!
Bottom Price Risa Vacation, Jamaharon included!!

Protecting Good Email

Most spam filters focus on blocking spam. They identify spam, and whatever they fail to identify is treated as ham. The Evolution Filter can actively identify both spam and ham. In fact it is actually better at positively identifying good email than bad email. The trick to identifying good email is detecting words and phrases that spammers never say.

Although the name of the spam filtering business is blocking spam, the real mission is to not block good email. If a few spams sneak through to a customer it's no big deal. But if I block important emails then people get upset. Spam filters that focus on identifying spam need to be somewhat weak so that they don't misidentify good email as spam. However, the Evolution Filter can positively identify good email strongly enough to override the results of spam detection and save the good email from being rejected.

How Well does the Evolution Filter Work?

Our spam filtering system has always worked very well. Unlike most filtering systems, which focus on blocking spam and treat what is left over as good, we also focused on ways to positively identify good email. In this business if a few spams get through it's not a big deal. But when good email is blocked it's far worse. So we always looked for ways not just to block spam, but to look for ham. We wanted to be able to positively identify good email.

Good email was harder to identify, but it could be done to a limited extent and we could classify about half of all good email as good with very little spam sneaking through. But half isn't good enough.

The strong point of the Evolution Filter is that it is very good at positively identifying good email. It is also better at identifying spam than any other method but when it comes to protecting good email, that's where it really shines. The accuracy rate is so good it's scary. This doesn't just block spam, it decimates spam. If this was widely adopted many spammers would get out of the business because it will cut their delivery down so much that it won't be profitable for them anymore.

The thing about spammers is that spammers always want you to do something. So they have to make the sales pitch in the message to convince you to do what they want you to do. They can't get away from that. This system learns all the scams and all the different ways to describe the same things and it can identify them. Spammers may be able to fake a few things but they can't fake it all. Unlike rule-based systems like SpamAssassin, which spammers can download to test their messages against the rules, the Evolution system creates millions of rules on its own and the spammers have no idea what to match because it's a not-matching system.

So what I'm seeing is this: less spam is getting through and less good email is getting blocked. But another factor is that the confidence we have in passing good email and blocking bad email has greatly increased. We aren't passing as much simply because it has no score. We are seeing about a 95% rate of positively identifying good email as good. On the spam side we were already over 99% positively identified as spam but the scoring is even stronger now and it's picking up more of the ones we weren't sure about and classifying them.

At first I had these systems running in parallel with the idea that I would use the old system to identify problems in the new system. It started working the other way. The new system was finding problems in the old system. This has allowed me to eliminate rules that I thought were better than they really are.

The new system isn't something that replaces the old but is built on it. Since the new system is really just a simple AI, it really has no concept by itself as to what is good email and what isn't. The old system has the ability to positively identify most email as good or bad with high confidence. This high confidence email is fed directly into the learner and creates a "moral compass" as to the difference between good and evil. From there the Evolution Filter figures out the messages that are in the middle, the ones that couldn't be classified by other methods.

Potential for Improvement

Although the filter works extremely well as is, I think there's a lot of room for improvement. I think that the Evolution Filter has the potential for at least a 10x improvement over what I have achieved so far. The accuracy is so good it's scary and it might actually be the FUSSP (Final Ultimate Solution to the Spam Problem). This can actually put spammers out of business if improved and widely implemented.

The Patent

I've decided NOT to patent this. But here is what I wrote up.

You might be wondering - why patent this? The simple answer is that if I make a huge leap in spam filtering technology that saves the world trillions of dollars, my reward shouldn't be to put myself out of business.

This filtering method is the most accurate method on the planet and the most resilient to being defeated by spammers. No one else has done this because if they had done it - everyone would be using it.

My plan is to make it free to most everyone and charge a reasonable license fee to the big providers and my competitors. But it will be highly profitable for my competitors to license it from me as the cost of the license would be far less than the gains in business and customer satisfaction.

Here are the details on my provisional patent:

Licensing

This no longer applies. Not doing the patent. Released under GPL2. If you improve it you have to share your improvements. But feel free to donate some money to me if you find this useful. Paypal to paypal@churchofreality.org or billing@junkemailfilter.com

Some people said I should just keep this method a secret and just get more business by having the best spam filter on the planet. But I feel that releasing this method to the public will make trillions of dollars of difference in the world economy, saving billions of people from being scammed or at least from wasting hours deleting junk email. And I also really really hate spam. This new method is a game changer and if widely adopted can actually put spammers out of business. Their delivery rate will be so low that it won't be profitable anymore.

How to Implement this Spam Filtering Method

You might be wondering - why don't I release my code? My code is highly integrated into my system. It's a mix of many languages including Exim scripts, PHP, Pascal, Perl, and Bash scripts. In other words - it's a mess.

I myself am a good programmer but not a great programmer. I would describe myself as both innovative and sloppy. So now that I've come up with the method, I think other programmers can do a better job of implementing it than I can. Having pioneered this I can share what I've learned so far to put you on the right track. And you will probably do a better job by taking the time to do it right. Quite frankly - I could use some help with a few pieces of this.

Redis is the core

If you are not familiar with Redis you'll need to learn it. Redis is easy. Redis is called a NoSQL database. It has these features, which are essential to making this work:

  1. It lives entirely in RAM - therefore it is extremely fast
  2. It is extremely fast specifically at doing set operations
  3. Many common programming languages have Redis interfaces

At the core of the process is fast set comparisons. The new messages are tokenized. The tokens are combined into fingerprints. The fingerprints are compared to the spam and ham sets to get a result. Then if there is a definitive result the fingerprints are learned by adding them to the Redis sets.

Redis also stores the fingerprints as separate variables with a count indicating how many times the fingerprint was learned. The fingerprints are also set to expire (I'm using 1,000,000 seconds) and if not refreshed they will be expired and eventually removed from the sets. The system needs to be able to forget old data.
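To make the counting and forgetting concrete, here is a small sketch that uses plain Python objects as a stand-in for Redis so it runs anywhere. The class and method names are invented for this example. On a real Redis server the same steps would roughly map to SADD for the set, INCR and EXPIRE on a per-fingerprint key, and SREM when a fingerprint has gone stale, but how you wire that up is an implementation choice.

```python
import time

TTL_SECONDS = 1_000_000  # the expiry used in the text

class FingerprintStore:
    """In-memory stand-in for one Redis set plus per-fingerprint counters."""

    def __init__(self, clock=time.time):
        self.clock = clock       # injectable clock, handy for testing
        self.members = set()     # plays the role of the Redis set
        self.counts = {}         # fingerprint -> number of times learned
        self.expires_at = {}     # fingerprint -> expiry timestamp

    def learn(self, fingerprints):
        """Add fingerprints to the set, bump their counts, refresh their expiry."""
        now = self.clock()
        for fp in fingerprints:
            self.members.add(fp)
            self.counts[fp] = self.counts.get(fp, 0) + 1
            self.expires_at[fp] = now + TTL_SECONDS

    def forget_expired(self):
        """Remove fingerprints that were not relearned before they expired."""
        now = self.clock()
        stale = [fp for fp, t in self.expires_at.items() if t <= now]
        for fp in stale:
            self.members.discard(fp)
            del self.counts[fp]
            del self.expires_at[fp]
        return stale
```

A fingerprint that keeps showing up keeps getting its expiry pushed forward, while one that was seen once quietly drops out after 1,000,000 seconds.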

On my system I have data save to disk set at 2 hours. But I have a backup system that forces a save every hour and takes a snapshot. It also does daily and weekly snapshots to allow me to wind back the clock should I need to do so.

Definitions

Taking this from my Provisional Patent:

In order to talk about the Evolution Classifier, some terminology needs to be defined.

Item - An item is the thing that is being classified. In this description of how the system works we will be using email messages as the item example. However an item is anything that can be classified. It could be blog posts. It could be photographs. It could be the list of things a person bought at a grocery store.

Stream - A stream is the line of items coming into the system for classification. A stream of items comes into the system, the items are classified, and then they are output to their respective destinations. When items are classified with high confidence as (for example) ham or spam, these items are run through the learner where the items' fingerprints are stored. These learning streams might be referred to as the “ham stream” or the “spam stream” which feed the ham and spam corpi.

Corpus - A corpus is a set or collection of fingerprints, or a collection of sets of fingerprints (Figure 1A), that has been learned and is used as a reference for comparing and evaluating new items. There would generally be 2 or more distinct corpi, in this example the spam corpus and the ham corpus. Information from the input stream is classified, and one result is that new information is added to the corpus while old and wrong information is removed from the corpus by unlearning or forgetting. (See Figure 1A)

Token - A token is a unit of information that is either contained in the item or known about the item. In the case of email the words in the email are tokens. The message headers are also tokens. Other tokens can be generated by what’s not there or missing. “Missing Subject” is a token. “Failed to close the connection”, “Received on backup server”, “slow data rate”: all of these pieces of information, which include behavioral information, are regarded as tokens. (See Figure 1C)

Fingerprint - A fingerprint is an individual token or a combination of tokens extracted from the item's tokens. Tokens are often extracted in a manner that increases the likelihood that they will be relevant to the classification process. Tokens can be used individually and/or combined into fingerprints by selecting 2 word, 3 word, 4 word, etc. phrases sequentially or by combining the characteristics of the item into groups of 2 characteristics, 3 characteristics, 4 characteristics, etc. The fingerprints are then used to compare the item to the fingerprints in the corpi to classify the item. After classification the item's fingerprints might be added to the matching corpi, or deleted from the non-matching corpi to unlearn the common fingerprints. (See Figure 1C)

Attributes - Attributes are fingerprint categories that separate item fingerprints into different classifications for separate comparison. For example in email the Subject would be an attribute. The message body would be a different attribute. The message headers would be yet another. Attached file names, the name part of the From address, the text inside of links within the message, the names and paths of the PHP script generating the message, as well as facts about the behavior of the email sender, are also attributes. (See Figure 1B)

Attributes contain different kinds of information and are often processed differently. An email subject, for example would be fingerprinted sequentially. (Figure 3A) A behavior attribute would be fingerprinted by generating fingerprints of all combinations of behavior looking for matching combinations stored in the corpi. (Figure 3B)

Generally attributes are fingerprinted separately, stored separately, and compared separately. Although the subject line and the body are both text they can be processed as separate independent sets. So the ham and spam corpi are not necessarily just one set each but a collection of sets of fingerprints of different attributes. (Figure 1A)

Different attributes might be fingerprinted differently. For example - the subject line and parts of the body would be fingerprinted sequentially, in the same order that the words appear, as 1, 2, 3, 4, … word phrases. The number of tokens combined in the subject fingerprints might be different than in body fingerprints. Behavioral characteristics might be fingerprinted using all combinations of the tokens in groups of 1, 2, 3, 4, … tokens. This allows different attributes to be tokenized differently and compared to their individual sets. (Figure 3A and 3B)

Sets - A set is a mathematical concept that is basically a collection of things. Sets can be compared in basically 3 ways. Set union is the sum of both sets (A or B). Set intersection is the set of members that are in both sets (A and B). And set difference is one set minus the items in the other set. There can be 2 difference operations with 2 sets: Set A minus Set B, and Set B minus Set A. (Figure 5A and 5B)
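For readers who think better in code, here are the same operations using Python's built-in sets. The two example sets are made up for illustration.

```python
a = {"lunch today", "free money", "see a movie"}
b = {"free money", "act now"}

union = a | b          # everything in A or B
intersection = a & b   # only what is in both A and B
a_minus_b = a - b      # in A but not in B
b_minus_a = b - a      # in B but not in A
```

The two difference operations are the interesting ones here, since they give you what is on one side and not on the other.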

Tokens, Fingerprints, and Attributes

The test process is about taking an email and creating fingerprints from it. As many fingerprints as possible from the parts of the emails that are relevant. I am currently fingerprinting these attributes:

  • Subject
  • First 50 words of the Body
  • Name part of the From address
  • PHP script names where path elements are broken down as separate tokens
  • Attached file names
  • The Header Structure
  • Text inside Links
  • Behavioral Attributes

Each attribute has a ham set and a spam set. So I have 16 sets right now. The fingerprints are also stored separately in Redis outside the sets, where a prefix is applied to the data to distinguish its attribute and spam status. This is for the purpose of expiring data and creating what I call the "nearly" sets.

Nearly sets are created once an hour from fingerprints that appear as both ham and spam, but where the count is far higher on one side than the other. My ratio is 10x the smaller side squared: 10 to 1, 40 to 2, 90 to 3, etc. The nearly sets allow for scoring of lopsided ratios, but I score them at 1/2 the weight of the main process, which looks for fingerprints that are in only one set and not in the other.
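The ratio rule works out to "the larger count must be at least 10 times the square of the smaller count". A rough sketch of that check follows. The function names are made up, the counts are passed in as plain dictionaries rather than read from Redis, and treating the threshold as inclusive (exactly 10 to 1 qualifies) is my reading of the examples above.

```python
def nearly_side(ham_count, spam_count):
    """Return "ham" or "spam" if the counts are lopsided enough, else None."""
    small, large = sorted((ham_count, spam_count))
    if small < 1 or large < 10 * small ** 2:
        return None  # only on one side, or not lopsided enough
    return "ham" if ham_count > spam_count else "spam"

def build_nearly_sets(ham_counts, spam_counts):
    """Scan fingerprints present on both sides and sort out the lopsided ones."""
    nearly_ham, nearly_spam = set(), set()
    for fp in ham_counts.keys() & spam_counts.keys():
        side = nearly_side(ham_counts[fp], spam_counts[fp])
        if side == "ham":
            nearly_ham.add(fp)
        elif side == "spam":
            nearly_spam.add(fp)
    return nearly_ham, nearly_spam
```

Squaring the smaller side means a fingerprint seen 3 times as ham has to be seen 90 times as spam before it counts as nearly spam, so the bar rises quickly as the evidence on the weaker side grows.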

I'm using a maximum fingerprint size of 4 tokens, which I think is a balance between a good amount of data and data overload. The first 7 attributes in the above list are fingerprinted as sequential tokens. The behavior is fingerprinted as combination tokens. (Figure 4)
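Here is one way the two fingerprinting styles could look in Python. This is a sketch and not my actual code. The function names and the separator characters are assumptions, and sorting the behavior tokens before combining them is my own choice so that the order in which behaviors are noticed doesn't matter.

```python
from itertools import combinations

MAX_TOKENS = 4  # maximum fingerprint size from the text

def sequential_fingerprints(tokens, max_tokens=MAX_TOKENS):
    """Phrases of 1..max_tokens adjacent tokens, in their original order."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_tokens + 1)
        for i in range(len(tokens) - n + 1)
    }

def combination_fingerprints(tokens, max_tokens=MAX_TOKENS):
    """Every group of 1..max_tokens distinct tokens, regardless of order."""
    unique = sorted(set(tokens))
    return {
        "|".join(combo)
        for n in range(1, max_tokens + 1)
        for combo in combinations(unique, n)
    }
```

A 5 word subject with no repeated phrases produces 14 sequential fingerprints, while 3 behavior tokens produce 7 combination fingerprints. Combination fingerprinting grows much faster with the number of tokens, which is one reason to cap the fingerprint size.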

Training the system - The Moral Compass

This system relies on an external filtering system that can independently identify a significant portion of the incoming email stream as definitely ham or spam, and those streams are fed directly into the learning system. I'm using spam bait which gets me a lot of 100% pure spam and I have a very large whitelist of known sources of 100% pure ham. These messages are all learned, creating the base corpus sets against which new email is compared. (Figure 2)

The system learns very fast. I have a lot of email flow and in just 3 hours of training the filter was doing very well. However after a few weeks it got really good. As new messages score highly one way or the other, these new messages' fingerprints are added to the corpi. The new fingerprints expand the vocabulary of the corpi so you get a sort of "friends and family" effect. It learns how normal people talk and it learns how spammers talk and the more it learns the more messages it can identify.

When the system is mostly untrained it doesn't make a lot of mistakes in that it rarely misclassifies something. What happens when you don't have enough training is that it doesn't classify the messages at all. Thus as it learns the accuracy does improve, but it gains mostly in that fewer messages go unscored, and those messages that are scored get a stronger score.

Having multiple attributes adds to the accuracy. For example, if someone forwards a spam to me at support the good attributes will prevent the forwarded email from being classified as spam. However, my support email account is flagged as a "do not learn" account to prevent contaminating the corpi. Do not learn logic will help improve accuracy.

Besides being added to the sets, the fingerprints are stored separately. Each one has a counter, and when it is relearned the counter is incremented by 1 and the expiration is reset to 1,000,000 seconds (about 11.6 days). This creates a score for that fingerprint that is used to create the "nearly" sets and to eliminate orphans, reducing the size of the corpi.

Email Testing System

Email is first tested by more traditional methods to detect high probability spam and ham (as described previously). These messages are routed straight to the learning system. The idea is to peel off messages from both ends and then try to classify what's in the middle. After the other high confidence tests are performed, the message is ready for the Evolution Classifier.

The message is separated into attributes. The attributes are tokenized and then the tokens are combined into fingerprints. Each attribute is processed separately. I'll start with the subject as an example.

The subject is fed into Redis, creating a test set of subject fingerprints on the Redis server. This is a temporary set that is deleted immediately after the test. The test set is intersected with the subject ham set creating a (Test intersect Ham) set and it is intersected with the subject spam set creating a (Test intersect Spam) set.

Then two set diffs are performed on those intersect sets: Ham minus Spam and Spam minus Ham. Count the members on both sides and subtract the ham count from the spam count. If the result is positive, it's spam. If it's negative, it's ham. The temp sets are then deleted. The process is repeated for all 8 attributes.
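The test for a single attribute can be sketched with plain Python sets standing in for the Redis sets. The function name is made up for the example. On a Redis server the same steps could be done with commands such as SINTERSTORE, SDIFF and SCARD, but that mapping is a suggestion and not a description of my code.

```python
def score_attribute(test_set, ham_set, spam_set):
    """Score one attribute: positive leans spam, negative leans ham."""
    test_ham = test_set & ham_set      # (Test intersect Ham)
    test_spam = test_set & spam_set    # (Test intersect Spam)
    ham_only = test_ham - test_spam    # matched ham and never seen as spam
    spam_only = test_spam - test_ham   # matched spam and never seen as ham
    return len(spam_only) - len(ham_only)
```

Fingerprints found on both sides cancel out and contribute nothing, and fingerprints found on neither side are ignored. Only the ones seen on exactly one side move the score.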

To get more precision the system also creates what I call the "nearly" sets. Once an hour the system iterates over the fingerprints that appear in both the spam and ham sets. In this system fingerprints appearing in both sets have a neutral score as they are both ham and spam. However if the ratios are extreme (100 to 1 spam) this picks them up.

What I do is go through each member of the intersect set looking for large ratios and add those fingerprints to the nearly ham only and nearly spam only sets. After the main test is done the test set is compared to the nearly sets for scoring. I'm dividing the score by 2 right now but probably will adjust it after more testing.

After all attributes are tested the scores are added up. I'm also applying a nonlinear expansion to the result because the farther you get from 0 the higher the confidence is. I'm taking the result to the 7/5 power right now. I then add that to some traditional scoring methods. It tests to see if the score is significant enough to classify or learn. If it is, I'm done. If not then I fall back to desperation and let SpamAssassin and RSPAMD have at it. If that doesn't result in a blocking score - I pass it.
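A sketch of how the scores might be combined follows. Several things here are assumptions for the sake of the example: the function names, applying the power to the total rather than to each attribute, folding the half-weight nearly scores into the same total, and keeping the sign of the score so that ham (negative) scores are expanded the same way as spam (positive) scores.

```python
import math

def expand(raw_score, exponent=7 / 5):
    """Raise the size of the score to the 7/5 power while keeping its sign."""
    return math.copysign(abs(raw_score) ** exponent, raw_score)

def combined_score(attribute_scores, nearly_scores=(), nearly_weight=0.5):
    """Add up attribute scores, add nearly-set scores at half weight, expand."""
    raw = sum(attribute_scores) + nearly_weight * sum(nearly_scores)
    return expand(raw)
```

With this exponent a raw score of 4 becomes about 7 and a raw score of 32 becomes about 128, so messages far from 0 are pushed much further out than messages near the middle.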

Implementing it in Open Source Packages like SpamAssassin and RSPAMD

The reason I'm documenting the details of this is to get the open source world excited about adding it to their systems. As I said, I hate spam. So yes - I'll lose some business over this but some things are worth losing business over.

SpamAssassin is a natural for implementing this because:

  1. They are already using Redis so this will be easy
  2. As a spam filter SpamAssassin is better than nothing, this will make it 1000 times as accurate
  3. If SpamAssassin made fingerprints out of the names of the rules hit and did a combination fingerprint, it would become self scoring and it would automatically write and score its own combination rules.

New rules can be written just to notice things and without having to score the rule. The Evolution Filter will do the combining and scoring automatically. Examples would be:

  • Greetings Stranger
  • References Lots of Money
  • References Religion
  • References Diseases
  • References Royalty or other Important People
  • Guarantees Results
  • References SEO
  • References Your Account
  • References Drugs
  • References Winning something
  • References Bank Accounts
  • References Stocks
  • References Sex
  • References Trust Me
  • References Gambling
  • Marketing Words
  • Urgency Words
  • Sales Language

The Evolution learner will make fingerprints out of all these names and store all combinations of tokens as spam or ham and when compared it will notice combinations showing up on one side and not the other. For example [greetings stranger, lots of money, bank accounts, religion, urgency] that's spam. So I'm hoping that the good folks over at SpamAssassin who have been working hard for many years fighting the good fight will see this and say YES! OMG!
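Here is a toy version of that idea. The rule names are hypothetical and are not real SpamAssassin rule names, and the tiny corpora are invented for the example. The point it shows is that each rule on its own can be neutral because it has been seen in both ham and spam, while the combination has only ever been seen in spam.

```python
from itertools import combinations

def rule_fingerprints(rules_hit, max_rules=4):
    """Every combination of 1..max_rules rule names, independent of order."""
    names = sorted(set(rules_hit))
    return {
        "+".join(combo)
        for n in range(1, max_rules + 1)
        for combo in combinations(names, n)
    }

spam_corpus = rule_fingerprints(["GREETINGS_STRANGER", "LOTS_OF_MONEY", "BANK_ACCOUNTS"])
ham_corpus = (
    rule_fingerprints(["LOTS_OF_MONEY", "BANK_ACCOUNTS"])  # e.g. a real bank notice
    | rule_fingerprints(["GREETINGS_STRANGER"])            # e.g. a polite first contact
)

new_message = rule_fingerprints(
    ["GREETINGS_STRANGER", "LOTS_OF_MONEY", "BANK_ACCOUNTS", "URGENCY"]
)
spam_only = (new_message & spam_corpus) - ham_corpus
ham_only = (new_message & ham_corpus) - spam_corpus
```

Nobody had to write or score a rule saying "stranger plus money plus bank accounts is spam". The three combinations left in spam_only do that job on their own.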

Similarly, RSPAMD is not as developed as SpamAssassin but it is much faster. Written in C rather than Perl, it flies. So I'm hoping they also will pick this up and incorporate it in their product. RSPAMD doesn't use Redis, but I learned it and it's really easy. No reason why they can't do it too.

Help from me

If you are a developer and you are implementing this I am willing to assist you in your development. I obviously can't cover all the details here but contact me at support@junkemailfilter.com with any questions you might have.
