Concept Parsing Spam Filter

From Computer Tyme Support Wiki


This is a spam filtering trick I'm using. It isn't part of SpamAssassin (SA), but it could easily be adapted to SA.

Rather than just scanning for regex strings, it's useful to have a way to tell what the message is talking about and reduce each of those things to a single token that represents a concept. Then the concepts can be combined to produce rules.

For example, take your typical 419 scam. Generally it will have these kinds of characteristics:

dear stranger
i need your information
offers lots of money
dying of something
worships god
bank account
transfer money
reply to me
trust me
united nations
western union

So the idea is to reduce the message to a string of characteristics and then combine those characteristics into rules, even though each characteristic by itself is harmless. So - here's my format for extracting characteristics.

I create files in which each line is an individual regular expression. The name of the file becomes the name of the token. All these files live in a single directory, and every file in it is read and processed. The first line of the file is the number of matches required to trigger the token. That way, if there are a few false matches, it's no problem. Here's an example:

File name: lose-weight

3
burn fat
clinically proven
every woman
extra weight
fat burn
just days
lose weight
natural weight
thick legs
tighten up
weight (rapidly|fast)
weight loss
weight reducing

The first line is the number 3 which means that 3 lines have to match to trigger the token. If 3 lines match then the word "lose-weight" is printed to the output stream.
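
To make the threshold rule concrete, here is a minimal Python sketch of how one such file could be evaluated. It is not the PHP program shown later on this page. The function name `token_fires` and both sample messages are made up for illustration; the patterns are a subset of the lose-weight file above.

```python
import re

def token_fires(file_lines, message):
    """Return True when at least `threshold` patterns match the message.

    file_lines[0] is the required match count; each remaining
    non-empty line is one regular expression.
    """
    threshold = int(file_lines[0].strip())
    patterns = [line.strip() for line in file_lines[1:] if line.strip()]
    # Each pattern counts at most once, however often it occurs.
    hits = sum(1 for p in patterns if re.search(p, message, re.IGNORECASE))
    return threshold > 0 and hits >= threshold

# A subset of the lose-weight file, with the threshold as its first line.
lose_weight = [
    "3",
    "burn fat",
    "clinically proven",
    "lose weight",
    r"weight (rapidly|fast)",
    "weight loss",
]

# Invented sample messages.
spam = "Clinically proven pills: burn fat and lose weight fast!"
ham = "I want to lose weight before the race."

print(token_fires(lose_weight, spam))
print(token_fires(lose_weight, ham))
```

The first message matches four of the patterns, so the token triggers. The second matches only "lose weight", which is below the threshold of 3, so nothing is emitted for it.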

This next file is named "stranger"

my name is
(dear|attn|attention) .{0,10}(friend|stock|IT |Internet|candidate|sirs?|madam|partner|investor|bel$
introdic(e|ing) (myself|ourselves)
(hi|hello) (dear|friend)
i am (a|an)\b
(i am|i'm) (mr.|ms.|mrs)
my dearest
introduc(e|ing) myself
hi there
good day
contacting you
my dear one
contact(ed|ing) you

As you can see - the concept I'm extracting is that they don't know me.

Here's my "lots-of-money" file:

the sum of
(billion|million|thousand) .{0,20}(dollars|pounds|euros|usd)
,000 (usd|euros)
this money
(united state(s)?|us|american) dollars
pounds sterling
british pounds
huge amount of money

Some result strings I get from what I have so far:

accountant email-adr friend https investor law lotsofmoney maillist mailto phone-num trust
cialis click css deals drugs email-adr html http maillist mailto optout phone-num regards
click contact css details doitnow email-adr http https marketing optout phone-num price privacy remote-img
claim click css dear email-adr guarantee html http mailto phone-num privacy reply2me security
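
One way to act on strings like these is a table of combination rules, where a rule fires only when all of its tokens are present. The Python sketch below is an illustration, not part of my actual setup: the rule names and token groupings are invented, and only the token names come from the output above.

```python
# Hypothetical rules: rule name -> set of tokens that must all be present.
RULES = {
    "advance-fee": {"lotsofmoney", "trust", "investor"},
    "pharma": {"cialis", "drugs", "click"},
}

def matching_rules(token_string, rules=RULES):
    """Return the sorted names of rules whose tokens all appear in the string."""
    tokens = set(token_string.split())
    # `needed <= tokens` is a subset test: every required token is present.
    return sorted(name for name, needed in rules.items() if needed <= tokens)

line1 = ("accountant email-adr friend https investor law lotsofmoney "
         "maillist mailto phone-num trust")
line2 = ("cialis click css deals drugs email-adr html http maillist "
         "mailto optout phone-num regards")
line3 = ("click contact css details doitnow email-adr http https marketing "
         "optout phone-num price privacy remote-img")

for line in (line1, line2, line3):
    print(matching_rules(line))
```

No single token in a rule is suspicious on its own. Only the combination counts, so the third line, which has "click" but neither "cialis" nor "drugs", matches nothing.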

The code that does this is very simple. I coded it in PHP, but it would be trivial to convert to Perl. Here's the entire program:


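<?php
// Read the whole message from stdin and lowercase it.
$message = file_get_contents ('php://stdin');
$message = strtolower($message);

// Each file in the content directory defines one token.
$dir = scandir('/etc/exim/control/content');
foreach ($dir as $file) {
   if (strlen($file) > 2) {   // skips "." and ".."
      Scan($file);
   }
}

// Print the file name as a token if enough of its regex lines match.
function Scan ($file) {
global $message, $count;
   $reg = file ('/etc/exim/control/content/'.$file);
   $count = 0;
   $trigger = intval($reg[0]);   // first line: number of matches required
   $reg[0] = '';                 // blank it so it is not used as a regex
   foreach ($reg as $regline) {
      $regline = trim($regline);
      if (($regline) and (preg_match('/'.$regline.'/i',$message,$matches))) {
         $count++;
      }
   }
   if (($trigger > 0 ) and ($count >= $trigger)) {
      echo "$file ";
   }
}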


You could write combination rules or feed these tokens into Bayes to make it self-scoring. I'm throwing them into the AI system I developed a few months ago, which does all the combining and scoring for me. But I think Bayes should have a similar effect.
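
As a rough idea of what the Bayes route could look like, here is a small Python sketch of a naive Bayes scorer that treats each concept token as a word. It is not SpamAssassin's Bayes and not my AI system. The class name `TokenBayes`, the training strings, and the "bank-account" token are all invented for illustration, and it only weighs tokens that are present, which is a simplification.

```python
import math
from collections import Counter

class TokenBayes:
    """Tiny naive Bayes over concept tokens with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, token_string):
        self.docs[label] += 1
        # Count each token once per message (presence, not frequency).
        self.counts[label].update(set(token_string.split()))

    def spam_probability(self, token_string):
        # Start from the smoothed prior odds, then add per-token evidence.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in set(token_string.split()):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

# Made-up training data: token strings already labeled by hand.
bayes = TokenBayes()
bayes.train("spam", "stranger lotsofmoney trust reply2me")
bayes.train("spam", "stranger lotsofmoney bank-account")
bayes.train("ham", "regards phone-num maillist")
bayes.train("ham", "regards price phone-num")

print(bayes.spam_probability("stranger lotsofmoney trust"))
print(bayes.spam_probability("regards phone-num"))
```

With this toy training set, the first string scores well above 0.5 and the second well below it. The point is that the scorer learns which concept combinations matter without anyone writing the rules by hand.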

Just sharing this in case anyone finds it useful.
