Concept Parsing Spam Filter
From Computer Tyme Support Wiki
This is a spam filtering trick I'm using. It isn't SpamAssassin (SA), but it could easily be adapted to SA.
Rather than just scanning for regex strings, it's useful to have a way to tell what things the message is talking about and reduce each of those to a single token that represents a concept. The concepts can then be combined to produce rules.
For example, let's take your typical 419 scam: Generally it will have these kinds of characteristics.
    dear stranger
    i need your information
    offers lots of money
    dying of something
    worships god
    bank account
    transfer money
    reply to me
    trust me
    africa
    united nations
    western union
So the idea is to reduce the message to a string of characteristics and then combine those characteristics into rules, even though each characteristic by itself is harmless. So - here's my format for extracting characteristics.
I create files in which each line is an individual regular expression. The name of the file becomes the name of the token. All these files are kept in a single directory, and all of them are read and processed. The first line of each file is the number of matches required to trigger the token. That way a few false matches are no problem. Here's an example:
File name: lose-weight
    3
    attractive
    breakthrough
    burn fat
    clinically proven
    diet
    every woman
    exercise
    extra weight
    fat burn
    flab
    flabby
    formula
    jiggly
    just days
    lose weight
    medication
    metabolism
    natural weight
    obesity
    overweight
    slim
    thick legs
    tighten up
    tummy
    weight
    weight (rapidly|fast)
    weight loss
    weight reducing
The first line is the number 3 which means that 3 lines have to match to trigger the token. If 3 lines match then the word "lose-weight" is printed to the output stream.
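To make the file format concrete, here is a minimal Python sketch of the matching rule for a single concept file. It is only an illustration, not the PHP program shown further down: the function name `concept_triggers`, the shortened file contents, and the sample message are all made up for the example, and it assumes Python's `re` module behaves closely enough to PCRE for simple patterns like these.

```python
import re

def concept_triggers(file_text, message):
    """Return True if enough regex lines of one concept file match the message."""
    lines = file_text.splitlines()
    if not lines:
        return False
    # First line: number of matching lines required to trigger the token.
    try:
        threshold = int(lines[0].strip())
    except ValueError:
        threshold = 0
    if threshold <= 0:
        return False
    # Every remaining non-empty line is one regular expression.
    patterns = [ln.strip() for ln in lines[1:] if ln.strip()]
    text = message.lower()
    # Count how many lines match at least once (not the total number of hits).
    hits = sum(1 for p in patterns if re.search(p, text, re.IGNORECASE))
    return hits >= threshold

# Shortened, hypothetical version of the lose-weight file.
LOSE_WEIGHT = "3\nburn fat\ndiet\nlose weight\nmetabolism\nweight (rapidly|fast)\n"

if __name__ == "__main__":
    msg = "New formula! Burn fat and lose weight fast with this diet."
    print(concept_triggers(LOSE_WEIGHT, msg))
```

Note that a line counts once no matter how often it matches, so a single repeated phrase cannot trigger the token on its own.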
This next file is named "stranger":
    1
    my name is
    (dear|attn|attention) .{0,10}(friend|stock|IT |Internet|candidate|sirs?|madam|partner|investor|bel$
    introdic(e|ing) (myself|ourselves)
    (hi|hello) (dear|friend)
    i am (a|an)\b
    (i am|i'm) (mr.|ms.|mrs)
    greetings
    my dearest
    introduc(e|ing) myself
    hi there
    \bhi,
    hello[,!]
    greeting
    good day
    contacting you
    my dear one
    contact(ed|ing) you
As you can see - the concept I'm extracting is that they don't know me.
Here's my "lots-of-money" file:
    2
    the sum of
    (billion|million|thousand) .{0,20}(dollars|pounds|euros|usd)
    ,000 (usd|euros)
    gold
    this money
    (\d\d0,000|\d,\d00.00)
    (united state(s)?|us|american) dollars
    pounds sterling
    british pounds
    us(d)?\$
    huge amount of money
Some result strings I get from what I have so far:
    accountant email-adr friend https investor law lotsofmoney maillist mailto phone-num trust
    cialis click css deals drugs email-adr html http maillist mailto optout phone-num regards
    click contact css details doitnow email-adr http https marketing optout phone-num price privacy remote-img
    claim click css dear email-adr guarantee html http mailto phone-num privacy reply2me security
The code that does this is very simple. I coded it in PHP, but it would be trivial to convert to Perl. Here's the entire program:
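    <?php
    // Read the whole message from stdin and lowercase it once.
    $message = file_get_contents('php://stdin');
    $message = strtolower($message);

    // Every file in this directory defines one concept token.
    $dir = scandir('/etc/exim/control/content');
    foreach ($dir as $file) {
        // Skips "." and ".." (and any other name of one or two characters).
        if (strlen($file) > 2) {
            Scan($file);
        }
    }

    // Print the file name as a token if enough of its regex lines match.
    function Scan($file) {
        global $message;
        $reg = file('/etc/exim/control/content/'.$file);
        $count = 0;
        $trigger = intval($reg[0]);  // first line: number of matches required
        $reg[0] = '';                // blank it so it is not used as a regex
        foreach ($reg as $regline) {
            $regline = trim($regline);
            // Each non-empty line is one regex; "/" is the delimiter, so a
            // pattern must not contain an unescaped "/".
            if (($regline) and (preg_match('/'.$regline.'/i', $message))) {
                $count++;
            }
        }
        if (($trigger > 0) and ($count >= $trigger)) {
            echo "$file ";
        }
    }
    ?>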
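For anyone who would rather not use PHP, the following is a rough Python port of the same idea, offered as a sketch rather than a drop-in replacement. The function name `extract_tokens` is made up for the example. It differs from the PHP in a few deliberate ways that are assumptions on my part: it skips anything that is not a regular file instead of testing the name length, it skips files whose first line is not a number, and it silently ignores lines that are not valid regular expressions. Python's `re` syntax is also not identical to PCRE, so some patterns may need adjusting.

```python
import re
import sys
from pathlib import Path

def extract_tokens(content_dir, message):
    """Return the names of all concept files whose match threshold is met."""
    text = message.lower()
    tokens = []
    # Sorted so the tokens come out in alphabetical order by file name.
    for path in sorted(Path(content_dir).iterdir()):
        if not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        if not lines:
            continue
        try:
            threshold = int(lines[0].strip())
        except ValueError:
            continue  # no usable threshold on the first line
        count = 0
        for raw in lines[1:]:
            pattern = raw.strip()
            if not pattern:
                continue
            try:
                if re.search(pattern, text, re.IGNORECASE):
                    count += 1
            except re.error:
                # Assumption: skip lines that are not valid regexes.
                continue
        if threshold > 0 and count >= threshold:
            tokens.append(path.name)
    return tokens

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 concepts.py <content-dir> < message
    print(" ".join(extract_tokens(sys.argv[1], sys.stdin.read())))
```

Run against a directory of concept files, this prints the same kind of space-separated token string as the result lines shown earlier.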
You could write combination rules or feed these tokens into Bayes to make it self-scoring. I'm throwing them into the AI system I developed a few months ago, which does all the combining and scoring for me. But I think Bayes should have a similar effect.
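As a rough illustration of what hand-written combination rules could look like, the Python sketch below scores a token string against a small rule table. The rule names, the token combinations, and the scores are invented for the example; they are not taken from the scoring system mentioned above or from SA. Only the token names themselves come from the sample result strings.

```python
# Hypothetical rules: (name, tokens that must all be present, score).
RULES = [
    ("419-scam", {"investor", "lotsofmoney", "trust"}, 5.0),
    ("pill-spam", {"drugs", "click", "optout"}, 3.0),
]

def score_tokens(token_string, rules=RULES):
    """Return (total score, names of rules that fired) for a token string."""
    present = set(token_string.split())
    # A rule fires only when every one of its tokens is present.
    fired = [(name, score) for name, required, score in rules
             if required <= present]
    return sum(score for _, score in fired), [name for name, _ in fired]

if __name__ == "__main__":
    tokens = ("cialis click css deals drugs email-adr html http "
              "maillist mailto optout phone-num regards")
    print(score_tokens(tokens))
```

Each token on its own adds nothing to the score; only the combination does, which is the point of keeping the individual characteristics harmless.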
Just sharing this in case anyone finds it useful.