From Computer Tyme Support Wiki

Revision as of 21:14, 8 September 2007 by Marc

Creating White/Yellow/Black DNS lists for email systems in the fight against spam.

Free DNS host karma listing servers that tell the world which mail servers are sending spam, nonspam, or a mix of spam and nonspam. This is a service of Junk Email Filter dot com, and one of many technologies used in advanced email filtering.

1 Using these Lists is Almost Free
2 How to use the Lists
3 Name Based DNS Lookup
4 Overview of the Lists
5 What Kinds of Spam Does this list Work With?
6 How the System Works
7 The Magic is in the White Lists
8 Problems this Service Solves
9 Can Spammers Out Smart This System?
10 This Service is under Development
11 The Future of the Concept - The Big Picture
12 Privacy Issues
13 Joining In - Helping Development
14 Feeding Us Data

Using these Lists is Almost Free

Unless you really load our servers and use a lot of bandwidth, use of these lists is almost free. The price of using the lists is that you have to post a link on your web site thanking us for the use of the lists and linking to http://www.junkemailfilter.com. Your link is your license fee.

How to use the Lists

Junk Email Filter dot com provides 2 public lists: a black list to block spam, and a white list to either pass nonspam or to keep sites from being blocked. Blocking is done by IP address, which is something spammers can't spoof. We look at email hosts as being one of three kinds: hosts that generate only spam, which we blacklist; hosts that generate only nonspam, which we whitelist; and hosts that generate a mix, which we yellow list.

Our blacklist server is hostkarma.junkemailfilter.com and the result for a blacklisted host is 127.0.0.2. If the IP is listed here you can bounce it without further checking.

Our whitelist server is also hostkarma.junkemailfilter.com, and it returns two different results. If the server returns 127.0.0.1 then the host is whitelisted and you can accept the email without any further checking. If the result is 127.0.0.3 then the host is yellow listed. Yellow listing means that the host generates some nonspam, so it should never be blacklisted, and other IP based blacklists should be bypassed to prevent false positives.

127.0.0.1 - whitelist - trusted nonspam
127.0.0.2 - blacklist - block spam
127.0.0.3 - yellowlist - mix of spam and nonspam
127.0.0.4 - brownlist - all spam - but not yet enough to blacklist
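
As a rough sketch of what a client does with these codes, here is a small Python example. It assumes the list follows the usual DNSBL convention of reversing the IPv4 octets in front of the zone name, which this page does not spell out. The function names are made up for illustration.

```python
import ipaddress
import socket

ZONE = "hostkarma.junkemailfilter.com"

# Return codes as listed on this page.
HOSTKARMA_CODES = {
    "127.0.0.1": "white",
    "127.0.0.2": "black",
    "127.0.0.3": "yellow",
    "127.0.0.4": "brown",
}


def query_name(ip, zone=ZONE):
    """Build the DNS name to query for an IPv4 address.

    Assumption: reversed-octet naming, as used by most DNS lists.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def classify(answers):
    """Map returned A records to list names, ignoring unknown codes."""
    return sorted({HOSTKARMA_CODES[a] for a in answers if a in HOSTKARMA_CODES})


def lookup(ip, zone=ZONE):
    """Query the list over DNS; an empty result means the IP is not listed."""
    try:
        _, _, answers = socket.gethostbyname_ex(query_name(ip, zone))
    except socket.gaierror:
        return []  # no such name: not listed
    return classify(answers)
```

Calling lookup() with the connecting IP gives back a list such as ['yellow'], or an empty list when the host is unknown.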

List Logic

The best way to use the lists is to check them in a specific order. First check the white list. If the host is white, accept the message without further processing. Then see if the host is yellow. If so, skip all your blacklist tests. Then check your blacklists, and if the host is listed, bounce the message. Whatever email is left is then tested with all your other testing methods, like Spam Assassin.
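
The order matters, so here is the same logic as a minimal Python sketch. The function name, the argument names, and the returned action strings are invented for illustration. "content-filter" stands for handing the message to Spam Assassin or whatever other tests you run.

```python
def decide(karma, on_other_blacklist=False):
    """Apply the list logic in the recommended order.

    karma: set of hostkarma list names the host is on,
           e.g. {"white"}, {"yellow"}, {"black"} or set().
    on_other_blacklist: True if any other IP blacklist lists the host.
    """
    if "white" in karma:
        return "accept"            # trusted nonspam, no further checks
    if "yellow" in karma:
        return "content-filter"    # skip every blacklist test
    if "black" in karma or on_other_blacklist:
        return "reject"
    return "content-filter"        # unknown host: run the other tests
```

Note that a yellow listed host goes to content filtering even when another blacklist lists it. That is the whole point of the yellow list.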

Exim Examples

Exim is an extremely powerful MTA, probably the most powerful MTA on the planet. That's why I like it so much. I want to do what I want to do and Exim allows me to do it.

# Mark it White 
warn dnslists = hostkarma.junkemailfilter.com=127.0.0.1
     set acl_c1 = white - dnswl - $sender_fullhost

# Mark it Yellow 
warn dnslists = hostkarma.junkemailfilter.com=127.0.0.3
     set acl_c1 = yellow - $sender_fullhost

# Using the Black List
deny dnslists = hostkarma.junkemailfilter.com=127.0.0.2

# Other Blacklists
deny !dnslists = hostkarma.junkemailfilter.com=127.0.0.1,127.0.0.3
     dnslists = zen.spamhaus.org/<;$sender_host_address;$sender_address_domain :\
     nomail.rhsbl.sorbs.net/$sender_address_domain : cbl.abuseat.org :\
     list.dsbl.org : web.dnsbl.sorbs.net : socks.dnsbl.sorbs.net :\
     http.dnsbl.sorbs.net

Spam Assassin Examples

Spam Assassin can access the white and black lists for scoring.

header __RCVD_IN_JMF eval:check_rbl('JMF-lastexternal','hostkarma.junkemailfilter.com.')
describe __RCVD_IN_JMF Sender listed in JunkEmailFilter
tflags __RCVD_IN_JMF net
 
header RCVD_IN_JMF_W eval:check_rbl_sub('JMF-lastexternal', '127.0.0.1')
describe RCVD_IN_JMF_W Sender listed in JMF-WHITE
tflags RCVD_IN_JMF_W net nice
score RCVD_IN_JMF_W -5
 
header RCVD_IN_JMF_BL eval:check_rbl_sub('JMF-lastexternal', '127.0.0.2')
describe RCVD_IN_JMF_BL Sender listed in JMF-BLACK
tflags RCVD_IN_JMF_BL net
score RCVD_IN_JMF_BL 3.0
 
header RCVD_IN_JMF_BR eval:check_rbl_sub('JMF-lastexternal', '127.0.0.4')
describe RCVD_IN_JMF_BR Sender listed in JMF-BROWN
tflags RCVD_IN_JMF_BR net
score RCVD_IN_JMF_BR 1.0

Name Based DNS Lookup

The hostkarma DNS list supports name based lookups as well as IP based lookups.

<hostname>.hostkarma.junkemailfilter.com

127.0.0.1 = whitelisted
127.0.0.2 = blacklisted
127.0.0.3 = yellowlisted

Example: dig hermes.apache.org.hostkarma.junkemailfilter.com

Examples using Exim:

accept	dnslists = hostkarma.junkemailfilter.com=127.0.0.1/$sender_host_name
drop	dnslists = hostkarma.junkemailfilter.com=127.0.0.2/$sender_host_name

Overview of the Lists

Unfortunately these lists are not the only solution to spam. But they are designed to be a front end to your spam filtering process, allowing you to identify much of your incoming email with great accuracy. These lists have two purposes. One is to catch some spam, but more importantly they are used to identify nonspam and to prevent mixed hosts from being blacklisted accidentally by our lists and others. One of the problems with spam filtering is that legitimate senders fail to get their email through because it is miscategorized as spam. These lists help prevent that from happening.

Most spam filtering technology is based on identifying spam, and whatever is left is nonspam. Our method actively identifies nonspam as well as spam. By actively identifying nonspam it reduces false positives and shrinks the number of messages that you have to work hard to identify with tools like Spam Assassin. These tools are processor intensive and require a lot of rules that do very well, but sometimes make mistakes.

What Kinds of Spam Does this list Work With?

The black list catches spam-only servers. Generally these include virus infected computers that are being used as spam servers. The list is generated from honeypot accounts and from spammer behavior, where spam is caught by doing things that only spammers do. This list isn't the best list in the world for catching spam, but its real strength is catching nonspam.

The real power here is in the white lists. Those who are used to spam filtering need to think differently about spam processing in order to really get the idea. You have to understand that we are not just looking for spam. This list is built to catch nonspam. Nonspam is actually easier in some ways because the nonspam servers aren't doing any tricks to hide. They consistently send out good mail. All we do is track that, and once the server establishes a clean reputation we bless it.

How the System Works

Telling all my tricks would be too long. But central to the system is tracking hosts by collecting data by IP address and doing an analysis on the information to determine the karma of the host.

The idea is that multiple trusted servers feed data to a database that tracks IP addresses and counts the number of spams/nonspams sent by these hosts. A spam increments the spam counter. A nonspam increments the nonspam counter. As the counts go up the servers develop a reputation. Those who spew only spam make the blacklist. Those who spew only good email make the white list. And those who spew a mix make the yellow list.
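
To make the counting idea concrete, here is a minimal Python sketch. The thresholds (min_reports and tolerance) are invented for illustration and are not the values the live system uses. The tolerance is there because a few mistaken spam reports, as in the bank example later on this page, should not keep a good host off the white list.

```python
def karma(spam, nonspam, min_reports=100, tolerance=0.001):
    """Classify a host from its spam and nonspam counters.

    min_reports and tolerance are hypothetical tuning values.
    Returns "white", "yellow", "black", "brown", or None if no data.
    """
    total = spam + nonspam
    if total == 0:
        return None                      # never seen this host
    if nonspam == 0:
        # All spam: blacklist once there is enough evidence,
        # otherwise brownlist (all spam, but not yet enough).
        return "black" if spam >= min_reports else "brown"
    if total >= min_reports and spam / total <= tolerance:
        return "white"                   # consistently good email
    return "yellow"                      # some good email seen


# The bank example: 100,000 good emails and 20 mistaken spam reports.
print(karma(20, 100_000))
```

With these made-up thresholds the bank lands on the white list, a host with a 50/50 mix lands on the yellow list, and a host with a handful of spam reports and nothing else is only brownlisted.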

Other technology is also used. A honeypot can blacklist a virus infected server instantly, giving the system a very fast response time to new spam servers. The system can also track good servers over a long time, counting good email and establishing a reputation. Much of the blacklist data comes from using fake low and high MX records. When a host hits only the fake high numbered MX records without hitting the low numbered MX records, the host is a virus infected spam zombie.

White and yellow listing are also done using a table of domain names that are known to send only good email or are known to send mixed email (yahoo, hotmail). The RDNS is looked up, the host name is verified to see that it matches the name returned, and if the name ends in a domain that is on our list then we add the IP address to our white or yellow lists.

We are always looking to expand our white and yellow lists, so if you send email and your server sends only good email and you want to be on our lists, email me at marc@perkel.com with your host name information.
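
The fake MX test can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the real detection code. It assumes you log the MX preference number of every record a host connected to, and that you know which preference numbers belong to your fake high MX records.

```python
def is_spam_zombie(prefs_hit, fake_high_prefs):
    """Return True if a host only ever hit the fake high numbered MX records.

    prefs_hit: MX preference numbers the host connected to.
    fake_high_prefs: preference numbers of the fake high MX records.
    """
    hits = set(prefs_hit)
    # No connections at all tells us nothing about the host.
    return bool(hits) and hits <= set(fake_high_prefs)
```

A normal mail server tries the low numbered MX first, so its hits include at least one lower preference number and the test returns False.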
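
Here is a hedged Python sketch of that domain table check. It reads "verified" as a forward lookup that must return the original IP address, which is my interpretation for the example. The white domain is a placeholder, and the function names are made up. The resolver functions can be passed in so the logic can be tried without touching DNS.

```python
import socket

# Placeholder tables; the real lists are not published on this page.
WHITE_DOMAINS = {"example-bank.com"}
YELLOW_DOMAINS = {"yahoo.com", "hotmail.com"}


def ends_in(hostname, domains):
    """True if hostname is one of the domains or a subdomain of one."""
    host = hostname.rstrip(".").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def karma_from_rdns(ip, reverse=None, forward=None):
    """Return "white", "yellow", or None for an IP based on its host name."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        name = reverse(ip)
        if ip not in forward(name):
            return None  # name does not point back at this IP
    except OSError:      # covers socket.herror and socket.gaierror
        return None
    if ends_in(name, WHITE_DOMAINS):
        return "white"
    if ends_in(name, YELLOW_DOMAINS):
        return "yellow"
    return None
```

The suffix match requires a dot before the domain, so a host named evilyahoo.com does not ride on yahoo.com's reputation.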

The Magic is in the White Lists

Think differently. It's not just about blocking spam - it's about accepting good email. The real power in this system is the white and yellow lists, not the black list. Envision this. A bank that sends nothing but good email is communicating with tens of thousands of customers on a regular basis. Its email goes to thousands of servers that host the customers' email. So let's say that 30 of these servers are feeding data to the database. After a few months the IP address of the bank's server has 100,000 good emails recorded and say 20 spams (some people will accidentally report good email as spam). Thus the bank can be whitelisted. Why bother to check email from a host like that for spam?

And it's not just banks. It's all institutions that send only good email. No one has to pay a fee to get listed. It's a karma system. Your good reputation gives you a fast pass through the filter.

Some servers send a mixture of spam and nonspam. Examples are AOL, Yahoo, Hotmail, and Comcast: ISPs and other companies that sell email services. They try to get rid of spam, but some people exploit them anyway. These are the servers that make the yellow list. The messages still need to be spam tested, but because these servers have a reputation of sending some good email they can at least bypass blacklisting. Thus, if a Comcast customer starts spamming through Comcast servers and Comcast doesn't detect it, this system will at least keep the Comcast server from being blacklisted, so other Comcast customers don't have their email blocked.

Problems this Service Solves

One famous controversy over spam filtering is the battle between AOL/Goodmail and the Electronic Frontier Foundation. In this case both sides are wrong, with EFF being a little more wrong than AOL. The Goodmail/AOL relationship is based on the idea that Goodmail certifies email as good and AOL accepts it as good email. But there's money involved, and because of this EFF has accused AOL of trying to turn email into a paid service. Unfortunately EFF can't get beyond listening to themselves echo their own opinion to understand that the concepts behind AOL/Goodmail are at least partially sound. The idea is to get the good email through.

This system eliminates the need for AOL/Goodmail's system in that it automatically tracks good email from all servers and makes their karma available to the world. So rather than having to pay to get a reputation as a trusted server all you have to do is consistently send good email and when the world sees that then you get whitelisted. Problem solved.

Can Spammers Out Smart This System?

The short answer is yes - probably some can. However it represents yet another significant hurdle for them to cross. In reality this system will block mostly easy to detect spam sources. But that's not where the power lies. It doesn't matter if spammers outsmart this system. What this system does is protect good email from being falsely identified as spam and blocked. This isn't a spam filter as much as a ham filter. The power is in identifying good email.

To block spam you would just use this as a front end to your system to preclassify the easy spam/ham and then pass the rest on to meaner tests. A spammer might be able to fake their way from being blacklisted to yellowlisted. But not all the way to whitelisted.

This Service is under Development

What kind of accuracy can you expect using these lists? At the moment the black list isn't as accurate as it should be, in part because we need more data. For example, we are located in the US and we get a lot of spam from outside the US. Some of the servers send us only spam, but if we were in that country then we would see nonspam as well. Thus our limited data can create false positives.

The power however is in the white lists and they should be more accurate. These lists can be used to bypass spam filtering for nonspam and increase accuracy and decrease load. And the yellow lists can be used to avoid false positives in other black lists.

The Future of the Concept - The Big Picture

This system can make a huge difference in the accuracy of spam detection for the entire planet. If it were scaled up, every email server on the planet could access these lists and eliminate some 50% of all spam and identify some 95% of all nonspam with 100% accuracy and extremely little effort. Doing it right would take several major partners getting involved and better programmers than me.

Here's what it would take:

You would have a central (replicated) MySQL cluster that is big and hardened and secret and immune from DOS attacks. This is where the data for the lists is kept. If done right it might run on less than $10,000 worth of hardware.

In front of this would be a number of MyDNS servers and caching front end servers that connect to the databases on the back end and provide a front end for email servers all over the world to access. It would also take some smart people and many servers running Spam Assassin to check the quality of the lists and verify that the lists remain accurate. And it might take a few people to watch over it to make sure there aren't any problems, and some programmers to adapt to spammers who will always try to beat the system.

This isn't going to solve the spam problem. But if done right it will significantly reduce the false positive problem allowing for far greater front end accuracy. This will greatly reduce system load and make the remaining email easier to process.

Privacy Issues

This system is totally privacy friendly. It does not require any kind of personal information or the sharing of message content or header content. It merely keeps totals of the karma of the IP of the sending host. So personal liberty is preserved. This system is liberty friendly and helps ensure the delivery of the email you want to send and the email you want to get.

Joining In - Helping Development

I need some help with this. If you are the person in charge of a large email system, preferably running Exim and Spam Assassin, and you are technically sharp, I can use some help making this system better through testing and development. It is also a system where the more data I have, the more accurate and comprehensive the lists will be. So if you like what you are reading here then join in and let's make it happen.

First - use the lists. Add the above code to your ACLs and set it up to use the white, yellow, and black lists. Once you are comfortable with that then contact me and I'll set you up with access so that you can submit your data to the counters so that I can incorporate your information into the system.

The data you send will not violate anyone's privacy. I just need the IP address of the server and if it is spam or nonspam. The code is fairly simple.

My email address is marc@perkel.com and my spam filtering is so good that I don't have to hide it.

Feeding Us Data

We are looking for some good data feeds to help expand the list and improve accuracy. Feeding data involves running a simple shell script that basically sends a string to a port to report that an IP has sent spam or ham. The script looks like this:

#!/bin/bash
# ip-report script - GPL by Marc Perkel - 2006
#
# Usage: ip-report message ip_address
# Examples: - ip-report spam 1.2.3.4
#             ip-report ham  5.6.7.8
#
# Email me for the host and port info
#
# Runs netcat to send a string to a port on a host

echo "$*" | nc -w 4 host port

The idea here is to just submit IPs that you have a very high confidence are spam or nonspam. These submissions go into a MySQL database and every 5 minutes the live lists are modified to reflect the statistical data. So alterations are live. A nonspam submission will instantly yellow list the IP and remove it from the black list.
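
If you would rather report from Python than from a shell script, something like the following sketch should send the same string the netcat line does. The host and port are still something you have to get from me by email, so they are plain parameters here. The function name ip_report is simply borrowed from the script above.

```python
import ipaddress
import socket


def ip_report(kind, ip, host, port, timeout=4.0):
    """Send "<kind> <ip>" plus a newline to the reporting host.

    kind: the feedback word, e.g. "spam" or "ham".
    host, port: reporting server details (not published on this page).
    """
    ipaddress.ip_address(ip)  # raises ValueError if ip is malformed
    line = f"{kind} {ip}\n".encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line)
```

The 4 second timeout matches the -w 4 option in the shell version, so a slow reporting server does not hold up mail processing for long.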

What I'm looking for are people who process a LOT of email, who are innovative, and who can get excited about this concept and feed a lot of good data. I'm also interested in feedback and ways to improve the system.

The system now supports 4 kinds of feedback.

spam     - messages that are almost certainly spam
ham      - messages that are almost certainly not spam
nonspam  - messages that you think are probably not spam  
lowspam  - messages that you think are probably spam

For those of you familiar with Spam Assassin, spam would be a message scoring 15 points or more and ready for Bayesian autolearn. Lowspam would be a message from 5 to 15 points. Nonspam would be in the range of -2 to 5 points. And ham would be below -2 points and ready to autolearn.
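
Those ranges translate into a small lookup. This Python sketch is only illustrative: the function names are made up, and which side of each boundary a score falls on is my choice, since the ranges above overlap at 5 and 15. The second helper handles Exim's $spam_score_int, which is the score multiplied by 10.

```python
def feedback_kind(score):
    """Map a Spam Assassin score to one of the four feedback words."""
    if score >= 15:
        return "spam"      # almost certainly spam
    if score >= 5:
        return "lowspam"   # probably spam
    if score >= -2:
        return "nonspam"   # probably not spam
    return "ham"           # almost certainly not spam


def feedback_kind_from_int(spam_score_int):
    """Same mapping for Exim's $spam_score_int (score times 10)."""
    return feedback_kind(spam_score_int / 10)
```

For example, a message scoring 7.5 would be reported as lowspam, and an Exim $spam_score_int of -25 would be reported as ham.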

A sample Exim ACL to report spam might look like this: (untested)

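# Run Spam Assassin on every message. The ":true" suffix makes the
# condition succeed even when the score is below the spam threshold,
# so the score is always saved.
warn spam = nobody:true
     set acl_c5 = $spam_score_int

# Report spam: score above 15.0
warn condition = ${if >{$acl_c5}{150}{yes}{no}}
     condition = ${run {/etc/exim/ip-report spam $sender_host_address}{yes}{yes}}

# Report ham: score below -2.0
warn condition = ${if <{$acl_c5}{-20}{yes}{no}}
     condition = ${run {/etc/exim/ip-report ham $sender_host_address}{yes}{yes}}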

Note that $spam_score_int is 10 times what the spam score is.
