Care and feeding of your Bayes

From Scalix Wiki
Revision as of 00:08, 31 July 2006 by Leigh (Talk | contribs)


SpamAssassin's Bayesian database needs a balanced supply of both spam and ham in order to function properly. By feeding in false positives as ham and false negatives as spam, we can keep the Bayes database up to date. SpamAssassin also provides a facility to report spam to various anti-spam services such as Razor, Pyzor and SpamCop. Using the mboxadmin facility of Scalix, we can automate this task quite easily.

However, we need to be careful about what we feed into the Bayes database. We can't always trust our users to put spam into the right folders, and we can't expect them to hand-feed ham into it. Many people use a public folder for their spam. This allows everyone to dump their false negatives into a single folder, which is then fed into the Bayes database automatically. Unfortunately, this doesn't allow for feeding it ham as well, and Bayes needs a balanced diet. The other problem with public folders is that they are just that: public. We can't expect users to place ham into a public folder for all to see.

Here is a method for ensuring your Bayes database gets a proper balanced diet, in which only spam gets fed in as spam and only ham gets fed in as ham.
Firstly, create an account which has mboxadmin privileges.
Set up two cron jobs on your server. Run this script every hour:

#!/usr/local/bin/perl
use strict;
use warnings;
use Mail::IMAPClient;
my $host="your_mail_server_ip";
my $username="mboxadmin_user_name";
my $password="mboxadmin_password";
my @real_users=`/opt/scalix/bin/omshowu -m all -i`;	# get all real user names.
foreach my $punter (@real_users)			# Loop over them all.
{
	chomp $punter;					# Remove trailing newline.
	print "$punter\n";				# Some output. Feel free to remove.
	my $user="mboxadmin:$username:$punter";		# Set up superuser login.
	my $imap = Mail::IMAPClient->new( 'Server' => $host , 'User' => $user , 'Password' => $password ) or next;	# Connect to server.
	my @folders=$imap->folders;			# list folders.
	foreach  my $folder (@folders)			# Look through each of them.
	{
                if (lc($folder) eq "junk e-mail")                                                                       # "Junk E-mail" folder (matched case-insensitively).
                {
                        print "Found a spam folder: $folder\n";
                       $imap->select($folder) or next;                                                                  # Select the folder.
                        print "Folder $folder selected.\n";
                        my @list=$imap->messages or next;                                                              # List all messages in folder.
                        print scalar(@list)." messages in folder.\n";
                        foreach my $msg (reverse(@list))                                                                # Loop over them all.
                        {
                                my $email=$imap->message_string($msg) or next;                                          # Fetch the whole message as one string.
                                open (SALEARN,"|/usr/bin/spamassassin -d | /usr/bin/sa-learn --spam") or print "$!\n";  # Feed to sa-learn.
                                print SALEARN $email;
                                close SALEARN;
                                open (REPORT,"|/usr/bin/spamassassin -d | /usr/bin/spamassassin -r") or print "$!\n";   # Report it. (SpamCop and Pyzor).
                                print REPORT $email;
                                close REPORT;
                                $imap->delete_message($msg) or next;                                                    # Delete it.
                        }
                        $imap->expunge($folder) or next;                                                                #Expunge folder.
                }
	}
	$imap->logout;					# Close this user's connection before moving on.
}





And this one every week:

#!/usr/bin/perl
use strict;
use warnings;
use Mail::IMAPClient;
my $host="your_server_ip_address";
my $username="mboxadmin_user_name";
my $password="mboxadmin_password";
my @real_users=`/opt/scalix/bin/omshowu -m all -i`;	# get all real user names.
foreach my $punter (@real_users)			# Loop over them all.
{
	chomp $punter;					# Remove trailing newline.
	print "$punter\n";				# Some output. Feel free to remove.
	my $user="mboxadmin:$username:$punter";		# Set up superuser login.
	my $imap = Mail::IMAPClient->new( 'Server' => $host , 'User' => $user , 'Password' => $password ) or next;	# Connect to server.
	my @folders=$imap->folders;			# list folders.
	foreach  my $folder (@folders)			# Look through each of them.
	{
		if (lc($folder) eq "inbox")		# "Inbox" is guaranteed to only have ham in it.
		{
			print "Inbox found.\n";		# Some debug output.
			$imap->select($folder) or next;	# Select folder.
			print "Folder $folder selected.\n";
			my @list=$imap->seen or next;	# Get only messages which have been read. This avoids learning unread false negatives as ham, and stops us interfering with people's mail.
			print scalar(@list)." read messages in folder.\n";
			my $counter=0;			# Initialise counter. - we don't want the entire inbox.
			foreach my $msg (reverse(@list))	# Loop over each message, newest first.
			{
				my $email=$imap->message_string($msg) or next;	# Fetch the whole message as one string.
				open (SALEARN,"|/usr/bin/spamassassin -d | /usr/bin/sa-learn --ham") or next;		# Feed it to sa-learn.
				print SALEARN $email;
				close SALEARN;
				$counter +=1;		# Increment counter.
				last if ($counter>=100); # We only want 100 messages.
			}
		}
		elsif (lc($folder) eq "possible spam") 									# "Possible Spam" folder.
		{
			print "Found a spam folder: $folder\n";
                       $imap->select($folder) or next;									# Select the folder.
                        print "Folder $folder selected.\n";
			my $lastweek=time()-604800;									# Get timestamp for this time last week.
			my @list = $imap->before($lastweek) or next; 							# List all messages older than that.
                        print scalar(@list)." messages in folder.\n";
                        foreach my $msg (reverse(@list))								# Loop over them all.
                        {
				my $email=$imap->message_string($msg) or next;						# Fetch the whole message as one string.
				open (SALEARN,"|/usr/bin/spamassassin -d | /usr/bin/sa-learn --spam") or print "$!\n";	# Feed to sa-learn.
				print SALEARN $email;
				close SALEARN;
				open (REPORT,"|/usr/bin/spamassassin -d | /usr/bin/spamassassin -r") or print "$!\n";	# Report it. (SpamCop and Pyzor).
				print REPORT $email;
				close REPORT;
				$imap->delete_message($msg) or next;							# Delete it.
                        }
			$imap->expunge($folder) or next;								#Expunge folder.
		}
		elsif(lc($folder) eq "non-spam")
		{
                       $imap->select($folder) or next;                                                                  # Select the folder.
                        print "Folder $folder selected.\n";
                        my @list=$imap->messages or next;                                                              # List all messages in folder.
                        print scalar(@list)." messages in folder.\n";
                        foreach my $msg (reverse(@list))                                                                # Loop over them all.
                        {
                                my $email=$imap->message_string($msg) or next;                                          # Fetch the whole message as one string.
                                open (SALEARN,"|/usr/bin/spamassassin -d | /usr/bin/sa-learn --forget") or print "$!\n";# Make sa-learn forget this message if it has already been learnt.
                                print SALEARN $email;
                                close SALEARN or print "$!\n";
                                open (SALEARN,"|/usr/bin/spamassassin -d | /usr/bin/sa-learn --ham") or next;          # Feed to sa-learn as ham.
                                print SALEARN $email;
                                close SALEARN;
                        }
 
		}
                elsif (lc($folder) eq "spam")					                                      # "spam"  folder.
                {
                        print "Found a spam folder: $folder\n";
                       $imap->select($folder) or next;                                                                  # Select the folder.
                        print "Folder $folder selected.\n";
                        my $lastweek=time()-604800;                                                                     # Get timestamp for this time last week.
                        my @list = $imap->before($lastweek) or next;                                                    # List all messages older than that.
                        print scalar(@list)." messages in folder.\n";
                        foreach my $msg (reverse(@list))                                                                # Loop over them all.
                        {
                                my $subject=$imap->subject($msg);                                                       # Fetch subject for message.
                                $subject='' unless defined $subject;                                                    # Guard against a missing subject.
                                my $email=$imap->message_string($msg) or next;                                          # Fetch the whole message as one string.
                                unless ($subject=~m/\[SPAM\]/)                                                          # Skip learning if SpamAssassin already tagged it.
                                {
                                        print "Learning message with subject: $subject\n";
                                        open (SALEARN,"|/usr/bin/spamassassin -d | /usr/bin/sa-learn --spam") or print "$!\n";  # Feed to sa-learn.
                                        print SALEARN $email;
                                        close SALEARN;
                                }
                                open (REPORT,"|/usr/bin/spamassassin -d | /usr/bin/spamassassin -r") or print "$!\n";   # Report it. (SpamCop and Pyzor).
                                print REPORT $email;
                                close REPORT;
                                $imap->delete_message($msg) or next;                                                    # Delete it.
                        }
                        $imap->expunge($folder) or next;                                                                #Expunge folder.
                }
 
	}
	$imap->logout;					# Close this user's connection before moving on.
}


The first script, run every hour, checks each user's "Junk E-mail" folder. Each message it finds has its SpamAssassin headers removed and is fed to sa-learn as spam. It is then submitted to SpamAssassin's reporting facility to be reported to SpamCop, Pyzor, etc., and deleted from the folder.
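The listings print an error if a pipe cannot be opened but then delete the message regardless. A minimal sketch of a stricter helper follows, which pipes one message into a command and reports whether that command exited cleanly. The name pipe_message is hypothetical, and cat and false are only stand-ins so the sketch runs without SpamAssassin installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original scripts): pipe one message
# into a shell command and return 1 if the command exited with status 0.
sub pipe_message {
    my ($command, $message) = @_;
    local $SIG{PIPE} = 'IGNORE';               # A child that exits early must not kill us.
    open(my $fh, '|-', $command) or return 0;  # Could not start the command.
    print {$fh} $message;
    return close($fh) ? 1 : 0;                 # close() is false on a non-zero exit status.
}

# Stand-in commands; in the scripts these would be the sa-learn and
# spamassassin -r pipelines.
my $message = "Subject: test\n\nbody\n";
print "cat:   ", (pipe_message('cat >/dev/null', $message) ? "ok" : "failed"), "\n";
print "false: ", (pipe_message('false', $message)          ? "ok" : "failed"), "\n";
```

With a helper like this, the call to delete_message could be made conditional on both the learning and the reporting step returning true, so a message is only removed once it has actually been processed.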
For the second script to be effective, we need to set up some rules and educate our users a little.
To be as aggressive as possible with spam, but also as safe as possible, set up two folders for each user: "Spam" and "Possible Spam". Each user then needs two server-side rules: All mail marked as spam by spamassassin goes into the "spam" folder, and all mail not marked as spam, but with a score above 3, goes into the "possible spam" folder. Also create a "non-spam" folder for each user. This is where they are to place copies of legitimate email which gets incorrectly tagged as spam.
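The two rules are set up as server-side rules in the mail client rather than in Perl, but the decision they encode can be sketched as follows. This assumes the usual SpamAssassin X-Spam-Status header value of the form "Yes, score=7.2 required=5.0" (older releases wrote hits= instead of score=); the function names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Threshold from the text: not tagged as spam, but scoring above 3.
my $POSSIBLE_SPAM_SCORE = 3;

# Parse an X-Spam-Status value. Returns (is_spam, score), or an empty
# list if the value is missing or does not look like the assumed format.
sub parse_spam_status {
    my ($value) = @_;
    return unless defined $value
        && $value =~ /^\s*(Yes|No)\b.*?\b(?:score|hits)=(-?\d+(?:\.\d+)?)/is;
    return (lc($1) eq 'yes' ? 1 : 0, $2 + 0);
}

# Decide which folder a message would be filed in under the two rules.
sub target_folder {
    my ($value) = @_;
    my ($is_spam, $score) = parse_spam_status($value);
    return 'Inbox'         unless defined $score;   # Unscanned mail is left alone.
    return 'Spam'          if $is_spam;
    return 'Possible Spam' if $score > $POSSIBLE_SPAM_SCORE;
    return 'Inbox';
}

for my $status ('Yes, score=7.2 required=5.0',
                'No, score=3.4 required=5.0',
                'No, score=-1.2 required=5.0') {
    print "$status => ", target_folder($status), "\n";
}
```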
Our second script then does the following:
Each user's inbox is scanned. The newest 100 messages which have already been read are fed to sa-learn as ham. This assumes that nobody is going to read a piece of spam and then leave it in their inbox. If they do, they deserve to get more spam, quite frankly.
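Picking the newest 100 read messages depends on the order of the message IDs. A small sketch, assuming the IDs come back in ascending order with the oldest first (newest_first is a hypothetical name, and the range 1..250 stands in for the result of $imap->seen):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return at most $limit message IDs, newest first.
# Assumes @$ids is in ascending order, oldest message first.
sub newest_first {
    my ($ids, $limit) = @_;
    my @newest = reverse @$ids;
    splice(@newest, $limit) if @newest > $limit;   # Drop everything past the limit.
    return @newest;
}

my @seen  = (1 .. 250);                            # Stand-in for $imap->seen.
my @batch = newest_first(\@seen, 100);
print scalar(@batch), " messages selected, from $batch[0] down to $batch[-1]\n";
```

Taking the slice before the loop keeps the limit independent of any message that fails to fetch or learn.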
Each user's "possible spam" folder is also read. Messages which are older than a week are fed to sa-learn as spam and then deleted. There is little point in reporting spam that is already a week old, although the script as listed still passes these messages to spamassassin -r; remove the REPORT lines from that branch if you want to skip it. If a user does not check their possible spam folder each week, they risk losing mail. This gives them the incentive to keep an eye on it.
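The one-week cutoff is simply the current time minus 604800 seconds. The sketch below shows that arithmetic and formats the result as an IMAP date (dd-Mon-yyyy, in UTC here for simplicity). The helper names are hypothetical; the script itself passes the raw epoch value to $imap->before. Note that the IMAP BEFORE search key compares dates only and disregards the time of day, so the cutoff is effectively rounded to a whole day.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $WEEK   = 7 * 24 * 60 * 60;    # 604800 seconds, the constant used in the script.
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Epoch timestamp for exactly one week before $now.
sub week_ago {
    my ($now) = @_;
    return $now - $WEEK;
}

# Format an epoch timestamp as an IMAP date string (dd-Mon-yyyy), in UTC.
sub imap_date {
    my ($epoch) = @_;
    my ($day, $month, $year) = (gmtime($epoch))[3, 4, 5];
    return sprintf('%02d-%s-%04d', $day, $MONTHS[$month], $year + 1900);
}

my $cutoff = week_ago(time());
print "Messages received before ", imap_date($cutoff), " would be learnt as spam.\n";
```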
The "non-spam" folder is also checked. Anything in here is fed to sa-learn as well. First, sa-learn is told to un-learn this message, in case auto-learn has already classified it as spam, and then it is learnt as ham.
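The order of the two steps matters: the message is forgotten first and only then learnt as ham. A sketch of that sequence, building the same pipelines the script uses (the paths are the ones from the listing; the function names are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sa-learn options used by the scripts, keyed by what we want to do.
my %LEARN_FLAG = (spam => '--spam', ham => '--ham', forget => '--forget');

# Build the shell pipeline for one learning mode: strip SpamAssassin
# markup with "spamassassin -d", then hand the message to sa-learn.
sub learn_pipeline {
    my ($mode) = @_;
    my $flag = $LEARN_FLAG{$mode} or die "unknown sa-learn mode: $mode\n";
    return "/usr/bin/spamassassin -d | /usr/bin/sa-learn $flag";
}

# Pipelines for correcting a false positive, in the order they must run.
sub relearn_as_ham_pipelines {
    return map { learn_pipeline($_) } qw(forget ham);
}

print "$_\n" for relearn_as_ham_pipelines();
```

Each pipeline would be opened as a pipe and fed the same message text, one after the other.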
The "spam" folder is then checked. It is handled in almost the same way as the "possible spam" folder, except that anything whose subject already carries SpamAssassin's [SPAM] tag is not fed to sa-learn again; it is only reported and deleted. This gives people the option of placing spam in the "spam" folder themselves, which is a little more intuitive for them. Also, those not running Outlook 2K3 or later may not have a "Junk E-mail" folder.
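In the script as listed, the [SPAM] check guards only the sa-learn step; every message older than a week is still reported and deleted. A sketch of that decision follows, with hypothetical function names and assuming SpamAssassin is configured to put [SPAM] in the subject of tagged mail:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the subject carries the [SPAM] tag. A missing subject counts
# as untagged.
sub already_tagged {
    my ($subject) = @_;
    return (defined $subject && $subject =~ /\[SPAM\]/) ? 1 : 0;
}

# List the actions the weekly script takes for one message in the
# "spam" folder, in order.
sub spam_folder_actions {
    my ($subject) = @_;
    my @actions;
    push @actions, 'learn' unless already_tagged($subject);  # Hand-filed spam only.
    push @actions, 'report', 'delete';                       # Applies to every message.
    return @actions;
}

for my $subject ('[SPAM] Cheap pills', 'Cheap pills') {
    print "$subject: ", join(', ', spam_folder_actions($subject)), "\n";
}
```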