Care and feeding of your Bayes

From Scalix Wiki

Revision as of 07:21, 31 July 2006

SpamAssassin's Bayesian database needs a balanced supply of both spam and ham in order to function properly.
By feeding in false positives as ham, and feeding false negatives as spam, we can keep the bayes database up to date.
SpamAssassin also provides a facility to report spam to various anti-spam sites such as Razor, Pyzor and SpamCop.
Using the mboxadmin facility of Scalix, we can automate this task quite easily.
However, we need to be careful about what we feed into the bayes. We can't always trust our users to put spam into the right folders, and we can't expect them to hand-feed ham into our bayes. Many people use a public folder for their spam. This allows everyone to dump their false negatives into a single folder, which is then fed into the bayes automatically. Unfortunately, this doesn't allow for feeding it ham as well, and bayes needs a balanced diet. The other problem with public folders is that they are just that: public. We can't expect users to place ham into a public folder for all to see.
Here is a method for ensuring your bayes gets fed a proper balanced diet, and only spam gets fed in as spam, and only ham gets fed in as ham.
Firstly, create an account which has mboxadmin privileges.
Set up two cron jobs on your server. Run this script every hour:

#!/usr/local/bin/perl
use strict;
use warnings;
use Mail::IMAPClient;
my $host="your_mail_server_ip";
my $username="mboxadmin_user_name";
my $password="mboxadmin_password";

# Pipe one message into a shell command. Returns true if the command ran cleanly.
sub feed
{
    my ($command, $message) = @_;
    open(my $pipe, '|-', $command) or do { print "$!\n"; return 0; };
    print $pipe $message;
    return close($pipe);
}

my @real_users=`/opt/scalix/bin/omshowu -m all -i`;    # Get all real user names.
foreach my $punter (@real_users)                        # Loop over them all.
{
    chomp $punter;                                      # Remove trailing newline.
    print "$punter\n";                                  # Some output. Feel free to remove.
    my $user="mboxadmin:$username:$punter";             # Set up superuser login.
    my $imap = Mail::IMAPClient->new( Server => $host, User => $user, Password => $password ) or next;    # Connect to server.
    my @folders=$imap->folders;                         # List folders.
    foreach my $folder (@folders)                       # Look through each of them.
    {
        if (lc($folder) eq "junk e-mail")               # "Junk E-mail" folder.
        {
            print "Found a spam folder: $folder\n";
            $imap->select($folder) or next;             # Select the folder.
            print "Folder $folder selected.\n";
            my @list=$imap->messages or next;           # List all messages in folder.
            print scalar(@list)." messages in folder.\n";
            foreach my $msg (reverse(@list))            # Loop over them all.
            {
                my $email=$imap->message_string($msg) or next;                          # Fetch the whole message.
                feed("/usr/bin/spamassassin -d | /usr/bin/sa-learn --spam", $email);    # Feed to sa-learn.
                feed("/usr/bin/spamassassin -d | /usr/bin/spamassassin -r", $email);    # Report it. (SpamCop and Pyzor).
                $imap->delete_message($msg) or next;    # Delete it.
            }
            $imap->expunge($folder) or next;            # Expunge folder.
        }
    }
    $imap->logout;                                      # Close this user's session.
}





And this one every week:

#!/usr/bin/perl
use strict;
use warnings;
use Mail::IMAPClient;
my $host="your_server_ip_address";
my $username="mboxadmin_user_name";
my $password="mboxadmin_password";

# Pipe one message into a shell command. Returns true if the command ran cleanly.
sub feed
{
    my ($command, $message) = @_;
    open(my $pipe, '|-', $command) or do { print "$!\n"; return 0; };
    print $pipe $message;
    return close($pipe);
}

my @real_users=`/opt/scalix/bin/omshowu -m all -i`;    # Get all real user names.
foreach my $punter (@real_users)                        # Loop over them all.
{
    chomp $punter;                                      # Remove trailing newline.
    print "$punter\n";                                  # Some output. Feel free to remove.
    my $user="mboxadmin:$username:$punter";             # Set up superuser login.
    my $imap = Mail::IMAPClient->new( Server => $host, User => $user, Password => $password ) or next;    # Connect to server.
    my @folders=$imap->folders;                         # List folders.
    foreach my $folder (@folders)                       # Look through each of them.
    {
        if (lc($folder) eq "inbox")                     # "Inbox" is guaranteed to only have ham in it.
        {
            print "Inbox found.\n";                     # Some debug output.
            $imap->select($folder) or next;             # Select folder.
            print "Folder $folder selected.\n";
            my @list=$imap->seen or next;               # Get only messages which have been read. Saves the possibility of reading in false positives. Also stops us interfering with people's mail.
            print scalar(@list)." messages in folder.\n";
            my $counter=0;                              # Initialise counter. - we don't want the entire inbox.
            foreach my $msg (reverse(@list))            # Loop over each message, most recently delivered first.
            {
                $counter +=1;                           # Increment counter.
                last if ($counter>100); # We only want 100 messages.
                my $email=$imap->message_string($msg) or next;                         # Fetch it.
                feed("/usr/bin/spamassassin -d | /usr/bin/sa-learn --ham", $email);    # Feed it to sa-learn.
            }
        }
        elsif (lc($folder) eq "possible spam")          # "Possible Spam" folder.
        {
            print "Found a spam folder: $folder\n";
            $imap->select($folder) or next;             # Select the folder.
            print "Folder $folder selected.\n";
            my $lastweek=time()-604800;                 # Get timestamp for this time last week.
            my @list=$imap->before($lastweek) or next;  # List all messages older than that.
            print scalar(@list)." messages in folder.\n";
            foreach my $msg (reverse(@list))            # Loop over them all.
            {
                my $email=$imap->message_string($msg) or next;                          # Fetch message.
                feed("/usr/bin/spamassassin -d | /usr/bin/sa-learn --spam", $email);    # Feed to sa-learn.
                feed("/usr/bin/spamassassin -d | /usr/bin/spamassassin -r", $email);    # Report it. (SpamCop and Pyzor).
                $imap->delete_message($msg) or next;    # Delete it.
            }
            $imap->expunge($folder) or next;            # Expunge folder.
        }
        elsif (lc($folder) eq "non-spam")               # "Non-spam" folder.
        {
            $imap->select($folder) or next;             # Select the folder.
            print "Folder $folder selected.\n";
            my @list=$imap->messages or next;           # List all messages in folder.
            print scalar(@list)." messages in folder.\n";
            foreach my $msg (reverse(@list))            # Loop over them all.
            {
                my $email=$imap->message_string($msg) or next;                            # Fetch message.
                feed("/usr/bin/spamassassin -d | /usr/bin/sa-learn --forget", $email);    # Sa-learn forget this message if already seen.
                feed("/usr/bin/spamassassin -d | /usr/bin/sa-learn --ham", $email);       # Feed to sa-learn as ham.
            }
        }
        elsif (lc($folder) eq "spam")                   # "Spam" folder.
        {
            print "Found a spam folder: $folder\n";
            $imap->select($folder) or next;             # Select the folder.
            print "Folder $folder selected.\n";
            my $lastweek=time()-604800;                 # Get timestamp for this time last week.
            my @list=$imap->before($lastweek) or next;  # List all messages older than that.
            print scalar(@list)." messages in folder.\n";
            foreach my $msg (reverse(@list))            # Loop over them all.
            {
                my $subject=$imap->subject($msg);       # Fetch subject for message.
                $subject="" unless defined $subject;    # Some messages have no subject at all.
                my $email=$imap->message_string($msg) or next;                              # Fetch message.
                unless ($subject=~m/\[SPAM\]/)          # Only learn messages spamassassin has not already tagged.
                {
                    print "Learning message with subject: $subject\n";
                    feed("/usr/bin/spamassassin -d | /usr/bin/sa-learn --spam", $email);    # Feed to sa-learn.
                }
                feed("/usr/bin/spamassassin -d | /usr/bin/spamassassin -r", $email);        # Report it. (SpamCop and Pyzor).
                $imap->delete_message($msg) or next;    # Delete it.
            }
            $imap->expunge($folder) or next;            # Expunge folder.
        }
    }
    $imap->logout;                                      # Close this user's session.
}


The first script, run every hour, checks each user's "Junk E-mail" folder. Each message it finds has its SpamAssassin headers removed and is fed to sa-learn as spam. It is then submitted to SpamAssassin's reporting facility to be reported to SpamCop, Pyzor, etc., and deleted from the folder.
For the second script to be effective, we need to set up some rules and educate our users a little.
To be as aggressive as possible with spam, but also as safe as possible, set up two folders for each user: "Spam" and "Possible Spam". Each user then needs two server-side rules: All mail marked as spam by spamassassin goes into the "spam" folder, and all mail not marked as spam, but with a score above 3, goes into the "possible spam" folder. Also create a "non-spam" folder for each user. This is where they are to place copies of legitimate email which gets incorrectly tagged as spam.
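The two server-side rules are set up in the mail client or server, not in Perl, but the decision they make can be sketched in a few lines. This is only an illustration: route_folder is a hypothetical helper, and it assumes the SpamAssassin 3.x layout of the X-Spam-Status header ("Yes" or "No" followed by score=N).

```perl
use strict;
use warnings;

# Hypothetical helper: choose a folder from the value of an X-Spam-Status header.
# Assumes values such as "No, score=3.4 required=5.0 ..." (SpamAssassin 3.x style).
sub route_folder
{
    my ($status) = @_;
    return "Inbox" unless defined $status;                       # Message was never scanned.
    return "Spam" if $status =~ /^\s*Yes\b/i;                    # Marked as spam by spamassassin.
    my ($score) = $status =~ /\bscore=(-?\d+(?:\.\d+)?)/;        # Pull out the numeric score.
    return "Possible Spam" if defined $score && $score > 3;      # Not marked, but scored above 3.
    return "Inbox";                                              # Everything else is left alone.
}

foreach my $status ("Yes, score=12.1 required=5.0", "No, score=3.4 required=5.0", "No, score=0.2 required=5.0")
{
    my $folder = route_folder($status);
    print "$status -> $folder\n";
}
```

The threshold of 3 comes from the text above; raise or lower it to suit how much mail you want landing in "Possible Spam".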
Our second script then does the following:
Each user's inbox is scanned. The newest 100 messages which have already been read are fed to sa-learn as ham. This assumes that nobody is going to read a piece of spam and then leave it in their inbox. If they do, they deserve to get more spam, quite frankly.
Each user's "possible spam" folder is also read. Messages which are older than a week are fed to sa-learn as spam and then deleted. The script as listed also reports them, although there is little point reporting spam that is already a week old; remove the reporting step from this branch if you prefer. If a user does not check their possible spam folder each week, they risk losing mail. This gives them the incentive to keep an eye on it.
The "non-spam" folder is also checked. Anything in here is fed to sa-learn as well. First, sa-learn is told to un-learn this message, in case auto-learn has already classified it as spam, and then it is learnt as ham.
The "spam" folder is then checked. It is handled in almost the same way as the "possible spam" folder, except that anything which has already been tagged as spam by SpamAssassin is not fed to sa-learn again. This gives people the option of placing spam in the "spam" folder, which is a little more intuitive for them. Also, those not running Outlook 2K3 or later may not have a "Junk E-mail" folder. To check whether a message has already been tagged as spam, this script looks at the message subject. If it contains "[SPAM]", the message is reported and deleted without being learnt. Feeding lots of messages into your bayes database which have this tag in the subject could do more harm than good. SpamAssassin may start to think that all spam has the tag "[SPAM]" in its subject, and down-grade any messages which don't.
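Two details from the walkthrough above, the one-week cutoff and the subject test, can be tried out on their own before pointing the script at real mailboxes. This is a minimal sketch: imap_date and already_tagged are hypothetical helpers, not part of Mail::IMAPClient, and the date is computed in UTC for simplicity.

```perl
use strict;
use warnings;

my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Hypothetical helper: epoch seconds -> IMAP date text (dd-Mon-yyyy, in UTC).
sub imap_date
{
    my ($epoch) = @_;
    my ($mday, $mon, $year) = (gmtime($epoch))[3, 4, 5];
    return sprintf("%02d-%s-%04d", $mday, $MONTHS[$mon], $year + 1900);
}

# Hypothetical helper: true if the subject already carries the "[SPAM]" tag.
# Like the script, the match is case-sensitive and may occur anywhere in the subject.
sub already_tagged
{
    my ($subject) = @_;
    return 0 unless defined $subject;
    return $subject =~ /\[SPAM\]/ ? 1 : 0;
}

my $one_week = 7 * 24 * 60 * 60;                 # 604800 seconds, the value used in the script.
my $cutoff   = time() - $one_week;
print "Messages dated before ", imap_date($cutoff), " would be processed.\n";

foreach my $subject ("[SPAM] Cheap watches", "Meeting at 3pm")
{
    my $action = already_tagged($subject) ? "report only: " : "learn and report: ";
    print "$action$subject\n";
}
```

An IMAP date search works on whole days rather than seconds, so a message may survive slightly longer than exactly 168 hours before it is picked up.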
One thing about mboxadmin is worth noting. As of Scalix V10, an mboxadmin user cannot access the mailbox of another mboxadmin user. This means you must not have any other mboxadmins on your system, or our scripts will not be able to read their mail.
Another thing worth considering is that at the moment Scalix do not recommend that users run Scalix Connect and IMAP against the same mailbox at the same time. Use your own discretion.
The 100-message limit on the inbox can be changed to suit your site by altering this line:

				last if ($counter>100); # We only want 100 messages.

Season to taste.
Please note: If the setup described above makes a mess of your Bayes, I will not be held responsible. Use this method at your own risk. Make sure you understand the requirements and use your own judgement. Back up your Bayes database first.
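For the backup, sa-learn has a --backup option which writes the database to standard output. The sketch below only builds and prints a dated backup command; backup_command is a hypothetical helper, /var/backups is an assumed directory, and the sa-learn path is the one used in the scripts above. Run it as the same user whose Bayes database the scripts feed.

```perl
use strict;
use warnings;

# Hypothetical helper: build a dated sa-learn backup command.
# $dir is an assumed backup directory; the date stamp is taken in UTC.
sub backup_command
{
    my ($dir, $epoch) = @_;
    my ($mday, $mon, $year) = (gmtime($epoch))[3, 4, 5];
    my $file = sprintf("%s/bayes-%04d%02d%02d.backup", $dir, $year + 1900, $mon + 1, $mday);
    return "/usr/bin/sa-learn --backup > $file";
}

my $command = backup_command("/var/backups", time());
print "$command\n";

# Uncomment to actually take the backup:
# system($command) == 0 or die "Backup failed: $?\n";
```

A file written this way can be loaded back with sa-learn --restore followed by the file name, should the database need to be rolled back.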