
Training spamassassin using sa-learn

Posted: Tue Jan 10, 2006 6:24 pm
by ah4279
We are just transitioning from a WinBloze-based mail server to FC4/Scalix and we are getting a lot of spam. Using the out-of-the-box spamassassin rules and scores we are catching ~60% of the spam messages.

I would like my users to save the spam messages so that I can eventually train spamassassin with sa-learn after we have built up a large set of spam.

My question is, how do I go about getting the spam and ham messages from a user's account into a format that spamassassin will train from?

Any suggestions?

Posted: Tue Jan 10, 2006 6:38 pm
by kali
Great question! And here was my solution:

Created a unix user "spam" (not a scalix user). Then sendmail will deliver mail sent to spam@localhost.localdomain to the spam mailbox (mbox format) in /var/spool/mail. sa-learn can read and learn from that. You have, I think, two choices on how to get mail into that box. First - you can "resend" (a bounce) the mail to that address. Or second - you can use a separate imap process and drag/drop the messages in. I do both - but it is hard to inform users how to bounce a message rather than forward. The message needs to be bounced in order for the headers to be correct for SA to learn from.
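
A minimal sketch of the learning step for this setup, assuming the installed sa-learn supports the --mbox option and that the mailbox lives at /var/spool/mail/spam as described above. The learn_mbox helper and the SA_LEARN override are hypothetical names used only for illustration.

```bash
#!/bin/bash
# Hypothetical helper: learn every message in an mbox file as spam or ham.
# SA_LEARN can be overridden (for example with "echo") to do a dry run.
SA_LEARN=${SA_LEARN:-sa-learn}

learn_mbox() {
    local class=$1 mbox=$2
    # Only the two classes sa-learn knows about are accepted.
    case $class in
        spam|ham) ;;
        *) echo "usage: learn_mbox spam|ham FILE" >&2; return 2 ;;
    esac
    [ -r "$mbox" ] || { echo "cannot read $mbox" >&2; return 1; }
    # --mbox tells sa-learn that the file holds many messages in mbox format.
    $SA_LEARN "--$class" --mbox "$mbox"
}
```

Called as `learn_mbox spam /var/spool/mail/spam`. Emptying the mailbox after a successful run is left out here; without it the same messages are offered to sa-learn again on the next run.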

Just a couple of thoughts worth considering...

Thanks for the suggestion

Posted: Wed Jan 11, 2006 11:26 am
by ah4279
That's a great suggestion. I've created a spam account and I'll get my users to redirect their spam messages to the local unix account.

I'm still wondering if there is a way to accomplish this task with the scalix tools.

Posted: Wed Jan 11, 2006 8:55 pm
by mephisto
It would be perfect if there was a possibility to use the user's Junk-Folder for that. This way you can use Outlook's (or Thunderbird's) Spam function and also make SA learn from that.
I'm thinking of a tool that checks this folder once a day and purges messages older than a specified time. Could this be realized?

Posted: Thu Jan 12, 2006 1:22 am
by kali
I agree. But I have not thought of a good way to move messages from the Scalix mailstore to an mbox folder... other than imap's drag/drop, which could also be a script I suppose...

I also like the idea of a public folder for spam, so any user can move messages into that folder, then I only have to move messages from one folder to the mbox file.

Posted: Thu Jan 12, 2006 7:17 pm
by pete
This is something that I am also trying to implement. However, I think I read somewhere that
it is better to use per-user filtering and also that SpamAssassin needs a good supply of
HAM as well as SPAM to train. I think the problem is that it could associate particular user
names with SPAM as part of its Bayesian algorithm. This means that each user would have
to be processed separately on both their inbox and SPAM.

Or maybe I was dreaming?

Also, posting this gets me on the watch list for this topic :)

P

Posted: Thu Feb 02, 2006 2:29 pm
by santo
I was also watching this topic, but because no concrete howtos came out of it, I decided to try it out myself.

This is what I've come up with so far:

1) place copies of incoming messages which are marked as spam by Scalix into Inbox/salearn/spam
2) place copies of incoming messages which are not marked as spam by Scalix into Inbox/salearn/ham

=> Currently I'm using Outlook rules for this (because it's not possible with the server-side rules of Scalix as far as I know).
This means that the messages are only copied whenever my outlook is running.
Note that it is advisable to regularly check those folders as it's still possible spam is placed in the ham folder and vice versa (that's the reason we want spamassassin to learn after all)

3) set up a cronjob to export messages in Inbox/salearn/spam to spam.mbox
4) set up a cronjob to export messages in Inbox/salearn/ham to ham.mbox

=> For this export I'm using a small perl script that I found somewhere on the net (I'm sorry, but I don't remember where I found it) which I modified and tweaked a bit to my needs.

Here is the code of the script (imap2mbox.pl)

Code:

#!/usr/bin/perl
use strict;
use warnings;
use Mail::IMAPClient;

my $usage =
"ARGS must be :
\targv1 : mbox file to write
\targv2 : imap host
\targv3 : imap user
\targv4 : imap folder to export
\targv5 : password\n";

die($usage) if(@ARGV != 5);
my ($file,$host,$user,$dest,$password) = @ARGV;

my $imap = Mail::IMAPClient->new( 'Server' => $host , 'User' => $user , 'Password' => $password ) or die "Unable to connect to imap server\n";

foreach my $folder ($imap->folders) {
        $imap->select($folder) or die "Unable to select folder $@\n";
        if ($folder eq $dest) {
                print "-- Messages in $folder --\n";
                open (MBOX_SPAM, ">$file") or die "Can't open file $file: $!";
                my @list = $imap->messages or die "$folder: Unable to fetch message list $@";
                foreach my $mess (@list){
                                my @output = $imap->fetch(($mess,'RFC822')) or die "Unable to fetch $@";
                                print MBOX_SPAM "$output[1]" if(defined($output[1]));
                }
                ### Remove seen messages, because we don't need them anymore
                my $nrDeleted = $imap->delete_message( scalar($imap->seen) ) or warn "Could not delete_message: $@\n";
                print "$nrDeleted messages deleted\n";

                ### Ok, the messages are deleted, but in fact they aren't (welcome to IMAP ;-))
                ### So, we should expunge the folder to actually delete the messages
                $imap->expunge($folder) or die "Could not expunge: $@\n";

                close (MBOX_SPAM);
        ### Exit foreach, because we handled the required folder and there's no need to loop further
        ### over the remaining folders
                last;
        }
}
$imap->disconnect() or die "Unable to disconnect\n";

print "export of imap folder to mbox format finished\n";


5) set up a cronjob to "sa-learn" all spam in spam.mbox
6) set up a cronjob to "sa-learn" all ham in ham.mbox

=> For this I use a bash-script which I found somewhere on the net (I'm sorry, but also for this script I don't remember where I got it).

Here is the code of the script (spamlearn). Just make a copy of it called hamlearn, replace "/usr/local/data/spam.mbox" with "/usr/local/data/ham.mbox" at the top of the file, and change the two "sa-learn --spam" calls to "sa-learn --ham". Or even better: make it an argument.

Code:

#!/bin/bash

# This script takes a mail file full of SPAM and sa-learns it for you.
# sa-learn apparently will not split the mails apart to learn them. this
# script splits the mails in the mail file apart, runs them thru
# spamassassin -d to remove the markup, and feeds them to sa-learn.

# Specify the file on the command line, or change it here:
# this is the file with the spam you need to sa-learn
spamfile='/usr/local/data/spam.mbox'

# Override if you've specified one on the command line
if [[ "$1!" != "!" ]]; then spamfile=$1; fi

# Temp directory:
tmpdr="/tmp/"

if ( ! [ -r $spamfile ] ) ; then echo "Can't read $spamfile ... does it exist?"
exit ; fi

echo "Learning SPAM in $spamfile . . ."

# Let's copy your file, so if it is changed while we're working with it,
# we're ok. (TODO: implement locking?)
spamrnd="${tmpdr}spam${RANDOM}"
cp $spamfile $spamrnd
spamfile=$spamrnd

# this is a temporary file used for processing
tmpfile="${tmpdr}tmp${RANDOM}"

# this is the regular expression I stole from grepmail
# tmpfile will have a list of the line numbers that start new emails:
# CREDIT: Written by David Coppit (david@coppit.org, http://coppit.org/)
grep --extended-regexp --line-number "^(Return-Path: .*|X-Draft-From: .*|X-From-Line: .*|From [^:]+(:[0-9][0-9]){1,2} ([A-Z]{2,3} [0-9]{4}|[0-9]{4} [+-][0-9]{4}|[0-9]{4})( remote from .*)?)\$" $spamfile | sed "s/:.*//" > $tmpfile

# nummails will have the number of emails:
nummails=`grep -c . $tmpfile`

echo "$nummails message(s) . . ."

# now we can separate out the emails and work on them.

for ((x=1; x<nummails; x++)); do
linea=`awk -v a=$x -- '{ if (FNR == a) print }' < $tmpfile`
lineb=`awk -v a=$((x+1)) -- '{ if (FNR == a) print }' < $tmpfile`
awk -v a=$linea -v b=$lineb -- '{ if ((FNR>=a)&&(FNR<b)) print }' < $spamfile | spamassassin -d | sa-learn --spam ; done

linea=`awk -v a=$x -- '{ if (FNR == a) print }' < $tmpfile`
awk -v a=$linea -- '{ if (FNR>=a) print }' < $spamfile | spamassassin -d | sa-learn --spam

rm -f $tmpfile
rm -f $spamfile
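
The loop above rescans the whole file with awk for every message. A possible alternative is to split the file in a single pass and then feed each piece to sa-learn. This is only a sketch of a different technique, not the script's own method: split_messages is a hypothetical helper, and it assumes every message in the export starts with a "Return-Path: " line (the simplest case of the regular expression above, and what the raw IMAP export normally produces).

```bash
#!/bin/bash
# Hypothetical helper: split an exported mail file into one file per message.
# Assumption: each message begins with a "Return-Path: " header line.
# Prints the number of messages written.
split_messages() {
    local mbox=$1 outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        /^Return-Path: / {
            # A new message starts here: close the previous file, open the next.
            if (out) close(out)
            n++
            out = sprintf("%s/msg%05d.eml", dir, n)
        }
        out { print > out }      # lines before the first message are skipped
        END { print n + 0 }
    ' "$mbox"
}
```

Each resulting file can then go through the same pipeline as before, for example `for f in "$outdir"/msg*.eml; do spamassassin -d < "$f" | sa-learn --spam; done`.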


And this is the local.cf file I use for spamassassin:

Code:

# Add your own customisations to this file.  See 'man Mail::SpamAssassin::Conf'
# for details of what can be tweaked.
#


# do not change the subject
# to change the subject, e.g. use
# rewrite_header Subject ****SPAM(_SCORE_)****
rewrite_header Subject

# Set the score required before a mail is considered spam.
required_score 3.50

# Encapsulate spam in an attachment (0=no, 1=yes, 2=safe)
report_safe             1

# Enable the Bayes system
use_bayes               1

# Enable Bayes auto-learning
bayes_auto_learn              1

# Enable or disable network checks
skip_rbl_checks         0
use_razor2              1
use_dcc                 1
use_pyzor               1

# Mail using languages used in these country codes will not be marked
# as being possibly spam in a foreign language.
# - dutch english french german
ok_languages            nl en fr de

# Mail using locales used in these country codes will not be marked
# as being possibly spam in a foreign language.
ok_locales              en



According to this output, spamassassin is learning now (and that's what we wanted, isn't it ;-)) :

Code:

vmsrv-scalix:~ # sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        100          0  non-token data: nspam
0.000          0        617          0  non-token data: nham
0.000          0      74497          0  non-token data: ntokens
0.000          0 1119868955          0  non-token data: oldest atime
0.000          0 1138816237          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
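
A note on reading this output: as far as I know, SpamAssassin only starts applying Bayes scores once it has learned at least 200 spam and 200 ham messages (the defaults of bayes_min_spam_num and bayes_min_ham_num), so with nspam at 100 the Bayes rules would probably not fire yet. A small sketch for checking this from the dump follows; bayes_ready is a hypothetical helper name, and 200 is assumed to be the threshold in effect.

```bash
#!/bin/bash
# Hypothetical helper: read "sa-learn --dump magic" output on stdin and
# report whether both counters have reached the minimum (default 200).
# Exit status 0 means both are high enough, 1 means at least one is not.
bayes_ready() {
    local min=${1:-200}
    awk -v min="$min" '
        $NF == "nspam" { nspam = $3 }   # third column holds the count
        $NF == "nham"  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)
        }'
}
```

Used as `sa-learn --dump magic | bayes_ready`, or with a different threshold as `sa-learn --dump magic | bayes_ready 500`.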



Finally I also added a cronjob which runs sa-learn --sync each night (after the other jobs).
Not sure if this is necessary, but I found some references to "sa-learn --rebuild" which should be run after other sa-learn jobs.
But "sa-learn --rebuild" seems to be deprecated and replaced by "sa-learn --sync".
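
One way to guarantee that the sync really runs after the learning jobs is to call everything from a single wrapper instead of separate cron entries. A rough sketch, assuming spamlearn and hamlearn are installed under /usr/local/bin and that the IMAP exports have already run; that path, the run helper, and the DRY_RUN switch are assumptions for illustration, not part of the setup described above.

```bash
#!/bin/bash
# Hypothetical nightly wrapper: learn spam, learn ham, then sync the Bayes
# journal. Set DRY_RUN=1 to only print the commands without running them.
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

nightly_learn() {
    # Stop at the first failing step so a broken export is not synced.
    run /usr/local/bin/spamlearn /usr/local/data/spam.mbox || return 1
    run /usr/local/bin/hamlearn /usr/local/data/ham.mbox || return 1
    run sa-learn --sync
}
```

A single cron entry calling a script that ends with `nightly_learn` would then replace the separate learn and sync jobs.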


Please note that this is based on info I gathered in this and several other forums, together with lots of googling, so no guarantees here...

Posted: Thu Feb 02, 2006 3:20 pm
by pete
Hi, Santo:

That sounds pretty reasonable. I was playing with a similar idea, only using a global SPAM
and HAM mailbox, defined as a user delegating its SPAM and HAM mailboxes to all other
users. Users move or copy (with drag 'n drop) their spam/ham into the correct box. I have
a small script which uses fetchmail to get the contents of SPAM and HAM into a system
mailbox, then runs sa-learn on it.

This works but seems too complicated to me. I like the idea of using your Perl script
instead of fetchmail as it won't inject additional headers into the message. Also, I'm taking a
look at DSPAM, but that means installing a border gateway to receive all inbound mail....

I'll keep trying stuff and I'll post whatever I come up with here.

P

Posted: Fri Feb 03, 2006 1:35 pm
by pviglucci
An easy (manual) way to get the messages into a format that sa-learn can use is to connect with Thunderbird through IMAP. The default message format of Thunderbird is mbox. It's simply a matter of connecting, downloading the messages, copying the Thunderbird mbox file to the server, and running sa-learn.


Pete

Posted: Fri Feb 03, 2006 1:44 pm
by pete
One of the issues is that in order to get to a user's mailstore, you need to know their
password. It seems that there is no administrative override to be able to read a user's mail.
As I don't know all of my users' passwords, I am a bit out of luck with this. That is why I am
looking at public folders or a global user for the users to stick their spam/ham into for
processing.

Scalix support - is there a way for an Administrator to access a user's mailstore? (maybe I
should start a new thread on this?)

/Pete

Using mbox admin rights in scalix 10 for imap user access

Posted: Sat Feb 04, 2006 12:47 pm
by ScalixSupport
Hi Pete,

Scalix 10 will have a feature that allows non-admin mailboxes to be accessed using a login that has mbox admin rights.

-Kent

Re: Training spamassassin using sa-learn

Posted: Mon Feb 06, 2006 5:04 am
by ScalixSupport
ah4279 wrote: We are just transitioning from a WinBloze-based mail server to FC4/Scalix and we are getting a lot of spam. Using the out-of-the-box spamassassin rules and scores we are catching ~60% of the spam messages.

I would like my users to save the spam messages so that I can eventually train spamassassin with sa-learn after we have built up a large set of spam.

My question is, how do I go about getting the spam and ham messages from a user's account into a format that spamassassin will train from?

Any suggestions?


A very interesting thread. I'd like to add that a 60% spam detection rate is very low. I wonder what tests you are running and if you ran into the same trap as many SA users seem to. Try running SA in debug mode and verify the user that runs it has read rights to the /etc/mail/spamassassin rules directory.
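
A minimal sketch of that permission check, meant to be run as the account that runs SpamAssassin. check_rules_readable is a hypothetical helper name, and the default path is the rules directory mentioned above.

```bash
#!/bin/bash
# Hypothetical helper: verify that the current user can enter the rules
# directory and read every .cf file in it. Returns 0 if all is readable.
check_rules_readable() {
    local dir=${1:-/etc/mail/spamassassin} f bad=0
    if [ ! -d "$dir" ] || [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
        echo "cannot enter $dir" >&2
        return 1
    fi
    for f in "$dir"/*.cf; do
        [ -e "$f" ] || continue          # no .cf files at all
        [ -r "$f" ] || { echo "unreadable: $f" >&2; bad=1; }
    done
    return $bad
}
```

If the check passes, running `spamassassin -D --lint` as the same user prints debug output while the configuration is parsed, which should show whether the rule files are actually being loaded.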

Cheers,

Sascha.

Posted: Tue Feb 07, 2006 4:42 pm
by kali
I agree. 60% is very low for a properly configured SpamAssassin. I do use some blacklists at the gateway (saves on processing spam mail) then tweak some of the SA scores to be more reasonable. I DO up the Bayes scores as the more "learned" it gets, the more accurate it becomes.

On average - my gateways kill 96%+ of all incoming spam.

Posted: Mon Feb 13, 2006 1:42 am
by leigh
One change I would suggest to your code. In your perl IMAP script, change this line:

Code:

        $imap->select($folder) or die "Unable to select folder $@\n";

to

Code:

        $imap->select($folder) or next;


Otherwise it will die when trying to select "Public Folders".
If you happen to have your spam/ham folders inside Public Folders, it will never get to them.

Posted: Wed Feb 15, 2006 9:06 am
by STXRich
Just to add another avenue to this thread: make sure you're running the latest version of SpamAssassin, which I believe is 3.0 or 3.1. The older version was really bad at catching spam. Once I upgraded, there was a definite improvement.

I would also suggest looking into getting Razor hooked into spamassassin. Once I added that in, it's been kicking the crap out of our spam mail.