I was also following this topic, but since no concrete how-tos came out of it, I decided to try it out myself.
This is what I've come up with so far:
1) place copies of incoming messages which are marked as spam by Scalix into Inbox/salearn/spam
2) place copies of incoming messages which are not marked as spam by Scalix into Inbox/salearn/ham
=> Currently I'm using Outlook rules for this (as far as I know, it's not possible with the server-side rules of Scalix).
This means that the messages are only copied while my Outlook is running.
Note that it is advisable to check those folders regularly, because spam can still end up in the ham folder and vice versa (that's the reason we want SpamAssassin to learn, after all).
3) set up a cronjob to export messages in Inbox/salearn/spam to spam.mbox
4) set up a cronjob to export messages in Inbox/salearn/ham to ham.mbox
=> For this export I'm using a small Perl script that I found somewhere on the net (sorry, I don't remember where) and modified a bit to suit my needs.
Here is the code of the script (imap2mbox.pl):
#!/usr/bin/perl
# imap2mbox.pl: export the messages of one IMAP folder to a file and
# delete the exported messages from the folder afterwards.
use strict;
use warnings;
use Mail::IMAPClient;
my $usage =
"ARGS must be :
\targv1 : mbox file
\targv2 : imap host
\targv3 : imap user
\targv4 : mailbox (folder) on imap server to export
\targv5 : password\n";
die($usage) if (@ARGV != 5);
my ($file, $host, $user, $dest, $password) = @ARGV;
my $imap = Mail::IMAPClient->new(
    Server   => $host,
    User     => $user,
    Password => $password,
) or die "Unable to connect to imap server: $@\n";
### Select the requested folder directly; looping over all folders would
### abort on the first folder that cannot be selected
$imap->select($dest) or die "Unable to select folder $dest: $@\n";
print "-- Messages in $dest --\n";
open(my $mbox, '>', $file) or die "Can't open file $file: $!";
### An empty folder yields an empty list, which is not an error
my @list = $imap->messages;
foreach my $mess (@list) {
    ### message_string returns the complete message (headers and body)
    my $string = $imap->message_string($mess);
    defined($string) or die "Unable to fetch message $mess: $@";
    print $mbox $string;
}
close($mbox) or die "Can't close file $file: $!";
if (@list) {
    ### Remove the exported messages, because we don't need them anymore
    my $nrDeleted = $imap->delete_message(\@list)
        or warn "Could not delete_message: $@\n";
    print "$nrDeleted messages deleted\n" if $nrDeleted;
    ### Ok, the messages are flagged as deleted, but in fact they are still
    ### there (welcome to IMAP ;-)), so expunge the folder to really remove them
    $imap->expunge($dest) or die "Could not expunge: $@\n";
}
$imap->disconnect() or die "Unable to disconnect\n";
print "export of imap folder to mbox format finished\n";
5) set up a cronjob to "sa-learn" all spam in spam.mbox
6) set up a cronjob to "sa-learn" all ham in ham.mbox
=> For this I use a bash script which I also found somewhere on the net (sorry, I don't remember where I got this one either).
Here is the code of the script (spamlearn). Just make a copy of it called hamlearn, replace "/usr/local/data/spam.mbox" with "/usr/local/data/ham.mbox" at the top of the file (or pass the file as the first argument), and change every "sa-learn --spam" to "sa-learn --ham". Otherwise the copy would learn your ham as spam. Even better: make the spam/ham choice an argument.
#!/bin/bash
# This script takes a mail file full of SPAM and sa-learns it for you.
# sa-learn apparently will not split the mails apart to learn them. this
# script splits the mails in the mail file apart, runs them thru
# spamassassin -d to remove the markup, and feeds them to sa-learn.
# Specify the file on the command line, or change it here:
# this is the file with the spam you need to sa-learn
spamfile='/usr/local/data/spam.mbox'
# Override if you've specified one on the command line
if [[ -n "$1" ]]; then spamfile=$1; fi
# Temp directory:
tmpdr="/tmp/"
if [[ ! -r "$spamfile" ]]; then
    echo "Can't read $spamfile ... does it exist?"
    exit 1
fi
echo "Learning SPAM in $spamfile . . ."
# Let's copy your file, so if it is changed while we're working with it,
# we're ok. (TODO: implement locking?)
spamrnd="${tmpdr}spam${RANDOM}"
cp "$spamfile" "$spamrnd"
spamfile=$spamrnd
# this is a temporary file used for processing
tmpfile="${tmpdr}tmp${RANDOM}"
# this is the regular expression I stole from grepmail
# tmpfile will have a list of the line numbers that start new emails:
# CREDIT: Written by David Coppit (david@coppit.org, http://coppit.org/)
grep --extended-regexp --line-number "^(Return-Path: .*|X-Draft-From: .*|X-From-Line: .*|From [^:]+(:[0-9][0-9]){1,2} ([A-Z]{2,3} [0-9]{4}|[0-9]{4} [+-][0-9]{4}|[0-9]{4})( remote from .*)?)\$" "$spamfile" | sed "s/:.*//" > "$tmpfile"
# nummails will have the number of emails:
nummails=$(grep -c . "$tmpfile")
echo "$nummails message(s) . . ."
# nothing to learn if no message start was found
if (( nummails == 0 )); then rm -f "$tmpfile" "$spamfile"; exit 0; fi
# now we can separate out the emails and work on them.
# messages 1 .. nummails-1 run from their start line up to the next start line
for ((x=1; x<nummails; x++)); do
    linea=$(awk -v a="$x" -- '{ if (FNR == a) print }' < "$tmpfile")
    lineb=$(awk -v a="$((x+1))" -- '{ if (FNR == a) print }' < "$tmpfile")
    awk -v a="$linea" -v b="$lineb" -- '{ if ((FNR>=a)&&(FNR<b)) print }' < "$spamfile" | spamassassin -d | sa-learn --spam
done
# the last message runs from its start line to the end of the file
linea=$(awk -v a="$x" -- '{ if (FNR == a) print }' < "$tmpfile")
awk -v a="$linea" -- '{ if (FNR>=a) print }' < "$spamfile" | spamassassin -d | sa-learn --spam
rm -f "$tmpfile"
rm -f "$spamfile"
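The splitting step in the script above calls awk once or twice per message. As a rough sketch of the same idea in a single pass, the function below writes each message to its own file. It is not part of the original script: the name split_messages and the msg.N file names are made up for illustration, and it only treats lines starting with "Return-Path: " as message starts, which is a simplification of the grepmail expression.

```shell
#!/bin/bash
# split_messages FILE OUTDIR
# Hypothetical helper: writes every message found in FILE to OUTDIR/msg.N.
# Assumption: a line starting with "Return-Path: " begins a new message
# (a simplified stand-in for the grepmail regular expression).
split_messages() {
    local file=$1 outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        /^Return-Path: / {
            if (out) close(out)              # finish the previous message file
            n++
            out = sprintf("%s/msg.%d", dir, n)
        }
        out { print > out }                  # lines before the first start are skipped
    ' "$file"
}
```

Each resulting file could then be piped through spamassassin -d and sa-learn --spam (or --ham) in a simple for loop, the same way the script feeds the individual messages.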
And this is the local.cf file I use for spamassassin:
# Add your own customisations to this file. See 'man Mail::SpamAssassin::Conf'
# for details of what can be tweaked.
#
# do not change the subject
# to change the subject, e.g. use
# rewrite_header Subject ****SPAM(_SCORE_)****
rewrite_header Subject
# Set the score required before a mail is considered spam.
required_score 3.50
# Encapsulate spam in an attachment (0=no, 1=yes, 2=safe)
report_safe 1
# Enable the Bayes system
use_bayes 1
# Enable Bayes auto-learning
bayes_auto_learn 1
# Enable or disable network checks
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 1
# Mail using languages used in these country codes will not be marked
# as being possibly spam in a foreign language.
# - dutch english french german
ok_languages nl en fr de
# Mail using locales used in these country codes will not be marked
# as being possibly spam in a foreign language.
ok_locales en
According to this output, SpamAssassin is learning now (and that's what we wanted, isn't it ;-)):
vmsrv-scalix:~ # sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        100          0  non-token data: nspam
0.000          0        617          0  non-token data: nham
0.000          0      74497          0  non-token data: ntokens
0.000          0 1119868955          0  non-token data: oldest atime
0.000          0 1138816237          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
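To keep an eye on these numbers from a script, the counts can be pulled out of the dump with awk. This is only a sketch: bayes_counts is a made-up helper name, and it assumes the column layout shown above, with the count in the third column and the label at the end of the line.

```shell
#!/bin/bash
# bayes_counts: hypothetical helper that reads `sa-learn --dump magic`
# output on stdin and prints the learned spam and ham message counts.
# Assumption: the count is the third whitespace-separated column and the
# label (nspam / nham) is the last word on the line.
bayes_counts() {
    awk '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END { printf "nspam=%d nham=%d\n", nspam, nham }
    '
}
```

It would be used as "sa-learn --dump magic | bayes_counts". As far as I know, SpamAssassin only starts using the Bayes results once both counts reach the configured minimum (bayes_min_spam_num and bayes_min_ham_num, 200 each by default), so the counts are worth watching until then.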
Finally, I also added a cronjob which runs sa-learn --sync each night (after the other jobs).
I'm not sure if this is necessary, but I found some references to "sa-learn --rebuild", which should be run after other sa-learn jobs.
However, "sa-learn --rebuild" seems to be deprecated and replaced by "sa-learn --sync".
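Since the order of the jobs matters (export first, then learn, then sync), one option is a single wrapper script called from one cron entry instead of several separate cron jobs. A rough sketch follows. The script locations, the IMAP host, the user, and the password handling are all placeholders I made up, and the folder names assume the IMAP server exposes them exactly as Inbox/salearn/spam and Inbox/salearn/ham.

```shell
#!/bin/bash
# Hypothetical nightly driver. Every path, the host, the user and the
# password below are assumptions for illustration; adjust them locally.
DATA_DIR=${DATA_DIR:-/usr/local/data}
BIN_DIR=${BIN_DIR:-/usr/local/bin}
IMAP_HOST=${IMAP_HOST:-localhost}
IMAP_USER=${IMAP_USER:-someuser}
IMAP_PASSWORD=${IMAP_PASSWORD:-changeme}

# run CMD ARGS...: print the command when DRY_RUN=1, otherwise execute it
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Runs the steps in order and stops at the first failure
nightly_learn() {
    local kind
    for kind in spam ham; do
        run perl "$BIN_DIR/imap2mbox.pl" "$DATA_DIR/$kind.mbox" \
            "$IMAP_HOST" "$IMAP_USER" "Inbox/salearn/$kind" "$IMAP_PASSWORD" || return 1
    done
    run "$BIN_DIR/spamlearn" "$DATA_DIR/spam.mbox" || return 1
    run "$BIN_DIR/hamlearn" "$DATA_DIR/ham.mbox" || return 1
    run sa-learn --sync
}
```

Saved as a script with a call to nightly_learn added as its last line, it could be started by one cron entry such as "30 2 * * * /usr/local/bin/nightly_learn.sh" (path and time are again just examples). Running it once with DRY_RUN=1 prints the commands without executing them. Keep in mind that imap2mbox.pl takes the password as a command-line argument, so it is visible in the process list while the export runs.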
Please note that this is based on info I gathered in this and several other forums, together with lots of googling, so no guarantees here...