Training spamassassin using sa-learn

jedwards · Postby **jedwards** » Mon Feb 20, 2006 4:15 pm

Redirecting is an option; however, Outlook has no redirect function. If it did, the issue would be that, to Outlook, it would appear as if someone other than the user's account was sending the mail. This isn't possible with Exchange unless the admin allows it for each possible sender. Because a redirect can involve an unlimited number of senders, the list would be a nightmare and therefore prohibitive. So redirect with an Exchange server was moot, and the function was never created.

If redirect were possible with Scalix, the admin would simply create a local, non-Scalix user on the same server Scalix is running on. This makes Sendmail place the mail in a normal spool file for that user, so sa-learn would have no problem running from cron. This was mentioned earlier in this thread.

Obviously a normal user cannot forward to the spam/ham addresses because then the user would be the sender and it would poison the database. Users would be training the filter against themselves.

If the Scalix devs would create the ability to redirect, the users would be responsible for classifying their own mail, which seems to work well. Some companies have problems with this because the users are confused. I'm not sure why.

The ideas presented here are viable but limited, because they involve managing something for everyone joining and leaving a company. The Perl script method would work fine and allow users to classify their own mail; however, there's maintenance. That's okay with low employee turnover.

I believe the best Scalix could do for us is to work out the redirect function.

Scalix has a few minor annoyances. However, our overall opinion of it is very high. They've done quite a service by creating a way to run far away from the blasted Exchange mechanism Microsoft calls a mail server.

Jim Edwards
Network Design & Consulting, LLC
800 W. Route 897
Blainsport, PA 17569-9573

leigh · Postby **leigh** » Mon Feb 20, 2006 6:22 pm

Not quite sure I follow you here.
By using two public folders, SPAM and NON_SPAM, all a user has to do is drag their spam into the public folders/spam folder. Occasionally putting some good mail in the non-spam folder helps.
No maintenance required. Just participation on the part of your end-users. And if they don't contribute to the solution, they have no right to complain about the results, right?
The Perl scripts given here work very well. I have them running from cron every night and have had no hiccups at all.
As a helper, I have a "possible spam" folder, and a rule to file anything with a SpamAssassin score of 3 or more in it. Anything from there which is spam absolutely MUST go into the public SPAM folder. If it isn't spam (quite rare), a copy goes into the public NON_SPAM folder.
Where's the maintenance in that?

Postby **ScalixSupport** » Mon Feb 20, 2006 6:44 pm

The Scalix server allows you to redirect and it's something that you can set up using the sxaa script in the admin_resource_kit/ directory.

However, like you've pointed out, the Outlook UI doesn't allow a user to forward on a message without affecting the headers. So, typically, this needs to be done on receipt of the message.

The downside to this approach is that if you have a false positive, some action is needed for the end-user to be able to detect this and retrieve the message from wherever it was redirected. This is why a participative approach, where the end-user actively indicates that something is spam, works best.

Cheers

Dave

jedwards · Postby **jedwards** » Mon Feb 20, 2006 7:22 pm

leigh - yes, the scripts work. Where I went off course is that I thought the purpose of the script was to access each person's junk folder. I apologize for not reading more carefully. The maintenance I was referring to would have been the list of users, which would change as people came and went. However, I see this isn't the case.

Jim Edwards

leigh · Postby **leigh** » Mon Feb 20, 2006 7:58 pm

All depends on how you set it up.
The Perl script given takes a username and password on its command line, as well as the folder you want to access. Simply use "public folders\spam", or whatever your spam folder is.
Make sure you use the modifications I mentioned, though, or it won't access public folders properly.
Doing it this way, you only need IMAP access, and a public folder which everyone can drop into, so permissions are a breeze. The contents of the folders get deleted when the script runs, so it's not hanging around.
Make sure you install Mail::IMAPClient or it won't work.
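
For readers who don't have the Perl script to hand, the export step could look roughly like the Python sketch below. This is not the script from this thread: it swaps Mail::IMAPClient for Python's standard imaplib and mailbox modules, and the host, credentials and folder name are placeholders you would have to supply. It assumes a plain (non-SSL) IMAP connection; imaplib.IMAP4_SSL would be the alternative.

```python
import imaplib
import mailbox


def append_to_mbox(mbox_path, raw_messages):
    """Append raw RFC 822 messages (bytes) to an mbox file.

    mailbox.mbox writes a "From " separator line before each message.
    Returns the number of messages appended.
    """
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for raw in raw_messages:
            box.add(raw)
            count += 1
        box.flush()
    finally:
        box.close()
    return count


def export_folder(host, user, password, folder, mbox_path):
    """Copy the messages of one IMAP folder into mbox_path, then delete them.

    host, user, password and folder are placeholders for your own setup.
    """
    conn = imaplib.IMAP4(host)
    try:
        conn.login(user, password)
        # Folder names containing spaces must be quoted for IMAP.
        status, _ = conn.select('"%s"' % folder)
        if status != "OK":
            raise RuntimeError("cannot select folder %r" % folder)
        status, data = conn.search(None, "ALL")
        numbers = data[0].split() if status == "OK" else []
        fetched = []
        raw_messages = []
        for num in numbers:
            status, parts = conn.fetch(num, "(RFC822)")
            if status == "OK" and parts and isinstance(parts[0], tuple):
                fetched.append(num)
                raw_messages.append(parts[0][1])
        exported = append_to_mbox(mbox_path, raw_messages)
        # Flag for deletion only what was actually written to disk.
        for num in fetched:
            conn.store(num, "+FLAGS", "\\Deleted")
        conn.expunge()
        return exported
    finally:
        conn.logout()
```

Something like export_folder("mail.example.com", "someuser", "secret", "Public Folders/SPAM", "spambox") would be the call; all five values are made up for illustration.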

mhanisch · Postby **mhanisch** » Tue Feb 21, 2006 6:57 am

Maybe a few remarks from my side, since I also wrote a couple of posts on this matter:
What I don't like about a Public Folder for spam is that you would also need a public folder for ham, which raises some privacy concerns, since other users would be able to access the mail therein.
Besides, you cannot move spam emails to that folder automatically - the volume would be so high that users would not be able to check it for false positives originally sent to their own account. (Of course, they could filter/search that folder, but I guess they simply won't...)
This might be less of a problem if you split your spam messages into "possibly spam" and "definitely spam", which will likely reduce the amount of false positives.

Anyway, since installing Scalix 10, we've decided to do it the following way:

- create a rule to redirect all spam emails, using sxaa; this way, all users will have the same rule in place; the rule also takes care of automatically creating the "spam" folder if necessary. Note that this way, every user will have her own private "spam" folder.
- for storing "ham", the users' inboxes will be used (for the time being only the INBOX itself, not other folders below it; this may change). This setup allows the users to classify messages as spam or ham, all within their own folder structure.
- we have created an account with "mboxadmin" capability to access the ham and spam messages
- there's a cron job that goes through all recent lists, invokes sa-learn and then deletes the messages from the spam folder

The cron job is a small (Perl) script that does the following:

- get the list of users using LDAP
- for each user, access the ham and spam folders, and save the last X days' worth of messages in an mbox file (logging in using the "mboxadmin" account)
- invoke sa-learn on the mbox file and delete the file again
- proceed with the next user
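
The steps above can be sketched as a small loop. This is a Python outline, not the Perl script described in this post: the LDAP lookup, the IMAP export and the sa-learn call are passed in as callables, and all of those names are hypothetical.

```python
def train_all_users(users, export_folder_to_mbox, learn, remove_file):
    """Run the export / learn / delete cycle for every user.

    The three callables are hypothetical hooks supplied by the caller:
      export_folder_to_mbox(user, kind) -> path of the mbox file written
      learn(path, kind)                 -> invokes sa-learn on that file
      remove_file(path)                 -> deletes the temporary mbox file
    Returns the list of (user, kind) pairs that were processed.
    """
    processed = []
    for user in users:
        for kind in ("ham", "spam"):
            path = export_folder_to_mbox(user, kind)
            try:
                learn(path, kind)
            finally:
                # Remove the temporary file even if learning failed.
                remove_file(path)
            processed.append((user, kind))
    return processed
```

Keeping the user list as an input means the caller can fill it from LDAP on every run, which is what removes the maintenance.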

Since the list of users is retrieved automatically, there's no maintenance involved.

There's one caveat, though:
creating an account with the mboxadmin capability/permission effectively opens a pretty easy possibility for spying on your users. Here (Germany) this requires consent from the employees (as long as you allow them to receive/send private emails).
So while the technical problems have been solved, this organizational issue remains open, but I'm pretty sure that the users would not mind, as this approach to spam filtering gives them pretty good control over what happens with their messages, while keeping the manual work required to a minimum.

mephisto · Postby **mephisto** » Tue Feb 21, 2006 9:04 am

Would you mind providing us with this script?

jedwards · Postby **jedwards** » Tue Feb 21, 2006 12:14 pm

Returning to the Perl script concept using Public Folders, can you not deny users read access to the ham folder? Wouldn't this allow users to drop mails in but not be able to read them?

Jim Edwards

jedwards · Postby **jedwards** » Tue Feb 21, 2006 5:15 pm

It seems that using the perl script, at least for me, produces a file which sa-learn doesn't like. All I get is the following:

sa-learn --showdots --mbox --spam spam

Learned from 0 messages.

In 1982 I had an instructor tell the class: "Remember, never overlook the obvious."
I still do, can't help it. Occam's razor. I'm lost here.

When I get the spam corpus from the previous server and run sa-learn against it, it works fine.

Anyone able to offer suggestions? My client is beginning to think I'm an idiot.

Jim Edwards

leigh · Postby **leigh** » Wed Feb 22, 2006 11:02 pm

I would not suggest putting all your spam automatically into the spam folder for SpamAssassin to sa-learn.
If SpamAssassin has already tagged a message as spam, it doesn't need to learn from it again, does it? It already knows.
I have two spam rules on my incoming mail:
If the header contains "X-Spam-Status: Yes", it goes straight into my (private) spam folder and I never bother with it.
If the header contains "X-Spam-Status: No" and also contains "X-Spam-Level: ***", it goes into a "possible spam" folder. This is the only one I bother with. Essentially, it's stuff that SpamAssassin has scored 3 or more on, but doesn't consider spam. This is the stuff it needs to learn about.
I also make sure I feed it plenty of ham. I have told everybody to put copies of non-confidential stuff in the ham folder.
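
As an illustration of those two rules, here is a small Python sketch of the sorting decision. It is only a model of the logic, not a mail-client rule: the folder names and the function name are made up, and it assumes the second rule is meant for mail that SpamAssassin did not flag as spam but still scored 3 or more.

```python
from email import message_from_string


def choose_folder(raw_message, level_threshold=3):
    """Pick a folder from the SpamAssassin headers of one message.

    "spam"          -> already flagged (X-Spam-Status: Yes)
    "possible spam" -> not flagged, but X-Spam-Level has 3 or more stars
    "inbox"         -> everything else
    """
    msg = message_from_string(raw_message)
    status = (msg.get("X-Spam-Status") or "").strip().lower()
    level = (msg.get("X-Spam-Level") or "").strip()
    if status.startswith("yes"):
        return "spam"
    if level.count("*") >= level_threshold:
        return "possible spam"
    return "inbox"
```

Only the "possible spam" folder needs a human look; whatever is confirmed as spam there is what goes on to the public SPAM folder.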

Have you had a look at the file which the Perl script produces? Does it look like an mbox file? Better yet, is there any spam in the folder it's looking at in the first place?

jedwards · Postby **jedwards** » Thu Feb 23, 2006 12:35 am

Well, these are the 'obvious' things one might overlook :D There is spam in the public folder. It was moved from the inbox. My filters are set up pretty much the same way, so the X-Spam-Flag: YES sends it to the junk box - which is not what I use for sa-learn. I was only placing mails which we considered spam, which arrived in the inbox (X-Spam-Flag: No) into the public folder which imaps out to sa-learn.

When I look at the output file produced, it begins with different headers other than From, however I don't think that's going to make any difference to the parser. There's ~ 750 spams in there.
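
One thing worth checking at this point: in the traditional mbox format, each message is introduced by a separator line that starts with "From " (with a space, no colon), which is different from the "From:" header. The Python sketch below simply counts such lines. Whether a lack of them explains the "Learned from 0 messages" result here is an assumption, not something confirmed in this thread.

```python
def count_mbox_separators(path):
    """Count lines that start with "From " (note the trailing space).

    In a well-formed mbox file this is roughly the number of messages;
    zero means an mbox reader has nothing to split the file on.
    """
    count = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b"From "):
                count += 1
    return count
```

If this returns 0 for a file that holds hundreds of messages, the file is not in the layout that an mbox parser expects.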

Jim

leigh · Postby **leigh** » Thu Feb 23, 2006 12:45 am

750?!?! Are you deleting them after the IMAP script extracts them?

OK, here's what you should be seeing:

-- Messages in Public Folders/SPAM --
11 messages deleted
export of imap folder to mbox format finished
Learning SPAM in spambox . . .
11 message(s) . . .
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).
Learned from 1 message(s) (1 message(s) examined).

Just so you get an idea of what the mbox file looks like, here's the start of mine:

Return-Path: <kpknhmrtzufsj@thirdkind.com>
Received: from <my.server.name>(localhost.localdomain [127.0.0.1])
by <my.server.name>(8.13.4/8.13.4) with ESMTP id k1MGm7GK009187
for <my email>; Thu, 23 Feb 2006 03:48:07 +1100
Received: from <my.server.name>(root@localhost)
by <my.server.name> (8.13.4/8.13.4/Submit) with ESMTP id k1MGm70M009185
for <my email>; Thu, 23 Feb 2006 03:48:07 +1100
Received: from <another.server.name>(another.server.name and IP address)
by <my.server.name> (Scalix SMTP Relay 10.0.0.175)
via ESMTP; Thu, 23 Feb 2006 03:48:07 +1100 (EST)
Received: from another.server.name (another.server.name [IP address])
by another.server.name (8.11.6/8.11.6) with ESMTP id k1MGm6514273
for <my@email address>; Thu, 23 Feb 2006 03:48:06 +1100
Received: from 71-12-24-232.dhcp.gnvl.sc.charter.com (71-12-24-232.dhcp.gnvl.sc.charter.com [71.12.24.232])
by another.server.name (8.12.11/8.12.11) with SMTP id k1MGm3eI030073
for <my@email address>; Thu, 23 Feb 2006 03:48:05 +1100
Received: from .anu..au ([1] helo=anu..au)
by smtp6..co with esmtp
id 1A5Ys6-865148-05
Date: Wed, 22 Feb 2006 17:43:49 +0100
From: "Nina Thacker" <kpknhmrtzufsj@thirdkind.com>
Sender: freeradius-devel-kpknhmrtzufsj@thirdkind.com
To:my@email.address
Message-ID: <NCBAKEOAA..@cde.Com>
Subject: Automation System
X-Mailman-Version: 2.0.1
X-Spam-Status: No, score=-0.1 required=5.0 tests=ALL_TRUSTED,BAYES_50,
URIBL_OB_SURBL autolearn=ham version=3.0.3
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on
mail.pacificwireless.net.au
MIME-Version: 1.0
Content-Type: text/plain;
charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

"Ci-iallis Sof-tabs" is better than Pfizer V-iiaggrra
and normal Ci-ialis because:

So, firstly, does it look like this?
What is producing this line:

sa-learn --showdots --mbox --spam spam

As far as I can tell, the scripts given in this thread won't produce that output.

jedwards · Postby **jedwards** » Thu Feb 23, 2006 1:06 am

Yes, 750. That's a small sample of what comes into this place.

I don't see in the Perl script I have where it runs sa-learn. I cut the perl script out of this thread, and then modified it as you'd suggested.

I typed the command you mentioned myself.

sa-learn --showdots --mbox --spam spam

--showdots makes sa-learn show progress
--mbox is the format of the file being scanned
--spam means it's learning spam, not ham
spam is the name of the mbox file being scanned
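
For anyone wanting to run the same command from a script, a minimal Python wrapper might look like the following. It only uses the flags listed above plus --ham; the function names are made up for illustration, and sa-learn is assumed to be on the PATH.

```python
import subprocess


def build_sa_learn_command(mbox_path, kind, show_dots=True):
    """Build the sa-learn argument list for one mbox file.

    kind must be "spam" or "ham" and selects --spam or --ham.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    command = ["sa-learn"]
    if show_dots:
        command.append("--showdots")
    command += ["--mbox", "--" + kind, mbox_path]
    return command


def run_sa_learn(mbox_path, kind):
    """Run sa-learn and return whatever it printed (stdout and stderr)."""
    result = subprocess.run(
        build_sa_learn_command(mbox_path, kind),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Here build_sa_learn_command("spam", "spam") yields the same command line as the one typed by hand.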

Yes, the mbox file looks like yours.

Jim

leigh · Postby **leigh** » Thu Feb 23, 2006 1:15 am

jedwards wrote: "I don't see in the Perl script I have where it runs sa-learn. I cut the perl script out of this thread, and then modified it as you'd suggested."

AHA, herein lies the problem.
The perl script simply IMAPs the mails off and then creates an mbox file from them. sa-learn can't talk straight mbox. There's a shell script in the same posting which pulls the mbox file apart and feeds it to sa-learn. Run them both in order, and then the same for ham.

pete · Postby **pete** » Thu Feb 23, 2006 1:23 pm

--removed--
