local queue growing, mail not being delivered to scalix user

Posted: Mon Apr 23, 2007 9:10 pm
by jmherr
Hello, every two days or so our local delivery queue backs up, and it's hit or miss whether we can get it back by omoff/on ld. Usually an omshut followed by omrc brings it back, but this time, after a day of trying different combinations, we're at 1843 messages backed up.

Is there a way to see what message the queue is choking on? Any ideas?

We're on server version 11.0.2.17

omstat -s:
Service Router Started 19:17:22 0
Local Delivery Started 19:17:22 1847
Internet Mail Gateway Started 19:17:22 0
Sendmail Interface Started 19:17:22 0
Local Client Interface Enabled 19:17:22 1
Remote Client Interface Enabled 19:17:22 22
Test Server Started 19:17:22 0
Request Server Started 19:17:22 0
Print Server Started 19:17:22 0
Directory Synchronization Started 19:17:22 0
Bulletin Board Server Started 19:17:22 0
Background Search Service Started 19:17:22 0
Dump Server Started 19:17:22 0
CDA Server Started 19:17:22 0
POP3 interface Started 19:17:22 0
Omscan Server Started 19:17:22 0
Archiver Started 19:17:22 0
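
To spot this kind of backlog without reading the table by eye, the `omstat -s` output above could be parsed with a short script. This is only a sketch: it assumes every line ends with a status word, a start time and a queue count, exactly as pasted here. `parse_omstat` and `backed_up` are hypothetical helper names, and the threshold of 100 is arbitrary.

```python
import re

# Assumed line layout, taken from the paste above (not from Scalix docs):
#   <service name> <Started|Enabled|Stopped> <start time> <queue count>
LINE_RE = re.compile(
    r"^(?P<name>.+?)\s+(?P<status>Started|Enabled|Stopped)"
    r"\s+(?P<since>\S+)\s+(?P<count>\d+)$"
)

def parse_omstat(text):
    """Return (service, status, queue_count) for each line that fits the layout."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m["name"], m["status"], int(m["count"])))
    return rows

def backed_up(rows, threshold=100):
    """Return (service, queue_count) for queues above the (arbitrary) threshold."""
    return [(name, count) for name, _status, count in rows if count > threshold]

# A few lines copied from the output above.
SAMPLE = """\
Service Router Started 19:17:22 0
Local Delivery Started 19:17:22 1847
Remote Client Interface Enabled 19:17:22 22
"""

if __name__ == "__main__":
    for name, count in backed_up(parse_omstat(SAMPLE)):
        print(f"{name}: {count} messages queued")
```

Lines that carry no queue count (as several `omstat -a` lines do) simply don't match and are skipped.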

omstat -a:
PC Monitor Started NON-STOP 0
Directory Relay Server Started 19:17:21
Notification Server Started 19:17:21 0
Shared memory daemon Started NON-STOP
Notification Monitor Started NON-STOP
Session Monitor Started NON-STOP
Indexer Started NON-STOP
Stats Daemon Started NON-STOP
Container Access Monitor Started NON-STOP
Item Structure Server Stopped 02.11.07
Database Monitor Started 19:17:21
Licence Monitor Daemon Started NON-STOP
LDAP Daemon Started 19:17:21
Queue Manager Started NON-STOP
Item Delete Daemon Started NON-STOP
IMAP Server Daemon Started 19:17:21
SMTP Relay Started 19:17:21
Mime Browser Controller Started 19:50:28
Event Server Started 19:17:21

omshowlog has lots of errors, but the most recent ones are mostly like this:

ERROR Browser (Service 14 ) 04.23.07 20:02:06
[OM.MIME 4000] Browser Args :index.browse -c -o /var/opt/scalix/cn/s/temp/mime_cache/mimenP98cy 0014b37bfd0a505c
Last Msg Id: 20070417160001.BA67D47C081(a)smtp1.xyz.net
Last Msg DirectRef: 00109576b48a13fa


ERROR Browser (Service 14 ) 04.23.07 20:02:10
[OM.MIME 4000] Browser Args :index.browse -c -o /var/opt/scalix/cn/s/temp/mime_cache/mimeZrf4QT 0014b37bfd0a505c
Last Msg Id: 20070417160001.4F5B847C081(a)smtp1.xyz.net
Last Msg DirectRef: 0011ef47c81e0a28


ERROR Browser (Service 14 ) 04.23.07 20:02:10
[OM.MIME 4000] Browser Args :index.browse -c -o /var/opt/scalix/cn/s/temp/mime_cache/mimeWjrvrp 0014b37bfd0a505c
Last Msg Id: 20070417160001.4F5B847C081(a)smtp1.xyz.net
Last Msg DirectRef: 001293c9a783fb6d


ERROR Browser (Service 14 ) 04.23.07 20:02:13
[OM.MIME 4000] Browser Args :index.browse -c -o /var/opt/scalix/cn/s/temp/mime_cache/mimetmGyB3 0014b37bfd0a505c

Earlier today, though, the log had lots of SYS 22 Invalid Argument errors.
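
One rough way to narrow down which message the queue is choking on is to count how often each "Last Msg Id" shows up in the OM.MIME 4000 entries above. The sketch below assumes the log text has already been saved as plain text in the same layout as pasted here; `count_msg_ids` is a hypothetical helper name, and a repeated id is only a hint, not proof that the message is the culprit.

```python
import re
from collections import Counter

# Matches the "Last Msg Id:" lines as they appear in the pasted log entries.
MSG_ID_RE = re.compile(r"^Last Msg Id:\s*(\S+)", re.MULTILINE)

def count_msg_ids(log_text):
    """Count how many error entries mention each message id."""
    return Counter(MSG_ID_RE.findall(log_text))

# Trimmed copy of the three entries above that carry a message id.
SAMPLE = """\
ERROR Browser (Service 14 ) 04.23.07 20:02:06
Last Msg Id: 20070417160001.BA67D47C081(a)smtp1.xyz.net
Last Msg DirectRef: 00109576b48a13fa

ERROR Browser (Service 14 ) 04.23.07 20:02:10
Last Msg Id: 20070417160001.4F5B847C081(a)smtp1.xyz.net
Last Msg DirectRef: 0011ef47c81e0a28

ERROR Browser (Service 14 ) 04.23.07 20:02:10
Last Msg Id: 20070417160001.4F5B847C081(a)smtp1.xyz.net
Last Msg DirectRef: 001293c9a783fb6d
"""

if __name__ == "__main__":
    for msg_id, hits in count_msg_ids(SAMPLE).most_common():
        print(hits, msg_id)
```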

I'm running omtidyallu -M, but last time it still hadn't finished after a week. I'm also running omcheck -s -d > /root/omcheck.sh in case it's a permission error, but I don't see how that would keep occurring.

Sorry for the long post; any help is much appreciated! (Note: I replaced the name of one of our SMTP servers with xyz.net in the logs above, but we get these errors for mail from any incoming mail server, external ones as well.)
- John

Posted: Tue Apr 24, 2007 4:03 am
by ScalixSupport
Hey John!

Check this post:
viewtopic.php?t=7157

Let me know if this helped or not.

Thanks,
Subir

Posted: Tue Apr 24, 2007 1:27 pm
by jmherr
Hello Subir,

Thanks for the response. That was one of the posts I had already been working from. After I did that for a few (not all) of the mime.4000 messages and ran the omcheck script that Scalix generated, the queue went down from 2000 or so to 0 in an hour or two. It was fine until this morning (when, naturally, more users were active).

I still have the omtidyallu running, and we've found that when we try to shut down the ld it partially aborts (possibly because of the large amount of mail in the queue). Also, an omshut never kills about half of the agents, so we are having trouble getting Scalix to shut down cleanly right now. I ended up doing omoff -d0 -a twice and then one more omshut, and the agents all finally stopped. omrc complained about db corruption; however, the scan looked OK, so I manually deleted the 4 lock files I found.

The queue is back up to 1155 now since this morning. I'm still seeing OM.MIME 4000 errors as well as:
ERROR Remote Client (U/I Access ) 04.24.07 12:03:50
[SYS 22] Invalid argument
User Name: Paul Guzman / scalixserver, companyname/CN=Paul Guzman
Current errno value: 22
Last Folder: UserFolder - Filing Cab
-> sfl_GetNewBlock
<- sfl_GetNewBlock
-> sfl_GetNewBlock
<- sfl_GetNewBlock
-> sfl_GetNewBlock
<- sfl_GetNewBlock
-> sfl_GetNewBlock
<- sfl_GetNewBlock
-> sfl_GetNewBlock
<- /build/11.0.2/src/lib/ombase/os/os_lseek.c:49[1,22]
<- /build/11.0.2/src/lib/ombase/sfl/sfl_base.c:989[1,22]
<- /build/11.0.2/src/lib/ct/ct_crext.c:88[1,22]
<- /build/11.0.2/src/lib/ct/ct_craext.c:79[1,22]
<- /build/11.0.2/src/lib/ct/ct_atct.c:704[1,22]
<- /build/11.0.2/src/lib/ct/ct_atct.c:1008[1,22]
<- /build/11.0.2/src/lib/ct/ct_atct.c:454[1,22]

I ran a scan on his account but these errors are occurring for a lot of different users.

Posted: Wed Apr 25, 2007 6:52 am
by ScalixSupport
Hi John!

Can you send the result for the command below:

omvers -v | grep indexer


Thanks,
Subir

local delivery queue growing mail not delivered. bug 15109

Posted: Wed Apr 25, 2007 11:23 am
by jmherr
$ omvers -v | grep indexer
indexer 11.0.2.17

All versions seem to be 11.0.2.17 with the exception of these:
$ omvers -v | grep "\*"
sxchkinstances **********
sxlicmgr.py **********
sxlimitcfg.py **********
sxmsusage **********
sxrescmd.py **********
sxresutils.py **********
sxubermgrcfg.py **********
ommaint **********
sxrescmd.pyc **********
sxpluginutils.pyc **********
cleardb **********
omtclsh **********
prdbl **********
sxaa **********
sxcreatesisindex **********
sxcfgplugin.py **********
sxpluginutils.py **********
sxprepsmartcache **********
sxsacdf **********

--------------------
Since we ended up dead in the water for mail traffic, we went ahead and paid for support to get someone more experienced than us to help troubleshoot.

It appears we may have run into bug #15109, which is a known issue (unfortunately not yet resolved, with a possible fix in 11.0.4). The trigger may have been a fake user we used to set up bulletin boards via LDAP: we can't have a mail group without a member, so we used a fake account called "discard" with a rule to delete all mail that comes to it. That user's wastebasket had grown to around 6GB, and I don't think my attempt to empty it via SAC ever completed.

Basically, we think Scalix was doing standard maintenance and hit this user, which locked the folder and stalled local.delivery, so our LD service hung without crashing. We've since created a new user for this bb configuration and deleted the large mail account (since we couldn't compact it).

One other trick one of the service guys did was to allow 3 local.delivery "services" to run instead of just 1, so that even if one crashes or locks up, the other 2 can keep running in the meantime. I'm not sure what he did to make that change.
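
To illustrate why extra delivery workers help, here is a toy model in Python. It is not Scalix configuration (I still don't know the actual setting) and everything in it, including `run_delivery` and the "stuck" message, is made up for the illustration: one message hangs whichever worker picks it up, and the question is whether the rest of the queue still drains.

```python
import queue
import threading
import time

def run_delivery(messages, workers, stuck_id, wait=0.5):
    """Drain `messages` with `workers` threads; `stuck_id` hangs its worker.

    Returns the sorted list of messages delivered within `wait` seconds.
    """
    q = queue.Queue()
    for m in messages:
        q.put(m)
    delivered = []
    lock = threading.Lock()
    never = threading.Event()  # never set: stands in for a locked folder

    def worker():
        while True:
            try:
                m = q.get_nowait()
            except queue.Empty:
                return
            if m == stuck_id:
                never.wait()  # this worker hangs but does not crash
            with lock:
                delivered.append(m)

    # Daemon threads so the hung worker does not keep the program alive.
    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()
    deadline = time.monotonic() + wait
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    with lock:
        return sorted(delivered)

if __name__ == "__main__":
    msgs = list(range(10))  # message 0 is the one that hangs
    print("1 worker :", run_delivery(msgs, workers=1, stuck_id=0))
    print("3 workers:", run_delivery(msgs, workers=3, stuck_id=0))
```

With a single worker the stuck message at the head of the queue blocks everything behind it, which matches what we saw; with three workers one of them still hangs, but the other two deliver the rest. The stuck message itself is never delivered either way, so this hides the symptom rather than fixing the cause.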