Queue Manager keeps aborting once every week

Discuss the Scalix Server software

Moderators: ScalixSupport, admin

vinayakb

Queue Manager keeps aborting once every week

Postby vinayakb » Wed Aug 08, 2007 12:07 pm

Hello,

I run a Scalix Server with the enterprise license. Every week I find that the Queue Manager aborts with the following message in the logs.

SERIOUS ERROR Queue Manager (Queue Manager ) 08.08.07 02:11:41
[OM 10270] Process about to terminate due to error.
Signal (Segmentation Violation) trapped by process 3578
Procedure trace follows:
<- ql_GetNextMsgDue
-> ql_GetNextMsgDue
<- ql_GetNextMsgDue
-> ql_GetNextMsgDue
<- ql_GetNextMsgDue
-> ql_GetNextMsgDue
<- ql_GetNextMsgDue
-> ql_GetNextMsgDue
<- ql_GetNextMsgDue
-> ql_GetNextMsgDue
<- ql_GetNextMsgDue
-> ql_GetNextMsgDue
<- ql_GetNextMsgDue
<- qm_RespondToExpectantReaders
-> qm_ProcessReleaseMsg
-> ql_AddMsgToMemList


SERIOUS ERROR Queue Manager (Queue Manager ) 08.08.07 02:11:41
[OM 10272] BACKTRACE:
/opt/scalix/lib/libom_er.so(er_add_backtrace+0xc6)[0x421ee6]
/opt/scalix/lib/libom_er.so[0x4221e6]
/opt/scalix/lib/libom_er.so(er_DumpProcAndExit+0x1f)[0x42238f]
/lib/tls/libpthread.so.0[0x3a5898]
/opt/scalix/lib/libom_ql.so[0x6a3a71]
/opt/scalix/lib/libom_ql.so(ql_AddMsgToMemList+0x69)[0x6a3b36]
queue.manager[0x804a7e1]
queue.manager[0x80512e4]
queue.manager[0x804bd37]
queue.manager[0x804d4a7]
queue.manager[0x804dca2]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0x227de3]
queue.manager[0x8049f85]




How can I debug this?

vinayakb

Postby vinayakb » Wed Aug 15, 2007 5:36 am

Do you need any more information? Can someone please suggest how I could get a response to my query?

Thanks

mikethebike
Posts: 566
Joined: Mon Nov 28, 2005 4:16 pm
Location: England

Postby mikethebike » Wed Aug 15, 2007 9:54 am

Does it abort at the same time every week? Is it right after a startup?

Maybe you could increase the log level for the Queue Manager to see what is happening, and then capture the event:

omconflvl -a qm 9

omshowlog -l9 0s qm > /tmp/qm.log

You may need to set up a cron job to run at, let's say, 04:00, as the event log may roll over and you will lose the data.
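The capture step above could be scheduled roughly like this (a sketch only: the 04:00 time and /tmp paths are examples, the `/opt/scalix/bin` location is assumed from the library paths in the backtrace, and the `omshowlog` flags are taken from the command above):

```shell
# Example crontab entry (crontab -e): dump queue manager events at
# level 9 every day at 04:00, before the event log rolls over.
# Note: % must be escaped as \% inside a crontab line.
0 4 * * * /opt/scalix/bin/omshowlog -l9 0s qm > /tmp/qm-$(date +\%a).log 2>&1
```

Using `$(date +\%a)` in the output filename keeps one capture per weekday, so a crash on Wednesday does not overwrite earlier evidence.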

vinayakb

Postby vinayakb » Wed Aug 15, 2007 1:18 pm

It's been three weeks now; it aborts every Wednesday at around 12:00 am.

Restarting Scalix starts everything up, but within 5-10 minutes the queue manager is dead again.
After 7-8 restarts, the system is operational again.

I am using version 11.1.0.

Thanks

mikethebike
Posts: 566
Joined: Mon Nov 28, 2005 4:16 pm
Location: England

Postby mikethebike » Thu Aug 16, 2007 6:30 am

Is there anything being reported in the event log just before the abort?
Is there a ~logs/qmgr.tomb file?

Are you using omrc to start up all the services?

What does omqstat -T show?

I am heading down the route of thinking your message pool is corrupt, but there seems to be no reference in the how-tos on how to look at and resolve issues in the message pools or queues. I think these are still relevant in Scalix, so you may want to check out:
http://www.samsungcontact.com/howtos.html

and look at the document titled "Queues and Parallel Processing".
You may need to start up Scalix, get a dump of the message queue, and manually remove any offending messages.

Mick

vinayakb

Postby vinayakb » Fri Aug 17, 2007 1:54 am

I will have to wait till next Wednesday until the crash occurs. I do not have the logs from the last one.

There is no qmgr.tomb file.

Scalix is started using /etc/init.d/scalix start.

The crash occurs every Wednesday morning, after the server has run one full week from the previous crash.

Usually the sequence of events after the crash is

I stop scalix and then restart it.
Everything comes up except the CDA server, and 10 minutes later queue.manager is dead.
I restart, and the same thing happens.

After 2-3 more restarts, CDA server is fine but queue.manager dies in 10 mins.

A couple of restarts later, it's all fine until the next week.

I will post the logs next week.

Does this help?

Thanks.

mikethebike
Posts: 566
Joined: Mon Nov 28, 2005 4:16 pm
Location: England

Postby mikethebike » Fri Aug 17, 2007 5:27 am

Are you getting any dbvista errors in your fatal log?
Is there anything in your ~logs/ftlvis.log to suggest directory issues?
When CDA does not start, is there a ~sys/omcda.lock file present?

vinayakb

Postby vinayakb » Wed Aug 22, 2007 1:35 am

Guys,

Could it have to do with the number of open files on the system? After 6 days of the server being up, the lsof count is 28365.

Is that reasonable?

It looks like processes are just sitting around with open files.

To validate my claim, I pre-emptively shut down the Scalix services and restarted them. The number of open files is now 13300.

I am going to track this number every day and see if it grows over time.

I should mention that /proc/sys/fs/file-max yields 787524.

So that should not have been the problem.
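For reference, the current system-wide handle count can be compared against file-max directly (this is the standard Linux /proc interface, nothing Scalix-specific):

```shell
#!/bin/sh
# /proc/sys/fs/file-nr holds three fields:
# allocated handles, allocated-but-unused handles, and the maximum
# (the same value as /proc/sys/fs/file-max).
read allocated unused max < /proc/sys/fs/file-nr
echo "system-wide file handles: $allocated of $max"
```

Note that `lsof` counts open-file *records* per process (so shared libraries and duplicated descriptors are counted many times), so its total will normally be much larger than the kernel's allocated-handle figure.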

Any ideas?
Thanks

mikethebike
Posts: 566
Joined: Mon Nov 28, 2005 4:16 pm
Location: England

Postby mikethebike » Wed Aug 22, 2007 5:17 am

Did you have the same problems when restarting last night?
Did you manage to get any info from the event log?

vinayakb

Postby vinayakb » Thu Aug 23, 2007 2:04 am

There was no problem restarting Scalix. It was absolutely smooth.

The event log had nothing like the entries we saw before.

