Bunch of issues with Scalix 11.0.1 after upgrade from 10.0.5

thatitguy · Postby **thatitguy** » Fri Feb 23, 2007 10:27 am

Hello all...
I've just upgraded a Scalix 10.0.5 server to 11.0.1, and am experiencing some serious flakiness.
I've searched the forums for the issues and errors I'm seeing and have come up empty, so now I'll post my query here. If I've missed relevant forum entries, feel free to point them out to me and I'll carry on.

Anyway... After the upgrade, I'm seeing several issues:
1. Tomcat keeps randomly crapping out and leaving me with no webmail or SAC.
2. ldapmapper has hung 2 times in 36 hours; I have to kill -9 and restart the process.
3. A bunch of Scalix services keep stopping; the daemons are running (omstat -a shows all happiness) but omstat -s shows:


Service Router                Started        04:51:22       0
Local Delivery                Partially Abor 04:50:49       1268
Internet Mail Gateway         Started        04:51:27       0
Sendmail Interface            Started        04:51:21       0
Local Client Interface        Enabled        05:50:19       0
Remote Client Interface       Enabled        04:51:05       3
Test Server                   Stopped        02.22.07       0
Request Server                Stopped        02.22.07       0
Print Server                  Stopped        02.22.07       0
Directory Synchronization     Stopped        02.22.07       0
Bulletin Board Server         Stopped        02.22.07       0
Background Search Service     Stopped        02.22.07       0
Dump Server                   Stopped        02.22.07       0
CDA Server                    Stopped        02.22.07       0
POP3 interface                Started        04:50:57       0
Omscan Server                 Stopped        02.22.07       0
Archiver                      Stopped        02.22.07       0

I keep trying to restart the individual services and they run for a moment and die.
When this happens, all I can do is stop Scalix altogether and restart it. Then everything's happy for anywhere from an hour to a day.
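For anyone watching for the same symptom, the `omstat -s` listing above can be checked mechanically. The following is a minimal sketch, not a Scalix tool: it assumes the four-column layout shown in this post (service name, status, a time or date, a trailing counter whose meaning is not stated in the thread), and it assumes that only `Started` and `Enabled` count as healthy. The function names are hypothetical.

```python
import re

# One row of `omstat -s` as pasted in this thread: name, status, stamp, counter.
# The status column can be truncated ("Partially Abor") and may be separated
# from the stamp by a single space, so the pattern anchors on the stamp shape.
ROW = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<stamp>\d\d[:.]\d\d[:.]\d\d)\s+(?P<count>\d+)\s*$"
)

# Assumption: these two states mean "healthy"; anything else needs a look.
HEALTHY = {"Started", "Enabled"}


def parse_omstat(text):
    """Return (name, status, stamp, count) tuples for every recognizable row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((m["name"], m["status"], m["stamp"], int(m["count"])))
    return rows


def unhealthy(rows):
    """Return (name, status) for rows whose status is not in HEALTHY."""
    return [(name, status) for name, status, _stamp, _count in rows
            if status not in HEALTHY]


# A few rows copied from the listing above.
SAMPLE = """\
Service Router                Started        04:51:22       0
Local Delivery                Partially Abor 04:50:49       1268
Remote Client Interface       Enabled        04:51:05       3
Test Server                   Stopped        02.22.07       0
"""

if __name__ == "__main__":
    for name, status in unhealthy(parse_omstat(SAMPLE)):
        print(f"{name}: {status}")
```

In practice the text would come from capturing the output of `omstat -s` rather than from a pasted sample.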

Background: RHEL4, fully up2date (man do I miss urpmi)

Big ol' beefy Dell server, loads of RAM, RAIDed drives etc.

As a side note, the indexer is still running at 100% CPU even after nearly 3 days. This server has a 60 GB (!) message store, so I'm thinking that it's just ploughing through a crapload of mail.

Finally, omshowlog gives me this:



ERROR                          Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 24070] Debug message for Lab use :
ct_convFigaroCRec: Failed to convert CreatorORN to UTF8
Current errno value: 2
  Last Msg Id: 20070223131520.2605D137939(a)SMTPRelay11.na.blackberry.net
  Last Msg DirectRef: 000fd113845139f7


ERROR                          Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 28875] Attempt to read a block which does not exist from a blocked item.
Current errno value: 2
  Last Msg Id: 20070223131520.2605D137939(a)SMTPRelay11.na.blackberry.net
  Last Msg DirectRef: 000fd113845139f7
        -> sfl_OpenItem
        -> im_ItemRef2FName
        <- im_ItemRef2FName
        -> sfl_OpenSfl
        -> im_OpenItem
        -> im_ItemRef2FName
        <- im_ItemRef2FName
        <- im_OpenItem
        <- sfl_OpenSfl
        <- sfl_OpenItem
        <- im_OpenItem
        -> im_ItemRef2FName
        <- im_ItemRef2FName
        <- /build/11.0.1/src/lib/ombase/sfl/sfl_Blcked.c:1394[100,28875]
        <- /build/11.0.1/src/lib/ombase/sfl/sfl_Blcked.c:1697[100,28875]
        <- /build/11.0.1/src/lib/ct/ct_rdext.c:153[100,28875]


ERROR                          Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 3539] Content Record 0 in container ~/data/00000h9/003v393:1 could not be upgraded.
Current errno value: 2
  Last Msg Id: 20070223131520.2605D137939(a)SMTPRelay11.na.blackberry.net
  Last Msg DirectRef: 000fd113845139f7


WARNING                        Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 3543] Failed to upgrade a Content Record to current container format.

Current errno value: 2
  Last Msg Id: 20070223131520.2605D137939(a)SMTPRelay11.na.blackberry.net
  Last Msg DirectRef: 000fd113845139f7
        -> im_ItemRef2FName
        <- im_ItemRef2FName
        -> sfl_OpenSfl
        -> im_OpenItem
        -> im_ItemRef2FName
        <- im_ItemRef2FName
        <- im_OpenItem
        <- sfl_OpenSfl
        <- sfl_OpenItem
        <- im_OpenItem
        -> im_ItemRef2FName
        <- im_ItemRef2FName
        <- /build/11.0.1/src/lib/ombase/sfl/sfl_Blcked.c:1394[100,28875]
        <- /build/11.0.1/src/lib/ombase/sfl/sfl_Blcked.c:1697[100,28875]
        <- /build/11.0.1/src/lib/ct/ct_rdext.c:153[100,28875]
        <- /build/11.0.1/src/lib/ct/ct_upgrade.c:1136[3,3543]


SERIOUS ERROR                  Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 10270] Process about to terminate due to error.
Signal (Segmentation Violation) trapped by process 29044
Procedure trace follows:
  -> sfl_OpenSfl
  -> im_OpenItem
  -> im_ItemRef2FName
  <- im_ItemRef2FName
  <- im_OpenItem
  <- sfl_OpenSfl
  <- sfl_OpenItem
  <- im_OpenItem
  -> im_ItemRef2FName
  <- im_ItemRef2FName
  <- /build/11.0.1/src/lib/ombase/sfl/sfl_Blcked.c
  <- /build/11.0.1/src/lib/ombase/sfl/sfl_Blcked.c
  <- /build/11.0.1/src/lib/ct/ct_rdext.c
  <- /build/11.0.1/src/lib/ct/ct_upgrade.c
  <- /build/11.0.1/src/lib/ct/ct_pend.c
Current errno value: 2
  Last Msg Id: 20070223131520.2605D137939(a)SMTPRelay11.na.blackberry.net
  Last Msg DirectRef: 000fd113845139f7


SERIOUS ERROR                  Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 10272] BACKTRACE:
/opt/scalix/lib/libom_er.so(er_add_backtrace+0xc6)[0xbafee6]
/opt/scalix/lib/libom_er.so[0xbb01e6]
/opt/scalix/lib/libom_er.so(er_DumpProcAndExit+0x1f)[0xbb038f]
/lib/tls/libpthread.so.0[0x3a5898]
/opt/scalix/lib/libom_ct.so(PendOpenCtner+0x63c)[0x2fe4fc]
/opt/scalix/lib/libom_ct.so(PendDelete+0x1d1)[0x2fdad9]
/opt/scalix/lib/libom_ct.so(CloseCtner+0x3c2)[0x2e6c49]
/opt/scalix/lib/libom_ct.so(ct_CloseCtner+0x5d)[0x2d55b5]
local.delivery[0x804f01c]
local.delivery[0x8059ac7]
local.delivery[0x80534a6]
local.delivery[0x805cfa9]
local.delivery[0x805eca1]
local.delivery[0x805f949]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0x1acde3]
local.delivery[0x804dce1]
Current errno value: 2
  Last Msg Id: 20070223131520.2605D137939(a)SMTPRelay11.na.blackberry.net
  Last Msg DirectRef: 000fd113845139f7

Ok... sorry for the long post, but I'm somewhat stumped at this point. Any ideas out there as to WTF is going on?
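A log this long is easier to reason about once it is boiled down. The sketch below is only an illustration, assuming the entry layout seen in the omshowlog excerpt above (a severity header line, then a line starting with `[OM nnnn]`, then optional `Last Msg DirectRef:` lines). It counts entries per severity and OM code and collects the distinct DirectRef values, which shows at a glance whether every error points at the same message. The function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Header line: severity, service name, date, time (layout as in the excerpt).
HEADER = re.compile(
    r"^(?P<severity>SERIOUS ERROR|ERROR|WARNING)\s{2,}"
    r"(?P<service>.+?)\s+(?P<date>\d\d\.\d\d\.\d\d)\s+(?P<time>\d\d:\d\d:\d\d)\s*$"
)
OM_CODE = re.compile(r"^\[OM (?P<code>\d+)\]")
DIRECT_REF = re.compile(r"^\s*Last Msg DirectRef:\s*(?P<ref>\S+)")


def summarize(log_text):
    """Return (Counter keyed by (severity, OM code), set of DirectRef values)."""
    counts = Counter()
    refs = set()
    severity = None  # severity of the header we are currently inside
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            severity = header["severity"]
            continue
        code = OM_CODE.match(line)
        if code and severity is not None:
            counts[(severity, int(code["code"]))] += 1
            severity = None  # only the first [OM ...] line after a header counts
            continue
        ref = DIRECT_REF.match(line)
        if ref:
            refs.add(ref["ref"])
    return counts, refs


# Trimmed entries copied from the log above.
SAMPLE = """\
ERROR                          Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 24070] Debug message for Lab use :
ct_convFigaroCRec: Failed to convert CreatorORN to UTF8
Current errno value: 2
  Last Msg DirectRef: 000fd113845139f7

WARNING                        Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 3543] Failed to upgrade a Content Record to current container format.
  Last Msg DirectRef: 000fd113845139f7

SERIOUS ERROR                  Local Delivery(Local Delivery) 02.23.07 04:52:03
[OM 10270] Process about to terminate due to error.
  Last Msg DirectRef: 000fd113845139f7
"""

if __name__ == "__main__":
    counts, refs = summarize(SAMPLE)
    for (severity, code), n in sorted(counts.items()):
        print(f"{severity} OM {code}: {n}")
    print("DirectRefs:", sorted(refs))
```

In the excerpt above, every entry carries the same Last Msg Id and DirectRef, which suggests Local Delivery is tripping over one particular message each time.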

Thanks in advance!
Rubin

dkelly · Postby **dkelly** » Fri Feb 23, 2007 11:03 am

A quick question...

I know you mentioned RAID, but how is the message store mounted? Is it on block device storage or via NFS?

Cheers

Dave

florian · Postby **florian** » Fri Feb 23, 2007 11:12 am

Also, did you happen to run omscan to verify message store integrity before the upgrade?

Florian.

thatitguy · Postby **thatitguy** » Fri Feb 23, 2007 11:25 am

The server has one ext3-formatted LVM volume, ~550 GB, mounted as /. It's most definitely NOT mounted via NFS.

We did not run omscan prior to the upgrade.

Rubin

thatitguy · Postby **thatitguy** » Fri Feb 23, 2007 4:24 pm

More info...

When I run omscan on one of the affected users (read: one of the 2 Blackberry users on the system), I get:


omscan running on 02.23.07 at 11:50:45.
(host line deleted)
Fix mode requested.

Last omscan tool run on 02.23.07 at 11:50:16; duration 1 minute(s).
Previous server cycle run on 02.20.07 at 15:05:56; duration 98 minute(s).
Current server cycle not started; service reset or delayed.


Active scan option requested.

Scanning file/dir links .... done.

CAUTION: Scanning of message store has started.
         Mounted file/dir links must be maintained during the scan.
         VxFS file system must not be reorganized - see omscan(1M).

Checking/Scanning user trays ....

omscan : [OM 4951]
A serious error has occurred.  Please see the log files.
Event logged on 02.23.07 at 11:50:45.
Owner/Context Info : ~/user02/g0000qh
Additional error info:
omscan : [OM 3457] Container is an old version and needs to be upgraded.

 done.

Am I correct in assuming that this user's mailbox is hosed? Should I do an omcpoutu, clear the mailbox out and omcpinu? Is there a better way to do this, or any suggestions as to what to do at this point?
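When checking more than one user, it helps to pull just the flagged containers out of each omscan report. This is a rough sketch under the assumption that reports look like the one above: an `Owner/Context Info :` line naming the container, followed by an `omscan : [OM nnnn] message` line describing the problem. The function name `flagged_containers` is hypothetical and the parsing is only as reliable as that assumed layout.

```python
import re

CONTEXT = re.compile(r"^Owner/Context Info\s*:\s*(?P<ctx>\S+)")
ERROR = re.compile(r"^omscan\s*:\s*\[OM (?P<code>\d+)\]\s*(?P<msg>.*)$")


def flagged_containers(report):
    """Return (context, OM code, message) for each detailed error in a report."""
    findings = []
    context = None  # most recent Owner/Context Info value seen
    for raw in report.splitlines():
        line = raw.strip()
        ctx = CONTEXT.match(line)
        if ctx:
            context = ctx["ctx"]
            continue
        err = ERROR.match(line)
        # Skip the generic "[OM 4951]" banner, which has no message text on
        # its line and appears before any context is known.
        if err and context is not None and err["msg"]:
            findings.append((context, int(err["code"]), err["msg"]))
    return findings


# The relevant lines copied from the report above.
SAMPLE = """\
omscan : [OM 4951]
A serious error has occurred.  Please see the log files.
Event logged on 02.23.07 at 11:50:45.
Owner/Context Info : ~/user02/g0000qh
Additional error info:
omscan : [OM 3457] Container is an old version and needs to be upgraded.
"""

if __name__ == "__main__":
    for context, code, msg in flagged_containers(SAMPLE):
        print(f"{context}: OM {code}: {msg}")
```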

Thanks,
Rubin

thatitguy · Postby **thatitguy** » Thu Mar 08, 2007 12:08 pm

In the three 10.x -> 11.x upgrades I've done, Scalix has run incredibly poorly for the first 24-48 hours after the upgrade. I understand that a ton of background processing takes place, but it would be really helpful to have a technote somewhere saying whether this behaviour is to be expected and, if it's not, what to do about the following, which I've experienced with every upgrade I've done so far:

Scalix-tomcat dies repeatedly. Yesterday, 2 days after a 10.0.254-11.0.2 upgrade, I could NOT log in to either Webmail or SAC without the tomcat services dying. Today, no problem; all is running fine. NOTHING has been changed on the server, it hasn't been rebooted since the upgrade, and the Scalix services haven't been hard restarted either.
In addition, I've had one site where the ldapmapper process would hang repeatedly and have to be 'kill -9'ed and restarted (omon slapd).

I chased my tail for several hours yesterday, perusing the forums and Googling, finally gave up, came back today and all is running fine. It seems that how long the Tomcat flakiness lasts is directly related to the size of the mail store on the system: a server I upgraded a few weeks ago with a ~60 GB mail store took nearly a week before ldapmapper and tomcat would stop spontaneously dying off and leaving SWA and SAC users dead in the water.

Hopefully, this note will save some folks some tail chasing, and maybe even prompt an explanation from the Scalix crew as to *why* these services are so flaky post upgrade.

Rubin

