Nightly Snap Backup Fails - Server Crashes

Discuss the Scalix Server software

Moderators: ScalixSupport, admin

lsdcaltan

Nightly Snap Backup Fails - Server Crashes

Postby lsdcaltan » Wed Nov 21, 2007 9:47 am

We seem to be having a problem with the nightly snap and backup process.

This is not a consistent error, it shows up every 3-4 days. The issue begins with the Snap process, then the server loads up to about 15 processor load and maxes out the available RAM, then Scalix crashes out and the server has to be restarted.

Occasionally we get this error when we try to turn the Snap volume off:

Code: Select all

Umounting Scalix Backup Volume
umount: /mnt/sxbackup: not mounted
Removing the snapshot logical volume
  device-mapper ioctl cmd 9 failed: Cannot allocate memory
  Couldn't load device 'vgscalix-sxbackup'.
  Unable to deactivate logical volume "sxbackup"
[root@mail1 ~]# kill -9 12157
[root@mail1 ~]# sxsnapoff
Umounting Scalix Backup Volume
umount: /mnt/sxbackup: not mounted
Removing the snapshot logical volume
  Unable to deactivate logical volume "sxbackup"


This is the error we are trying to troubleshoot:

Code: Select all

Nov 20 21:30:14 mail1 kernel: lvcreate: page allocation failure. order:0, mode:0xd0
Nov 20 21:30:14 mail1 kernel:  [<c013f1bf>] __alloc_pages+0x28b/0x298
Nov 20 21:30:14 mail1 kernel:  [<f88d65e7>] alloc_pl+0x27/0x3d [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<f88d66c2>] client_alloc_pages+0x15/0x47 [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<f88d7070>] kcopyd_client_create+0x64/0x9f [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<f8ac2697>] snapshot_ctr+0x231/0x2b8 [dm_snapshot]
Nov 20 21:30:14 mail1 kernel:  [<f88d3088>] dm_table_add_target+0xfc/0x169 [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<f88d509b>] populate_table+0x8a/0xaf [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<f88d50f7>] table_load+0x37/0xf9 [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<f88d58af>] ctl_ioctl+0xd1/0x144 [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<f88d50c0>] table_load+0x0/0xf9 [dm_mod]
Nov 20 21:30:14 mail1 kernel:  [<c0164faa>] sys_ioctl+0x227/0x269
Nov 20 21:30:14 mail1 kernel:  [<c02c62a3>] syscall_call+0x7/0xb
Nov 20 21:30:14 mail1 kernel: device-mapper: : Could not create kcopyd client
Nov 20 21:30:14 mail1 kernel:
Nov 20 21:30:15 mail1 kernel: device-mapper: error adding target to table
Nov 20 21:40:44 mail1 sshd(pam_unix)[12325]: session opened for user root by root(uid=0)
Nov 20 21:43:02 mail1 kernel: lvremove: page allocation failure. order:0, mode:0xd2
Nov 20 21:43:02 mail1 kernel:  [<c013f1bf>] __alloc_pages+0x28b/0x298
Nov 20 21:43:02 mail1 kernel:  [<c014e21a>] __vmalloc+0xaf/0xee
Nov 20 21:43:02 mail1 kernel:  [<c014e26f>] vmalloc+0x16/0x19
Nov 20 21:43:02 mail1 kernel:  [<f88d26b1>] dm_vcalloc+0x1d/0x42 [dm_mod]
Nov 20 21:43:02 mail1 kernel:  [<f8ac2253>] init_exception_table+0x18/0x4f [dm_snapshot]
Nov 20 21:43:02 mail1 kernel:  [<f8ac2416>] init_hash_tables+0x88/0xd8 [dm_snapshot]
Nov 20 21:43:02 mail1 kernel:  [<f8ac2641>] snapshot_ctr+0x1db/0x2b8 [dm_snapshot]
Nov 20 21:43:02 mail1 kernel:  [<f88d3088>] dm_table_add_target+0xfc/0x169 [dm_mod]
Nov 20 21:43:02 mail1 kernel:  [<f88d509b>] populate_table+0x8a/0xaf [dm_mod]
Nov 20 21:43:02 mail1 kernel:  [<f88d50f7>] table_load+0x37/0xf9 [dm_mod]
Nov 20 21:43:02 mail1 kernel:  [<f88d58af>] ctl_ioctl+0xd1/0x144 [dm_mod]
Nov 20 21:43:02 mail1 kernel:  [<f88d50c0>] table_load+0x0/0xf9 [dm_mod]
Nov 20 21:43:02 mail1 kernel:  [<c0164faa>] sys_ioctl+0x227/0x269
Nov 20 21:43:02 mail1 kernel:  [<c02c62a3>] syscall_call+0x7/0xb
Nov 20 21:43:02 mail1 kernel: device-mapper: : Unable to allocate hash table space
Nov 20 21:43:02 mail1 kernel:
Nov 20 21:43:02 mail1 kernel: device-mapper: error adding target to table


Any assistance on this would be very helpful, I'm new to this job and new to Scalix and all this experience has taught me is that I'm much happier administering Exchange. I would really appreciate it if someone could help turn my opinion around, I really would prefer to use open source software whenever possible.

Thanks.

Valerion
Scalix Star
Scalix Star
Posts: 2730
Joined: Thu Feb 26, 2004 7:40 am
Location: Johannesburg, South Africa
Contact:

Postby Valerion » Fri Nov 23, 2007 6:15 am

How much memory do you have? When it maxes out, try running top and press "M" (not "m") and see if you can see which process is using so much memory. Also post the output of free when this happens as well as when it's running normally.

Tomcat should be using a significant amount, but not all, so I doubt it's that.

lsdcaltan

Postby lsdcaltan » Fri Nov 23, 2007 7:08 pm

I have 1.5gig of RAM on this box, and there are only about 60 users total.

This is the output of top right now:

Code: Select all

 2978 root      23   0  784m 591m  15m S  0.0 38.9   0:08.73 java
 2824 scalix    18   2 38256  22m 5680 S  0.0  1.5   4:38.06 local.delivery
 2705 scalix    16   0 25412  13m   9m S  0.0  0.9   0:00.26 omsmdm
 2925 apache    15   0 18464  10m 6892 S  0.0  0.7   0:01.70 httpd
 2297 apache    16   0 18460  10m 6892 S  0.0  0.7   0:00.82 httpd
 2926 apache    16   0 18488  10m 6868 S  0.0  0.7   0:01.71 httpd
 2927 apache    15   0 18468  10m 6876 S  0.0  0.7   0:01.65 httpd
 2929 apache    16   0 18452  10m 6880 S  0.0  0.7   0:01.79 httpd
 2930 apache    15   0 18460  10m 6868 S  0.0  0.7   0:01.61 httpd
 2922 apache    16   0 18444  10m 6884 S  0.0  0.7   0:01.58 httpd
 2928 apache    15   0 18444  10m 6880 S  0.0  0.7   0:01.50 httpd
 2924 apache    15   0 18440  10m 6880 S  0.0  0.7   0:01.60 httpd
21671 apache    15   0 18456  10m 6856 S  0.0  0.7   0:00.20 httpd
21684 apache    15   0 18428  10m 6868 S  0.0  0.7   0:00.21 httpd
30449 apache    16   0 18392  10m 6868 S  0.0  0.7   0:00.78 httpd
21685 apache    16   0 18368  10m 6868 S  0.0  0.7   0:00.23 httpd
 2887 root      16   0 18176   9m 6548 S  0.0  0.7   0:00.25 httpd
 7141 postgres  16   0 21312 9896 8588 S  0.0  0.6   0:01.47 postmaster
 8596 60580     16   0 26504 9264 5644 S  0.0  0.6   0:06.74 in.imap41d
 4690 60537     16   0 24740 8816 5288 S  0.0  0.6   0:15.51 ual.remote
 7140 postgres  16   0 21292 8716 7496 S  0.0  0.6   0:00.22 postmaster
 9914 60613     16   0 25604 8616 6128 S  0.0  0.6   0:01.41 in.imap41d
 2836 root      18   0 66988 8440 4428 S  0.0  0.5   0:00.00 pop3.server
 6284 postgres  16   0 21232 8112 6996 S  0.0  0.5   0:00.04 postmaster
 2724 scalix    16   0 23148 7744 3796 S  0.0  0.5   0:24.73 omlicmon
 4744 60597     16   0 22696 7400 4948 S  0.0  0.5   0:06.12 ual.remote
 3801 scalix    21   5 23664 7368 5060 S  0.0  0.5   0:19.05 index.browse
 3800 scalix    21   5 23564 7348 5088 S  0.0  0.5   0:17.45 index.browse
 3748 scalix    20   5 24084 7328 5084 S  0.0  0.5   0:21.03 index.browse
 2700 scalix    15   0 17224 7304 2852 S  0.0  0.5   0:08.95 omctmon
 3747 scalix    22   5 22980 7252 5044 S  0.0  0.5   0:18.42 index.browse
 3802 scalix    20   5 22544 7140 5060 S  0.0  0.5   0:18.10 index.browse
 3788 scalix    20   5 23344 7100 5064 S  0.0  0.5   0:18.74 index.browse
 3803 scalix    21   5 21720 7088 5036 S  0.0  0.5   0:17.92 index.browse
 3763 scalix    21   5 22064 7076 5048 S  0.0  0.5   0:20.01 index.browse
 3765 scalix    21   5 23344 7068 5092 S  0.0  0.5   0:18.21 index.browse
 3799 scalix    21   5 21660 7068 5072 S  0.0  0.5   0:16.28 index.browse
 3789 scalix    20   5 21892 7004 5060 S  0.0  0.5   0:17.25 index.browse
 3790 scalix    20   5 22996 7004 5048 S  0.0  0.5   0:18.48 index.browse
 3749 scalix    21   5 21588 6880 5052 S  0.0  0.4   0:18.04 index.browse
 3792 scalix    21   5 23288 6880 5032 S  0.0  0.4   0:17.74 index.browse
 8792 60580     15   0 24668 6808 4744 S  0.0  0.4   0:01.55 in.imap41d
 3786 scalix    20   5 22556 6772 5036 S  0.0  0.4   0:18.05 index.browse
 3764 scalix    21   5 22892 6768 5032 S  0.0  0.4   0:17.02 index.browse
 4024 60563     15   0 22924 6748 4952 S  0.0  0.4   0:04.04 ual.remote
 3787 scalix    21   5 23088 6708 5048 S  0.0  0.4   0:17.81 index.browse
 3804 scalix    20   5 21424 6676 5016 S  0.0  0.4   0:17.79 index.browse
 3798 scalix    21   5 22832 6668 5040 S  0.0  0.4   0:16.24 index.browse
 3785 scalix    21   5 22908 6660 5044 S  0.0  0.4   0:17.10 index.browse
 6736 postgres  16   0 21172 6520 5468 S  0.0  0.4   0:00.06 postmaster
 2814 scalix    16   0 22180 6296 4876 S  0.0  0.4   0:49.86 unix.out
10168 60613     15   0 21760 6052 4560 S  0.0  0.4   0:00.16 in.imap41d
  771 scalix    16   0 21516 6016 4480 S  0.0  0.4   0:03.27 mime.browse
10167 60613     16   0 21724 5980 4472 S  0.0  0.4   0:00.11 in.imap41d


As you can see, Java is using a very large amount of memory even before the crash state happens. I can't recreate the crash conditions, I have to wait till it happens.

This is the output of free:

Code: Select all

 total       used       free     shared    buffers     cached
Mem:       1554696    1373028     181668          0     154576     369408
-/+ buffers/cache:     849044     705652
Swap:      3148732      10968    3137764


Any help would be greatly appreciated.[/b]

Valerion
Scalix Star
Scalix Star
Posts: 2730
Joined: Thu Feb 26, 2004 7:40 am
Location: Johannesburg, South Africa
Contact:

Postby Valerion » Mon Nov 26, 2007 4:18 am

On a running system this seems not too bad. Would prefer that nothing is swapped, but the amount isn't excessive and shouldn't bother you at all. You may want to reduce the memory for tomcat a tad, but that's preference.

You need to monitor your machine closely and see if any large apps start up (maybe a large DB query or similar) that gets started infrequently. Also check the fatal log in Scalix to see if there's anything else logged.

lsdcaltan

Postby lsdcaltan » Mon Nov 26, 2007 10:31 am

This is the output of the Fatal log for the day of the crash:

Code: Select all

SERIOUS ERROR           Administration(Scalix Admin C) Sat Nov 17 15:03:52 2007
[OM 10270] Process about to terminate due to error.
Signal (Segmentation Violation) trapped by process 2867
Procedure trace follows:
  -> dr_ACIModContextFlags
  <- dr_ACIModContextFlags
  <- ul_getBySXUID
  <- sis_create
  -> sis_create
  -> ul_getBySXUID
  <- ul_getBySXUID
  <- sis_create
  -> sis_connect
  <- sis_connect
  -> sis_reflist
  -> sis_disconnect
  <- sis_disconnect
  <- sis_reflist
  -> sis_free
  -> sis_free
Pid of logging process: 2867


SERIOUS ERROR           Administration(Scalix Admin C) Sat Nov 17 15:03:52 2007
[OM 10272] BACKTRACE:
/opt/scalix/lib/libom_er.so(er_add_backtrace+0xc6)[0x4b2ee6]
/opt/scalix/lib/libom_er.so[0x4b31dd]
/opt/scalix/lib/libom_er.so(er_DumpProcAndExit+0x1f)[0x4b337f]
/lib/tls/libpthread.so.0[0x8128b8]
/lib/tls/libc.so.6[0xc19a9a]
/lib/tls/libc.so.6[0xc19a6b]
/lib/tls/libc.so.6[0xc19a6b]
/lib/tls/libc.so.6[0xc19a4d]
/lib/tls/libc.so.6[0xc19a6b]
/lib/tls/libc.so.6[0xc19a6b]
/lib/tls/libc.so.6[0xc19a6b]
/lib/tls/libc.so.6[0xc19a4d]
/lib/tls/libc.so.6[0xc19a4d]
/lib/tls/libc.so.6[0xc19a4d]
/lib/tls/libc.so.6[0xc19a6b]
/lib/tls/libc.so.6[0xc19a4d]
/lib/tls/libc.so.6[0xc19a6b]
/lib/tls/libc.so.6(qsort+0x10c)[0xc19ccc]
/opt/scalix/bin/sxmkindex[0x8049793]
/opt/scalix/bin/sxmkindex[0x804a457]
/opt/scalix/bin/sxmkindex[0x804a4fa]
/opt/scalix/bin/sxmkindex[0x804ac65]
/lib/tls/libc.so.6(__libc_start_main+0xe3)[0xc04e33]
/opt/scalix/bin/sxmkindex[0x8049611]


Beyond that, I'll keep an eye on it and see what happens. All the regular processes have run over the weekend and the server seems fine, it's definitely not something tied to a scheduled process, but rather a semi-random issue.

dresdn
Posts: 92
Joined: Wed Apr 05, 2006 5:11 pm

Postby dresdn » Fri Nov 30, 2007 2:44 pm

Out of curiosity, what OS and version are you using? I believe there were some serious LVM bugs that caused this type of error in 2.6.16 (I think) and were fixed shortly after.

I had the same problem of the server crashing on the lvcreate and upgrading the kernel fixed the problem.

-Mike

lsdcaltan

Postby lsdcaltan » Fri Nov 30, 2007 2:46 pm

Running Redhat Enterprise Linux 3 on this particular box.

Kernel 2.6.9-5.ELsmp

dresdn
Posts: 92
Joined: Wed Apr 05, 2006 5:11 pm

Postby dresdn » Fri Nov 30, 2007 3:32 pm

Ah yeah, I thought I saw something here before. Did you happen to see if viewtopic.php?t=888 was relevant?

-Mike

Valerion
Scalix Star
Scalix Star
Posts: 2730
Joined: Thu Feb 26, 2004 7:40 am
Location: Johannesburg, South Africa
Contact:

Postby Valerion » Mon Dec 03, 2007 3:59 am

I would suggest you upgrade to RHEL5 at this point anyway. Scalix has deprecated RHEL3 support completely, I am not sure whether they will continue to compile packages for it, but I doubt it seriously. The newer 2.6 kernels are also better at handling memory than the 2.4 series was, so this may or may not solve your problem.


Return to “Scalix Server”



Who is online

Users browsing this forum: Google [Bot] and 1 guest

cron