Help with recovering from failed drive (pretty PLEASE)
Posted: Tue Feb 28, 2006 3:01 am
by pabloa
Our /var drive started to fail. We've gotten a new one but there were some read errors while transferring the data from the bad drive to the new drive. It seems to almost come up OK except for the Queue manager. It complains about a queue file:
SERIOUS ERROR Queue Manager (Queue Manager ) 02.27.06 00:58:50
[OM 28704] Unable to read a pool file header - to short.
File Name: ~/msgpool/39QP
Pid of logging process: 8467
This is Scalix 9.4.2 community. Other than some binary stuff in log.0 (related to 39QP, it seems), I don't see any other bad output. 12 of the 14 services that usually start do come up.
Can anyone provide any guidance on how to resolve this?
Any help is *greatly* appreciated.
Posted: Tue Feb 28, 2006 7:59 am
by ScalixSupport
That's an ugly one. It pays to have backups ;-)
Try this:
1 - Move ~/msgpool/* to /tmp/msgpool and restart. If we are lucky, it will be recreated and you are good to go. If not, read on.
2 - Shut down Scalix and rename /var/opt/scalix to /var/opt/scalixorg.
3 - Install a new message store through the installer.
4 - Rename the new install's /var/opt/scalix to /var/opt/scalixnew.
5 - Copy /var/opt/scalixnew/msgpool to /var/opt/msgpool.
6 - Start the server.
Let me know how that works. If it fails, copy /tmp/msgpool files back into /var/opt/msgpool
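Step 1 and the rollback are the same operation in opposite directions, so they can be sketched as one small Python helper. This is only an illustration, not a Scalix tool: move_contents is a made-up name, both directory arguments are placeholders for the real msgpool and backup locations, and it assumes Scalix has already been stopped.

```python
import shutil
from pathlib import Path


def move_contents(src, dst):
    """Move every entry in src into dst and return the moved names, sorted.

    Hypothetical helper: src and dst are placeholders for the msgpool
    directory and the backup directory. Nothing is ever overwritten.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    # Take a full listing first so the directory is not modified mid-iteration.
    for entry in sorted(src.iterdir()):
        target = dst / entry.name
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(entry), str(target))
        moved.append(entry.name)
    return moved
```

Calling move_contents(pool_dir, backup_dir) sets the queue files aside; calling it again with the arguments swapped puts them back if the restart does not recreate the pool.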
Cheers,
Sascha.
Posted: Tue Feb 28, 2006 10:39 am
by pabloa
It recreated the queues properly, thank you, and it started running great. Just to give you some reference, we lost about 24 files (out of the ~52k in /var/opt/scalix) and about 50 MB of data (out of 2.5 GB).
The problem is the fun of finding the needles in the haystacks. We already found one (we think) and it caused the whole system to hang.
1) Is there a better way to find these other than stepping on landmines?
2) We have the filenames of the files that were readable. Will that help us in figuring out what to recreate?
3) What can we do when Scalix becomes unresponsive? To elaborate a little: omstat -s hangs, but omstat -a runs normally. In strace, I see omstat -s waiting on a msgrcv() call.
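On question 2, a list of the files that did copy cleanly can at least narrow the search: anything on the new drive that is not on that list is a candidate for damage. A minimal Python sketch of that comparison, assuming the list holds paths relative to the directory being walked; files_not_in_list is a hypothetical helper name, and the sketch says nothing about how Scalix itself would repair the files it finds.

```python
import os


def files_not_in_list(root, known_good):
    """Return relative paths under root that are absent from known_good.

    known_good is assumed to be an iterable of paths relative to root,
    for example the names logged as readable during the copy.
    """
    known = {os.path.normpath(p) for p in known_good}
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if os.path.normpath(rel) not in known:
                suspects.append(rel)
    return sorted(suspects)
```

With roughly 24 bad files out of ~52k, the result should be a short list to inspect by hand rather than something to act on automatically.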
Your help is appreciated.
Posted: Sun Apr 30, 2006 11:33 pm
by pete
I am seeing omstat -s hang too. Also, there are many unix.in processes. This is after a restore from backup. The system was running happily for a while. RHEL 4. Scalix 10.0 (original, not the update).
[root@mta3 sys]# omstat -s
Service Router Started 01:20:25 1
Local Delivery Started 01:20:25 0
Internet Mail Gateway Started 01:20:25 0
(hangs up here)
Lots of errors like this in the log:
ERROR Service Router(Incoming Trans) 05.01.06 01:21:25
[OM 1001] Transaction File record size is out of bounds
-> tf_GetINT32
<- tf_GetINT32
<- tf_ReadRecord 30500 103
-> tf_GetINT32
<- tf_GetINT32
-> tf_ReadRecord
-> tf_GetINT32
<- tf_GetINT32
<- tf_ReadRecord 30500 103
-> tf_GetINT32
<- tf_GetINT32
-> tf_ReadRecord
<- /build/10.0.0.175/src/lib/tf/tf_ReadRec.c:80[3,1001]
<- /build/10.0.0.175/src/lib/tf/tf_ReadRec.c:88[3,1001]
<- /build/10.0.0.175/src/bin/xp/xp_in.c:1080[3,1001]
<- /build/10.0.0.175/src/bin/xp/xp_in.c:1502[3,1001]
Here is the end of strace omstat -s
write(1, "Internet Mail Gateway ", 30Internet Mail Gateway ) = 30
write(1, "Started ", 15Started ) = 15
write(1, "01:20:25 ", 1501:20:25 ) = 15
write(1, "0 ", 100 ) = 10
write(1, "\n", 1
) = 1
lseek(3, 4610, SEEK_SET) = 4610
read(3, "\0\0", 2) = 2
lseek(3, 5122, SEEK_SET) = 5122
read(3, "\0\0", 2) = 2
lseek(3, 5634, SEEK_SET) = 5634
read(3, "\17\0", 2) = 2
lseek(3, 5634, SEEK_SET) = 5634
read(3, "\17\0", 2) = 2
lseek(3, 5634, SEEK_SET) = 5634
read(3, "\17\0", 2) = 2
socket(PF_FILE, SOCK_STREAM, 0) = 4
connect(4, {sa_family=AF_FILE, path="/var/opt/scalix/temp/lic"}, 110) = 0
time(NULL) = 1146454260
write(4, "oXxTHWbLZKtPoVHGkO1+mhxxO9xyC2hj"..., 56) = 56
write(4, "\n", 1) = 1
read(4,
Any clues?
P
Posted: Mon May 01, 2006 10:45 am
by pabloa
Pete:
This is just a guess, but /var/opt/scalix/temp/lic is the socket that omlicmon uses (your strace shows omstat connecting to it and then blocking on the read). Make sure omlicmon is running. I've seen some other threads discussing omlicmon not running when expected.
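One way to check for the process without relying on any particular ps options is to scan /proc directly. A rough Python sketch, assuming a Linux /proc layout; pids_running is a made-up helper name, and matching on the basename of argv[0] is an assumption about how the daemon shows up in the process table.

```python
import os


def pids_running(name, proc_root="/proc"):
    """Return sorted PIDs whose argv[0] basename equals name (Linux /proc)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                # Arguments are NUL-separated; the first one is argv[0].
                argv0 = fh.read().split(b"\0", 1)[0]
        except OSError:
            continue  # the process exited while we were scanning
        if os.path.basename(argv0.decode(errors="replace")) == name:
            pids.append(int(entry))
    return sorted(pids)
```

An empty result from pids_running("omlicmon") would suggest the license monitor is not up, which would fit omstat -s blocking while it waits for a reply.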
Hope that helps.
Posted: Mon May 01, 2006 11:59 am
by pete
Thanks for the pointer. That helps, but now it is locking up a bit later in omstat -s, after the Print Service:
read(4, write(1, "Internet Mail Gateway ", 30Internet Mail Gateway ) = 30
write(1, "Started ", 15Started ) = 15
write(1, "16:50:01 ", 1516:50:01 ) = 15
write(1, "0 ", 100 ) = 10
write(1, "\n", 1
) = 1
lseek(3, 4610, SEEK_SET) = 4610
read(3, "\0\0", 2) = 2
lseek(3, 5122, SEEK_SET) = 5122
read(3, "\0\0", 2) = 2
lseek(3, 5634, SEEK_SET) = 5634
read(3, "\17\0", 2) = 2
lseek(3, 5634, SEEK_SET) = 5634
read(3, "\17\0", 2) = 2
lseek(3, 5634, SEEK_SET) = 5634
read(3, "\17\0", 2) = 2
socket(PF_FILE, SOCK_STREAM, 0) = 4
connect(4, {sa_family=AF_FILE, path="/var/opt/scalix/temp/lic"}, 110) = 0
time(NULL) = 1146499000
write(4, "uac3yTNgIARK34s+vU2THaQKoEklDMJB"..., 56) = 56
write(4, "\n", 1) = 1
read(4,
And the xport.in processes cannot be killed.
Update: there are now 91 xport.in processes that seem to be doing nothing. Help, please!
Posted: Mon May 01, 2006 3:09 pm
by pete
More info -
It appears that the xport.in process is hanging on an fdatasync() call. Does this help?
old_mmap(0xbc2000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x8000) = 0xbc2000
close(5) = 0
munmap(0xb7dd8000, 35355) = 0
open("/etc/passwd", O_RDONLY) = 5
fcntl64(5, F_GETFD) = 0
fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
fstat64(5, {st_mode=S_IFREG|0644, st_size=1705, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7de0000
read(5, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1705
close(5) = 0
munmap(0xb7de0000, 4096) = 0
fstat64(4, {st_mode=S_IFREG|0660, st_size=0, ...}) = 0
fchown32(4, 100, 101) = 0
fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR)
fstat64(4, {st_mode=S_IFREG|0660, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7de0000
_llseek(4, 0, [0], SEEK_CUR) = 0
read(3, "N:6-A=&EO;G,N8V]M/@T*5&\\Z(\").97A"..., 4096) = 4096
write(4, "\0\274aN\0\0\0\1\0\0\0e\0\0\4L\0\0\0\0\0\0\0(\0\3\0\0\0"..., 3920) = 3920
fdatasync(4
And it never returns.
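A process blocked inside fdatasync() on storage that has stopped responding normally sits in uninterruptible sleep (state D) and ignores signals, SIGKILL included, until the I/O completes, which would fit the xport.in processes that cannot be killed. A small Python sketch for listing such processes from /proc, assuming the usual Linux /proc/<pid>/stat layout; uninterruptible_pids is a hypothetical helper name.

```python
import os


def uninterruptible_pids(proc_root="/proc"):
    """Return sorted (pid, command) pairs for processes in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Layout is "pid (comm) state ..."; comm may itself contain
        # spaces or parentheses, so split on the LAST closing parenthesis.
        open_paren = stat.find("(")
        close_paren = stat.rfind(")")
        if open_paren == -1 or close_paren == -1:
            continue
        fields = stat[close_paren + 1:].split()
        if fields and fields[0] == "D":
            stuck.append((int(entry), stat[open_paren + 1:close_paren]))
    return sorted(stuck)
```

A long list of xport.in entries that stays in state D over repeated runs points at the block layer or filesystem underneath rather than at Scalix itself.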
Solved
Posted: Mon May 01, 2006 7:21 pm
by pete
Brain-dead RHEL 4 LVM snapshots were the cause of the problem. Removed the snapshots and it all works beautifully. Grrrrrrrrr
Posted: Tue May 02, 2006 7:05 am
by ScalixSupport
Good to hear you got it back working. Please also see this thread:
http://www.scalix.com/community/viewtopic.php?t=888
Cheers,
Sascha.