Disk I/O question

Discuss the Scalix Server software

Moderators: ScalixSupport, admin

Marc
Posts: 30
Joined: Thu Aug 11, 2005 3:19 am

Disk I/O question

Postby Marc » Tue May 22, 2007 3:55 pm

We have constant problems with client timeouts and really slow response from SWA. I know our network isn't that slow, so I checked several other possible causes, and I just found this:

Code: Select all

sx1:~ # iostat -x -d 2 | egrep "Device|sdc"
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc         14.45  43.62 50.78 19.65 1057.99  690.54   529.00   345.27    24.83     1.06   18.84   3.18  22.37
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 239.11 161.88 114.85    0.00 3119.31     0.00  1559.65    11.27     0.00    2.87   3.52  97.48
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 364.00 209.50 131.50    0.00 3899.00     0.00  1949.50    11.43     0.00    1.91   2.87  97.75
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 390.45 172.36 115.08    0.00 3889.95     0.00  1944.97    13.53     0.00    2.18   3.40  97.59
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 185.43 106.03 96.48    0.00 2551.26     0.00  1275.63    12.60     0.00    5.59   4.93  99.80
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 359.70 214.43 120.90    0.00 3707.46     0.00  1853.73    11.06     0.00    1.91   2.90  97.11
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 307.46 191.54 89.55    0.00 3140.30     0.00  1570.15    11.17     0.00    3.05   3.46  97.16
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 195.98 148.24 101.01    0.00 2610.55     0.00  1305.28    10.47     0.00    4.36   3.99  99.45
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 394.00 218.50 123.00    0.00 3991.00     0.00  1995.50    11.69     0.00    1.52   2.85  97.20
^XDevice:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 236.00 97.00 112.00    0.00 3144.50     0.00  1572.25    15.05     0.00    5.73   4.70  98.25
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 458.21 214.93 108.46    0.00 4196.02     0.00  2098.01    12.98     0.00    2.00   3.00  97.16
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.50 189.00 159.00 76.00    0.00 2367.00     0.00  1183.50    10.07     0.00    6.35   4.21  98.85


sdc used to be an external SCSI-RAID device - the same type we use as a backend for our NFS servers for ~ 50 users each who run compilers on it all day long, so it's not exactly slow... and we get maybe one mail every five seconds on a really busy day, that's not even enough to make a pocket calculator blink.

So I thought the RAID was just broken and swapped it with a simple, single LVD-disk. Still the same picture.

At about 21:30 localtime, what could possibly cause a load of nearly 100 percent on this disk? Why is it writing about 1 MB/s on average when there's really absolutely nothing happening right now?

Can someone clue me in where to look, cause I really don't know what's going on there...


Marc

Shredder

Postby Shredder » Tue May 22, 2007 4:25 pm

What OS, Version of Scalix are you running?

Is this a server that has been running for a while or a brand new install? Is it an upgrade?

What tasks are running? ie (top)

We would need more info then you have given.

Thanks,
Shredder

Marc
Posts: 30
Joined: Thu Aug 11, 2005 3:19 am

Postby Marc » Tue May 22, 2007 5:00 pm

He Shredder, thanks for your quick reply.

Shredder wrote:What OS, Version of Scalix are you running?


It's SLES 9, Scalix Version 11.0.3.31

Is this a server that has been running for a while or a brand new install? Is it an upgrade?


We tracked it from somewhere around Scalix 9.3. It's in use here for about 1.5 to 2 years.

What tasks are running? ie (top) We would need more info than you have given.


It's a dedicated box. There's absolutely nothing running on it except scalix processes and a lonely sshd for me and two other guys to login.
I'd be happy to provide it, but I don't know what you're looking for. This is a 2 GHZ Xeon box with 4 GB RAM and a dual channel UW controller... what do you need to know?

Marc

Shredder

Postby Shredder » Tue May 22, 2007 5:10 pm

Have you recently upgraded to 11? The upgrade to 11 causes Scalix to reindex the messages as it converts them to a new format. This will take time and I/O depending on how much mail is there.

What processes are running with the most CPU, etc?

Shredder

dougp23
Posts: 229
Joined: Thu Feb 15, 2007 2:42 pm

Postby dougp23 » Tue May 22, 2007 8:49 pm

Do you have any USB things plugged into the server? Especially with SATA drives, which tend to be listed as sdax in the OS, so does a USB thumb-drive, it usually appears as sda1. Just a thought. Have you run a disk check to make sure the disks are healthy? Also, do a top to see what is eating a lot of CPU, it could be that process that is thrashing the disk.

Marc
Posts: 30
Joined: Thu Aug 11, 2005 3:19 am

Postby Marc » Wed May 23, 2007 4:17 am

Shredder wrote:Have you recently upgraded to 11?


Upgrade to 11.0.3.3 was about 4 weeks ago, I'd guess.

Shredder wrote:The upgrade to 11 causes Scalix to reindex the messages as it converts them to a new format. This will take time and I/O depending on how much mail is there.


Do you mean sxmkindex? I did that for all users just after the upgrade...

Shredder wrote:What processes are running with the most CPU, etc?


When I open a mailbox, java and an imapd with my uid. But IO goes straight up, often to > 90%. Here are the first few processes, ordered by CPU time:

Code: Select all

 

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+    TIME COMMAND                                     
 8350 root      34  19 33404  23m 4212 S  0.0  0.6 156:25.96 156:25 omscan.server                               
31716 root      24   0  793m 610m  15m S  0.1 16.6  79:54.38  79:54 java                                         
 8345 scalix    18   2  228m 216m 6100 S  0.0  5.9  23:39.15  23:39 local.delivery                               
   87 root      15   0     0    0    0 S  0.0  0.0  23:08.23  23:08 kswapd0                                     
 8232 scalix    15   0 10820 4480 2908 S  0.0  0.1  15:27.66  15:27 omctmon                                     
 8381 mail      16   0  5456 2400 1632 S  0.1  0.1   7:27.84   7:27 sendmail                                     
 8289 root      15   0 16916 6036 4468 S  0.0  0.2   5:33.77   5:33 omsmtpd                                     
 8385 root      15   0 17160 6404 4824 S  0.0  0.2   5:20.84   5:20 omslapdeng                                   
 2454 root       5 -10     0    0    0 S  0.0  0.0   3:56.40   3:56 xfslogd/0                                   
 8283 root      16   0 10020 3648 2808 S  0.0  0.1   3:05.99   3:05 omdbmon                                     
 8343 scalix    18   2 15680 5512 4144 S  0.0  0.1   2:53.29   2:53 service.router                               
 8288 scalix    16   0 11448 5140 2948 S  0.0  0.1   2:42.52   2:42 mime.control                                 
 8229 scalix    15   0 13896 4480 3132 S  0.0  0.1   2:33.44   2:33 omsessd                                     
11277 scalix    17   0 15512 6760 4872 S  0.0  0.2   1:56.99   1:56 mime.browse                                 
11266 scalix    16   0 15356 6616 4836 S  0.0  0.2   1:55.13   1:55 mime.browse                                 
11274 scalix    17   0 15308 6620 4864 S  0.0  0.2   1:54.92   1:54 mime.browse                                 
11276 scalix    17   0 15736 7028 4884 S  0.0  0.2   1:54.28   1:54 mime.browse                                 
11272 scalix    16   0 15452 6756 4864 S  0.0  0.2   1:53.67   1:53 mime.browse                                 
11271 scalix    17   0 15312 6564 4836 S  0.0  0.2   1:53.62   1:53 mime.browse                                 
11265 scalix    17   0 15204 6496 4840 S  0.0  0.2   1:53.21   1:53 mime.browse                                 
11268 scalix    17   0 15288 6568 4864 S  0.0  0.2   1:53.00   1:53 mime.browse                                 
11267 scalix    17   0 15344 6656 4856 S  0.0  0.2   1:52.03   1:52 mime.browse                                 
11269 scalix    16   0 15524 6800 4872 S  0.0  0.2   1:51.73   1:51 mime.browse                                 
 8727 scalix    17   0 15452 6684 4856 S  0.0  0.2   1:51.39   1:51 mime.browse                                 
11270 scalix    17   0 15364 6668 4868 S  0.0  0.2   1:51.18   1:51 mime.browse                                 
 9306 scalix    16   0 15280 6572 4820 S  0.0  0.2   1:50.42   1:50 mime.browse                                 
 9303 scalix    17   0 15360 6596 4868 S  0.0  0.2   1:50.15   1:50 mime.browse                                 
 9304 scalix    17   0 15328 6600 4864 S  0.0  0.2   1:50.04   1:50 mime.browse                                 
 9305 scalix    17   0 15192 6500 4856 S  0.0  0.2   1:49.57   1:49 mime.browse                                 
 8737 scalix    17   0 15260 6516 4840 S  0.0  0.2   1:49.45   1:49 mime.browse     



dougp23 wrote:Do you have any USB things plugged into the server?


Nothing except a keyboard. The disk is SCSI, not SATA, and smartctl says it's ok. It was basically the same with /var on an external RAID.

dougp23 wrote:Also, do a top to see what is eating a lot of CPU


I can't see anything suspect. CPU usage is mostly < 5% (probably because it's waiting for the disk). The last scalix restart didn't change anything either.


Marc

dboreham
Scalix
Scalix
Posts: 36
Joined: Mon Apr 09, 2007 11:38 am

Postby dboreham » Wed May 23, 2007 11:52 am

Sometimes we see omscan running during work hours (and I see it in your process list above). Try turning it off. The high I/O may go with it. (omscan is inspecting every bit in the message store which may be 10's of Gbytes).

jhamill
Posts: 66
Joined: Thu Dec 01, 2005 5:25 pm

Postby jhamill » Wed May 23, 2007 11:47 pm

dboreham wrote:Sometimes we see omscan running during work hours (and I see it in your process list above). Try turning it off. The high I/O may go with it. (omscan is inspecting every bit in the message store which may be 10's of Gbytes).


we have omscan running a lot during our business day and yes our server seems slow now (11.04).

how do you suggest turning it off rather than just kill?

tenaciousC
Posts: 89
Joined: Thu Mar 30, 2006 5:41 pm
Location: Manchester, UK.

Postby tenaciousC » Thu May 24, 2007 5:47 am

The scalix-postgres service not working properly can give poor swa performace. It might also be worth checking your /var/opt/scalix/??/postgres/pgstartup.log

C

battusai177

Postby battusai177 » Thu May 24, 2007 8:14 am

I had the same problem until I saw this http://bugzilla.scalix.com/show_bug.cgi?id=15242. After changing lucene configuration according to the bug report the load went down. I am getting a couple of a indexer related errors but it's a lot better now.

Marc
Posts: 30
Joined: Thu Aug 11, 2005 3:19 am

Postby Marc » Thu May 24, 2007 8:14 am

Thanks for all the suggestions.

dboreham wrote:Try turning it off. The high I/O may go with it.


This seems to help at least when nobody is logging in or copying/deleting mails. Are there any side-effects when it's turned off?

tenaciousC wrote:checking your /var/opt/scalix/??/postgres/pgstartup.log


I just did that, and there are some errors like 'schema "sx_a22a223be21774254e740e3a1aa6123c" does not exist' and about transaction aborts.

I made a series of snapshots with top and iostat while opening a folder with about 4000 mails in it in thunderbird. The only active process shown is an in.imap4d with my uid and disk usage goes up to > 90% for about 30 seconds. And the strange thing is: it shows only disk writes. If 10 or 20 people try that at the same time...

Are these writes supposed to happen when a folder is opened?


Marc

dboreham
Scalix
Scalix
Posts: 36
Joined: Mon Apr 09, 2007 11:38 am

Postby dboreham » Thu May 24, 2007 9:50 am

Marc wrote:Are these writes supposed to happen when a folder is opened?

These are probably writes to temporary files that are used to hold the
MIME encoded message content. On high usage systems we recommend
that this be mounted on a disk volume with write-back caching, or to a
ramdisk. The path would be : /var/opt/scalix/<instanvce>/s/temp

Marc
Posts: 30
Joined: Thu Aug 11, 2005 3:19 am

Postby Marc » Wed May 30, 2007 10:41 pm

dboreham wrote:
Marc wrote:Are these writes supposed to happen when a folder is opened?

These are probably writes to temporary files that are used to hold the
MIME encoded message content. On high usage systems we recommend
that this be mounted on a disk volume with write-back caching, or to a
ramdisk. The path would be : /var/opt/scalix/<instanvce>/s/temp


The disk already had its write-back cache enabled, so I tried moving the temp dir to a ramfs partition. Like that:

Code: Select all

mount -t ramfs none /mimetmp
chown scalix:scalix /mimetmp
chmod 771 /mimetmp


Now I tried setting UAL_SINGLE_TEMP_DIR=/mimetmp in general.cfg, but this didn't seem to have any effect, there was nothing written to it after a restart. Is there anything else that needs to be changed?


Marc

Marc
Posts: 30
Joined: Thu Aug 11, 2005 3:19 am

Postby Marc » Mon Jun 04, 2007 12:47 pm

battusai177 wrote:I had the same problem until I saw this http://bugzilla.scalix.com/show_bug.cgi?id=15242. After changing lucene configuration according to the bug report the load went down. I am getting a couple of a indexer related errors but it's a lot better now.


Great. Thanks for the link, this seems to help. I also mounted 512MB of tmpfs directly to /var/opt/scalix/??/s/temp, and now it's used. Re-enabling the omscan service still nearly kills the disk... I'll let it run for some time and see if this is a temporary thing.


Marc


Return to “Scalix Server”



Who is online

Users browsing this forum: No registered users and 1 guest

cron