Disk I/O question

Marc · Postby **Marc** » Tue May 22, 2007 3:55 pm

We have constant problems with client timeouts and really slow response from SWA. I know our network isn't that slow, so I checked several other possible causes, and I just found this:

Code: Select all

sx1:~ # iostat -x -d 2 | egrep "Device|sdc"
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc         14.45  43.62 50.78 19.65 1057.99  690.54   529.00   345.27    24.83     1.06   18.84   3.18  22.37
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 239.11 161.88 114.85    0.00 3119.31     0.00  1559.65    11.27     0.00    2.87   3.52  97.48
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 364.00 209.50 131.50    0.00 3899.00     0.00  1949.50    11.43     0.00    1.91   2.87  97.75
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 390.45 172.36 115.08    0.00 3889.95     0.00  1944.97    13.53     0.00    2.18   3.40  97.59
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 185.43 106.03 96.48    0.00 2551.26     0.00  1275.63    12.60     0.00    5.59   4.93  99.80
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 359.70 214.43 120.90    0.00 3707.46     0.00  1853.73    11.06     0.00    1.91   2.90  97.11
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 307.46 191.54 89.55    0.00 3140.30     0.00  1570.15    11.17     0.00    3.05   3.46  97.16
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 195.98 148.24 101.01    0.00 2610.55     0.00  1305.28    10.47     0.00    4.36   3.99  99.45
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 394.00 218.50 123.00    0.00 3991.00     0.00  1995.50    11.69     0.00    1.52   2.85  97.20
^XDevice:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 236.00 97.00 112.00    0.00 3144.50     0.00  1572.25    15.05     0.00    5.73   4.70  98.25
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.00 458.21 214.93 108.46    0.00 4196.02     0.00  2098.01    12.98     0.00    2.00   3.00  97.16
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc          0.50 189.00 159.00 76.00    0.00 2367.00     0.00  1183.50    10.07     0.00    6.35   4.21  98.85

sdc used to be an external SCSI-RAID device - the same type we use as a backend for our NFS servers for ~ 50 users each who run compilers on it all day long, so it's not exactly slow... and we get maybe one mail every five seconds on a really busy day, that's not even enough to make a pocket calculator blink.

So I thought the RAID was just broken and swapped it with a simple, single LVD-disk. Still the same picture.

At about 21:30 localtime, what could possibly cause a load of nearly 100 percent on this disk? Why is it writing about 1 MB/s on average when there's really absolutely nothing happening right now?

Can someone clue me in where to look, cause I really don't know what's going on there...

Marc

Shredder · Postby **Shredder** » Tue May 22, 2007 4:25 pm

What OS, Version of Scalix are you running?

Is this a server that has been running for a while or a brand new install? Is it an upgrade?

What tasks are running? ie (top)

We would need more info then you have given.

Thanks,
Shredder

Marc · Postby **Marc** » Tue May 22, 2007 5:00 pm

He Shredder, thanks for your quick reply.

Shredder wrote:What OS, Version of Scalix are you running?

It's SLES 9, Scalix Version 11.0.3.31

Is this a server that has been running for a while or a brand new install? Is it an upgrade?

We tracked it from somewhere around Scalix 9.3. It's in use here for about 1.5 to 2 years.

What tasks are running? ie (top) We would need more info than you have given.

It's a dedicated box. There's absolutely nothing running on it except scalix processes and a lonely sshd for me and two other guys to login.
I'd be happy to provide it, but I don't know what you're looking for. This is a 2 GHZ Xeon box with 4 GB RAM and a dual channel UW controller... what do you need to know?

Marc

Shredder · Postby **Shredder** » Tue May 22, 2007 5:10 pm

Have you recently upgraded to 11? The upgrade to 11 causes Scalix to reindex the messages as it converts them to a new format. This will take time and I/O depending on how much mail is there.

What processes are running with the most CPU, etc?

Shredder

dougp23 · Postby **dougp23** » Tue May 22, 2007 8:49 pm

Do you have any USB things plugged into the server? Especially with SATA drives, which tend to be listed as sdax in the OS, so does a USB thumb-drive, it usually appears as sda1. Just a thought. Have you run a disk check to make sure the disks are healthy? Also, do a top to see what is eating a lot of CPU, it could be that process that is thrashing the disk.

Marc · Postby **Marc** » Wed May 23, 2007 4:17 am

Shredder wrote:Have you recently upgraded to 11?

Upgrade to 11.0.3.3 was about 4 weeks ago, I'd guess.

Shredder wrote:The upgrade to 11 causes Scalix to reindex the messages as it converts them to a new format. This will take time and I/O depending on how much mail is there.

Do you mean sxmkindex? I did that for all users just after the upgrade...

Shredder wrote:What processes are running with the most CPU, etc?

When I open a mailbox, java and an imapd with my uid. But IO goes straight up, often to > 90%. Here are the first few processes, ordered by CPU time:

Code: Select all

 

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+    TIME COMMAND                                     
 8350 root      34  19 33404  23m 4212 S  0.0  0.6 156:25.96 156:25 omscan.server                                
31716 root      24   0  793m 610m  15m S  0.1 16.6  79:54.38  79:54 java                                         
 8345 scalix    18   2  228m 216m 6100 S  0.0  5.9  23:39.15  23:39 local.delivery                               
   87 root      15   0     0    0    0 S  0.0  0.0  23:08.23  23:08 kswapd0                                      
 8232 scalix    15   0 10820 4480 2908 S  0.0  0.1  15:27.66  15:27 omctmon                                      
 8381 mail      16   0  5456 2400 1632 S  0.1  0.1   7:27.84   7:27 sendmail                                     
 8289 root      15   0 16916 6036 4468 S  0.0  0.2   5:33.77   5:33 omsmtpd                                      
 8385 root      15   0 17160 6404 4824 S  0.0  0.2   5:20.84   5:20 omslapdeng                                   
 2454 root       5 -10     0    0    0 S  0.0  0.0   3:56.40   3:56 xfslogd/0                                    
 8283 root      16   0 10020 3648 2808 S  0.0  0.1   3:05.99   3:05 omdbmon                                      
 8343 scalix    18   2 15680 5512 4144 S  0.0  0.1   2:53.29   2:53 service.router                               
 8288 scalix    16   0 11448 5140 2948 S  0.0  0.1   2:42.52   2:42 mime.control                                 
 8229 scalix    15   0 13896 4480 3132 S  0.0  0.1   2:33.44   2:33 omsessd                                      
11277 scalix    17   0 15512 6760 4872 S  0.0  0.2   1:56.99   1:56 mime.browse                                  
11266 scalix    16   0 15356 6616 4836 S  0.0  0.2   1:55.13   1:55 mime.browse                                  
11274 scalix    17   0 15308 6620 4864 S  0.0  0.2   1:54.92   1:54 mime.browse                                  
11276 scalix    17   0 15736 7028 4884 S  0.0  0.2   1:54.28   1:54 mime.browse                                  
11272 scalix    16   0 15452 6756 4864 S  0.0  0.2   1:53.67   1:53 mime.browse                                  
11271 scalix    17   0 15312 6564 4836 S  0.0  0.2   1:53.62   1:53 mime.browse                                  
11265 scalix    17   0 15204 6496 4840 S  0.0  0.2   1:53.21   1:53 mime.browse                                  
11268 scalix    17   0 15288 6568 4864 S  0.0  0.2   1:53.00   1:53 mime.browse                                  
11267 scalix    17   0 15344 6656 4856 S  0.0  0.2   1:52.03   1:52 mime.browse                                  
11269 scalix    16   0 15524 6800 4872 S  0.0  0.2   1:51.73   1:51 mime.browse                                  
 8727 scalix    17   0 15452 6684 4856 S  0.0  0.2   1:51.39   1:51 mime.browse                                  
11270 scalix    17   0 15364 6668 4868 S  0.0  0.2   1:51.18   1:51 mime.browse                                  
 9306 scalix    16   0 15280 6572 4820 S  0.0  0.2   1:50.42   1:50 mime.browse                                  
 9303 scalix    17   0 15360 6596 4868 S  0.0  0.2   1:50.15   1:50 mime.browse                                  
 9304 scalix    17   0 15328 6600 4864 S  0.0  0.2   1:50.04   1:50 mime.browse                                  
 9305 scalix    17   0 15192 6500 4856 S  0.0  0.2   1:49.57   1:49 mime.browse                                  
 8737 scalix    17   0 15260 6516 4840 S  0.0  0.2   1:49.45   1:49 mime.browse

dougp23 wrote:Do you have any USB things plugged into the server?

Nothing except a keyboard. The disk is SCSI, not SATA, and smartctl says it's ok. It was basically the same with /var on an external RAID.

dougp23 wrote:Also, do a top to see what is eating a lot of CPU

I can't see anything suspect. CPU usage is mostly < 5% (probably because it's waiting for the disk). The last scalix restart didn't change anything either.

Marc

dboreham · Postby **dboreham** » Wed May 23, 2007 11:52 am

Sometimes we see omscan running during work hours (and I see it in your process list above). Try turning it off. The high I/O may go with it. (omscan is inspecting every bit in the message store which may be 10's of Gbytes).

jhamill · Postby **jhamill** » Wed May 23, 2007 11:47 pm

dboreham wrote:Sometimes we see omscan running during work hours (and I see it in your process list above). Try turning it off. The high I/O may go with it. (omscan is inspecting every bit in the message store which may be 10's of Gbytes).

we have omscan running a lot during our business day and yes our server seems slow now (11.04).

how do you suggest turning it off rather than just kill?

tenaciousC · Postby **tenaciousC** » Thu May 24, 2007 5:47 am

The scalix-postgres service not working properly can give poor swa performace. It might also be worth checking your /var/opt/scalix/??/postgres/pgstartup.log

C

battusai177 · Postby **battusai177** » Thu May 24, 2007 8:14 am

I had the same problem until I saw this http://bugzilla.scalix.com/show_bug.cgi?id=15242. After changing lucene configuration according to the bug report the load went down. I am getting a couple of a indexer related errors but it's a lot better now.

Marc · Postby **Marc** » Thu May 24, 2007 8:14 am

Thanks for all the suggestions.

dboreham wrote:Try turning it off. The high I/O may go with it.

This seems to help at least when nobody is logging in or copying/deleting mails. Are there any side-effects when it's turned off?

tenaciousC wrote:checking your /var/opt/scalix/??/postgres/pgstartup.log

I just did that, and there are some errors like 'schema "sx_a22a223be21774254e740e3a1aa6123c" does not exist' and about transaction aborts.

I made a series of snapshots with top and iostat while opening a folder with about 4000 mails in it in thunderbird. The only active process shown is an in.imap4d with my uid and disk usage goes up to > 90% for about 30 seconds. And the strange thing is: it shows only disk writes. If 10 or 20 people try that at the same time...

Are these writes supposed to happen when a folder is opened?

Marc

dboreham · Postby **dboreham** » Thu May 24, 2007 9:50 am

Marc wrote:Are these writes supposed to happen when a folder is opened?

These are probably writes to temporary files that are used to hold the
MIME encoded message content. On high usage systems we recommend
that this be mounted on a disk volume with write-back caching, or to a
ramdisk. The path would be : /var/opt/scalix/<instanvce>/s/temp

Marc · Postby **Marc** » Wed May 30, 2007 10:41 pm

dboreham wrote:
Marc wrote:Are these writes supposed to happen when a folder is opened?

These are probably writes to temporary files that are used to hold the
MIME encoded message content. On high usage systems we recommend
that this be mounted on a disk volume with write-back caching, or to a
ramdisk. The path would be : /var/opt/scalix/<instanvce>/s/temp

The disk already had its write-back cache enabled, so I tried moving the temp dir to a ramfs partition. Like that:

Code: Select all

mount -t ramfs none /mimetmp
chown scalix:scalix /mimetmp
chmod 771 /mimetmp

Now I tried setting UAL_SINGLE_TEMP_DIR=/mimetmp in general.cfg, but this didn't seem to have any effect, there was nothing written to it after a restart. Is there anything else that needs to be changed?

Marc

Marc · Postby **Marc** » Mon Jun 04, 2007 12:47 pm

battusai177 wrote:I had the same problem until I saw this http://bugzilla.scalix.com/show_bug.cgi?id=15242. After changing lucene configuration according to the bug report the load went down. I am getting a couple of a indexer related errors but it's a lot better now.

Great. Thanks for the link, this seems to help. I also mounted 512MB of tmpfs directly to /var/opt/scalix/??/s/temp, and now it's used. Re-enabling the omscan service still nearly kills the disk... I'll let it run for some time and see if this is a temporary thing.

Marc

Scalix Forums

Disk I/O question

Disk I/O question

Who is online