
Scalix server slow with High IO/Wait

Posted: Tue Dec 11, 2007 10:38 am
by jskubal
We have been running 10.0.5 for quite some time without any problems. Recently I have seen the IO/wait on the server jump into the 96% to 98% range. This goes on for up to a couple of hours, and then it frees up without any apparent cause. The server runs so slowly while this is happening that our users assume that the server has crashed.

During this time the Local Delivery queue grows. I have shut off the incoming mail relay, and watched the local queue process at the rate of less than 1 message every 5 minutes.

Also, I have tried stopping the Local Delivery service and the omscan server (because I saw it running). Nothing seems to help. I have gotten it down to just the "ual.remote" and "in.imap41d" tasks running, each with a "D" status.

Also, in the "top" command the load average normally runs around 0.02 to 0.05. During one of these incidents the load average spikes to 20 to 30 (in one incident it went to 130).
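As a footnote on the load-average numbers: on Linux, processes stuck in uninterruptible "D" sleep are counted in the load average even though they use no CPU, so heavy I/O wait by itself can push the load into the tens or hundreds. A simple way to watch for this while an incident is in progress (standard tools only, nothing Scalix-specific):

Code:

# Sample every 5 seconds; the "b" column counts processes blocked in
# uninterruptible (D) sleep, and bi/bo show blocks read/written per second
vmstat 5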

I would appreciate any help you can give me; I can't think of anything else to try.

Posted: Tue Dec 11, 2007 4:14 pm
by mikevl
How many users do you have?
How many messages a day (approx) does the Scalix server service?
How is your disk subsystem configured?

Mike

Posted: Tue Dec 11, 2007 4:45 pm
by jskubal
Hi Mike,
We have about 75 active users; normally I see anywhere from 50 to 65 users connected at peak times of the day.

Our mail activity is less than 2,000 emails per day (in and out), excluding spam. Spam accounts for another 8,000 to 10,000 incoming messages each day.

I have a 466 GB ST3500630A disk for /var/opt/, a 373 GB HDS724040KLAT80 disk for /bak/, and a 38 GB ST340014A disk for the operating system (RHEL3).

John

Posted: Tue Dec 11, 2007 5:36 pm
by mikevl
Hi John

It is my guess that your disk subsystem is not optimal for your requirements.
A quick search of ST3500630A shows that to be an ATA 100 interface. Scalix relies on good I/O; it is really Scalix's only hard requirement. It doesn't need much CPU or much memory, but I/O is king. With that many users and that many messages, I would think the minimum specification you should use is SATA2. You could use RAID1 if you want, but RAID1+0 gives the best performance.

SAS drives in the above configurations give the best performance of all but are very expensive. SATA should do fine.
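One quick check worth making on the existing setup, as a sketch only (assuming the Scalix data lives on the IDE drive /dev/hdd mentioned later in the thread): confirm that the drive is actually running with DMA enabled and see which UDMA mode it has negotiated, since an IDE disk that has dropped back to PIO or a low UDMA mode will show exactly this kind of I/O wait.

Code:

# Drive identification; the active UDMA mode is marked with an asterisk
hdparm -i /dev/hdd

# Current DMA setting for the drive (using_dma = 1 means DMA is on)
hdparm -d /dev/hdd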

Mike

Posted: Tue Dec 11, 2007 5:43 pm
by jskubal
Mike,
My last post got me thinking, and I think this might be relevant.

I normally see only a few messages in the local delivery queue at any one time. But when one of these slow-downs happens I have always found a large number of spam emails that appear to have arrived in a short time. Typically I might get between 150 and 800 spam emails in a matter of a few minutes.

When the server is running properly I've watched the spam messages being delivered to the filter mailbox and have seen it process as many as 30 messages per minute. But during a slow-down I might see a single message take up to 5 minutes to deliver.

Could it be that processing a large number of emails for a single mailbox causes the local delivery service to lock up, or possibly to begin to thrash?

Given your comments in your last post: does the disk access requirement increase in a non-linear fashion with higher message volume for a given mailbox?

John

Posted: Tue Dec 11, 2007 5:52 pm
by mikevl
Hi John

The other option is to filter your spam at the gateway. There are many products that do this, both commercial and open source. IPCop is one that others have had success with, but there are many to choose from.

Mike

Posted: Wed Dec 12, 2007 3:49 am
by Valerion
Your problem here is probably disk thrashing. Under high load (lots of incoming emails) each new email needs to be written to disk as it enters the queues for the first time. If the load is very high, you will probably end up with an I/O subsystem that cannot handle both the incoming mail and the disk pressure from the SR/Archiver/LD/Internet mail gateway trying to process all of it. I think it would be worthwhile to do at least some filtering before mail reaches the server, to alleviate the load, or to re-spec your disks to keep up.

You can (if you have more than one core) try to increase the number of workers on a queue, resulting in faster processing of the queue, but that will increase the CPU usage, memory usage and disk I/O all at once. One worker should easily keep up with that kind of load.

Posted: Tue Feb 12, 2008 3:57 pm
by jskubal
I have a filter installed in a qmail server that eliminates the 8,000 to 10,000 daily spam messages. In the last 4 days the total messages processed through the Scalix Local Delivery queue is down to 8,652 (or around 2,200 messages per day).

Using the sar command, I see the following when I run my daily backup.

Code:

04:30:01 AM       CPU     %user     %nice   %system   %iowait     %idle
04:40:01 AM       all      1.26      0.00      0.55      0.04     98.15
04:50:01 AM       all      1.68      0.01      0.90      1.93     95.48
05:00:01 AM       all      1.64      0.02      1.02      0.88     96.43
05:10:04 AM       all      7.21      0.00     13.37     79.42      0.00
05:20:02 AM       all      3.50      0.01     64.77     31.73      0.00
05:30:02 AM       all      4.66      0.00     58.61     36.73      0.00
05:40:02 AM       all      6.22      0.00     73.89     19.88      0.00
05:50:03 AM       all      5.98      0.00     77.48     16.54      0.00
06:00:03 AM       all      8.97      0.01     78.63     12.39      0.00
06:10:02 AM       all      6.23      0.01     83.12     10.64      0.00
06:20:02 AM       all      5.92      0.07     10.64     83.36      0.01


Notice that when the IOwait increases at 5:10, the %system value also increases. This is about what I would expect to see in this situation. I think the IOwait is a little high, but it doesn't seem to be causing a problem.
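Since the overall %iowait figure doesn't say which device is waiting, it may also be worth pulling per-device numbers. A minimal sketch (the sar -A output later in this thread already includes per-device counters such as dev22-64, and iostat ships in the same sysstat package, assuming the installed version supports the -x option):

Code:

# Per-device transfer and sector rates from the data sar has already collected
sar -d

# Extended per-device statistics sampled live, every 5 seconds for 60 samples
iostat -x 5 60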


When I look at the sar output later in the day, when one of the slow-down events happens, I see:

Code:

07:40:01 AM       CPU     %user     %nice   %system   %iowait     %idle
07:50:01 AM       all      1.43      0.01       0.41      0.32     97.83
08:00:01 AM       all      1.18      0.01       0.30      0.10     98.40
08:10:01 AM       all      8.30      0.01       2.44     89.26      0.00
08:20:02 AM       all      2.76      0.02       2.91     94.31      0.00
08:30:01 AM       all      2.56      0.01       2.72     94.71      0.00


Notice how the IOwait jumps, but all other activity remains low. At 8:00 AM the number of messages in the Local Delivery queue was 2. Over the next hour the queue grew to 56 messages. Also, during this time the Total Messages Processed count for the Local Delivery queue did not change. I don't believe this is a problem with incoming message volume; all the system did with the incoming messages over that hour was post them to the queue.

I should mention that 8:00 AM is when most of our users are just arriving. The Remote Client Interface count (using /etc/init.d/scalix status) was 35 at 8:00 AM. No new users could log in during the slow-down, which continued for about an hour.
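As a cross-check on the connection counts during one of these hangs, the established client sessions can also be counted directly with netstat (a rough sketch; port 143 covers the in.imap41d sessions, and the port would need adjusting for other client protocols):

Code:

# Count established IMAP (port 143) connections to this server
netstat -tn | grep ':143 ' | grep ESTABLISHED | wc -l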

Even when the system is running fairly well the iowait is high, but even under these conditions the server can deliver messages as fast as they come in. In fact, I very rarely see a message in the queue waiting for delivery. Here is an example of a good-sized load on the system (75 users, 200 messages per hour):

Code:

02:30:01 PM       CPU     %user     %nice   %system   %iowait     %idle
02:40:01 PM       all      5.11      0.32      5.58     60.17     28.82
02:50:01 PM       all      5.41      0.20      5.72     43.41     45.27
03:00:01 PM       all      4.71      0.18      4.60     27.46     63.05
03:10:01 PM       all      6.15      0.23      5.53     16.82     71.27
03:20:01 PM       all      4.86      0.29      8.07     24.84     61.94

Also, I ran the hdparm -tT /dev/hdd command while there were no users on the system and got the following:

Timing cached reads: 2180 MB in 2.00 seconds = 1089.08 MB/sec
Timing buffered disk reads: 212 MB in 3.01 seconds = 70.35 MB/sec

Under moderate / normal load it looks like this:
Timing cached reads: 2112 MB in 2.00 seconds = 1054.05 MB/sec
Timing buffered disk reads: 204 MB in 3.02 seconds = 67.63 MB/sec

Running the same command during the slow-down gave me this:
Timing cached reads: 1974 MB in 2.00 seconds = 987.16 MB/sec
Timing buffered disk reads: 4 MB in 3.93 seconds = 1.02 MB/sec

At the same time, the "top" command shows a ual.remote task and a couple of in.imap41d tasks (all with a status of "D"). The only thing with an "R" status is the top task itself.

These slow-down events typically last from an hour to an hour-and-a-half. During that time the system seems to seize-up. Nothing gets processed, and then all of a sudden everything is back to normal, and the messages that had queued up over the last hour process through in a matter of seconds.
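One way to see where those "D" tasks are actually stuck is to log their kernel wait channel while an event is in progress. A small sketch using standard ps fields (assuming the stock ps supports the stat, wchan and comm format keywords):

Code:

# Every 10 seconds, list processes in uninterruptible ("D") sleep together
# with the kernel function they are blocked in (WCHAN)
while true; do
    date
    ps -eo stat,pid,wchan,comm | grep '^D'
    sleep 10
done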

I understand the comments about thrashing disk, but I don't think that theory explains this behavior.

Is it at all possible that this is related to the size of our users' mail folders? Looking at the admin screen I see that over 50 of our user mailboxes are under 2 GB. I have 14 users between 2 GB and 5 GB, 4 users between 5 and 8 GB, 1 user at 8.2 GB, 1 user at 12.9 GB, and 1 user at 17.3 GB.

Posted: Wed Feb 13, 2008 3:33 am
by Valerion
Offhand I can't think of anything that would explain your behaviour. You may need to have Scalix engineers take a look to see where the bottlenecks are. The figures you quoted seem wrong to me; you should get more consistent behaviour.

Could it be that your RAM is getting exhausted at times, possibly due to a large load on Tomcat? That would both shrink your disk cache and cause your machine to swap, increasing your disk IO.

I have seen slowdowns when maintenance apps are run (especially omscan), or when the indexer gets VERY busy. The newer Scalix releases do schedule omscan for off-peak times, though.

When it seems to hang, are there any processes in top that seem to take up lots of CPU time?

Posted: Wed Feb 13, 2008 11:57 am
by netcomrade
Your hdparm results are probably not giving you a good idea.
Run some dd tests (for at least 5-10 mins).
We've had performance increase dramatically when we moved to SCSI disks that could do 50 MB/s sustained writes.
However, I do agree that no mail server should generate this much I/O.
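A rough sketch of the kind of sustained dd test meant here (the file name and size are placeholders; point it at a scratch file on the Scalix data disk and run it off-peak, making the file large enough that the page cache doesn't hide the real disk rate):

Code:

# Sustained sequential write: 4 GB in 1 MB blocks to a scratch file on the data disk
time dd if=/dev/zero of=/var/opt/ddtest.tmp bs=1024k count=4096
sync

# Sustained sequential read of the same file
time dd if=/var/opt/ddtest.tmp of=/dev/null bs=1024k
rm -f /var/opt/ddtest.tmp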

-a

Posted: Wed Feb 13, 2008 3:39 pm
by jskubal
I agree, the data seems inconsistent. To try to make sense of this I used the "sar -A" command to compare before, during, and after an event, to see if I could find any correlations.

The other day the slow-down started around 8:50 AM and lasted for almost an hour. Responsiveness of the system was about the same before and after the event, except that after the event there was more load on the system (more users and more messages per minute).
Here is the output from: sar -A

Code:

              proc/s
08:00:01 AM      0.62
09:00:01 AM      0.58
10:00:01 AM      2.35
Average:         0.48
There is a weak inverse correlation with processes created per second, although the system was running slower, so I would expect fewer new processes to be created.

Code:

12:00:01 AM   cswch/s
08:00:01 AM    307.03
09:00:01 AM    442.54
10:00:01 AM    587.26
Average:       406.33
No correlation with context switches per second. This seems to rule out excessive context switching as the cause.

Code:

12:00:01 AM       CPU     %user     %nice   %system   %iowait     %idle
08:00:01 AM       all      3.36      0.04      2.10     12.66     81.84
09:00:01 AM       all      3.37      0.06      3.14     93.42      0.00
10:00:01 AM       all      6.03      0.21      5.51     32.16     56.09
Average:          all      3.09      0.06      4.36     21.87     70.61
Notice that the %user, %nice, and %system don't change much from 8 AM to 9 AM. But while the overall load is heavier at 10 AM than at 9 AM, the performance of the system is better at 10 AM.

Code:

12:00:01 AM       INTR       intr/s
08:00:01 AM       sum   1214.47
09:00:01 AM       sum   1191.98
10:00:01 AM       sum   1315.75
Average:          sum   1154.81
I believe this shows interrupts per second. Again there is a weak inverse correlation, which doesn't make sense unless it is simply because the system was running slower, so fewer interrupts were processed per second.

Code:

12:00:01 AM  pswpin/s pswpout/s
08:00:01 AM      0.02      0.00
09:00:01 AM      0.03      0.00
10:00:01 AM      0.00      0.00
Average:         0.30      0.47
Slightly elevated page swap-ins per second during the event, but well below the average, which again would indicate that page swapping is not the cause.

Code:

12:00:01 AM   frmpg/s   bufpg/s   campg/s
08:00:01 AM      6.31      3.95    -13.75
09:00:01 AM    -12.30     -5.77     18.11
10:00:01 AM      0.16      2.44     -3.08
Average:         -0.94     -1.09      0.75
Notice that during the event the freed-memory-pages-per-second figure was negative (meaning an additional 12.3 pages per second were being allocated during the event). During the event we also released buffer pages and increased cache pages. I'm not sure what this means, but it is the strongest correlation so far.


Code:

12:00:01 AM     IFACE      rxpck/s   txpck/s    rxbyt/s    txbyt/s    rxcmp/s    txcmp/s  rxmcst/s
08:00:01 AM        lo         1.25      1.25     132.10       132.10      0.00      0.00      0.00
08:00:01 AM      eth0        82.03    139.84     5637.85   193314.43      0.00      0.00      0.00
08:00:01 AM      sit0         0.00      0.00       0.00         0.00      0.00      0.00      0.00

09:00:01 AM        lo        11.02     11.02   13834.38     13834.38      0.00      0.00      0.00
09:00:01 AM      eth0        21.72     31.04    5234.00     31502.83       0.00      0.00      0.00
09:00:01 AM      sit0         0.00      0.00       0.00         0.00      0.00      0.00      0.00

10:00:01 AM        lo         7.58      7.58   21201.44     21201.44      0.00      0.00      0.00
10:00:01 AM      eth0       113.38    147.01   45511.81    166212.41      0.00      0.00      0.00
10:00:01 AM      sit0         0.00      0.00       0.00         0.00      0.00      0.00      0.00

Average:           lo         4.20      4.20     6681.51      6681.51      0.00      0.00      0.00
Average:         eth0        37.64     51.42     4596.90     67603.34       0.00      0.00      0.00
Average:         sit0         0.00      0.00        0.00         0.00       0.00      0.00      0.00
There is no correlation with network activity.

Code:

12:00:01 AM       DEV       tps    rd_sec/s    wr_sec/s
08:00:01 AM  dev22-64      86.81    792.27      199.41
09:00:01 AM  dev22-64     141.51   2450.55     443.91
10:00:01 AM  dev22-64     137.23   1664.80     881.61
Average:     dev22-64      70.07   1436.01     274.58
The transfers per second and read sectors per second are high during the event, but the writes are lower, so there is at least some correlation with disk activity. The interesting thing is that all of this read activity results in very little process activity (no local deliveries were taking place, and there was no client responsiveness).


Code:

12:00:01 AM  pgpgin/s   pgpgout/s     fault/s    majflt/s
08:00:01 AM     399.87       266.19     435.14      0.96
09:00:01 AM   1238.93       953.53     464.32      0.09
10:00:01 AM     837.37       551.36   1108.04      0.04
Average:        840.55        330.31    319.63      0.16
Again, a higher number of page-ins corresponds with the event. But the major faults per second are low, indicating that the system is not having to load many memory pages in from disk.

Code:

12:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
08:00:01 AM    20532  1502592    98.65    89716   1013896   4001440     95124      2.32     17428
09:00:01 AM      3400  1519724    99.78    51832    944604   4003456     93108      2.27     15876
10:00:01 AM      4908  1518216    99.68   102692    818640   4003748     92816      2.27     15904
Average:        92411  1430713    93.93   208877    745044   4029299     67265      1.64     10276
I think the only thing I can conclude from this is that the overall memory utilization is high. But I have noticed that when the server is rebooted it starts with about 750 MB free. Over time the free memory drops to the levels we see here; free memory seems to vary between about 3,000 and 25,000 KB. I have not seen any difference in memory utilization during an event.
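For what it is worth, the kbcached column above shows that most of the "used" memory is page cache rather than process memory, and on Linux free memory will normally trend toward zero after a reboot as the cache fills. A quick way to see the split and to confirm whether swapping is happening (standard tools only):

Code:

# The "-/+ buffers/cache" line shows memory use excluding buffers and page cache
free -m

# Ten 5-second samples; the si/so columns show pages swapped in/out per second
vmstat 5 10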

Code:

12:00:01 AM dentunusd   file-sz  inode-sz  super-sz %super-sz  dquot-sz %dquot-sz  rtsig-sz %rtsig-sz
08:00:01 AM      4154     11160      5261         0      0.00         0      0.00         0      0.00
09:00:01 AM      1379     14190      4971         0      0.00         0      0.00         0      0.00
10:00:01 AM      4903     14925      6417         0      0.00         0      0.00         0      0.00
Average:        19341     10458     17254         0      0.00         0      0.00         0      0.00
Here the number of unused directory cache entries drops during the event, but it is not an unreasonable number.

Code:

12:00:01 AM    totsck    tcpsck    udpsck    rawsck   ip-frag
08:00:01 AM       202        74        10         0         0
09:00:01 AM       314       124        10         0         0
10:00:01 AM       358       145        10         0         0
Average:          180        69        10         0         0
Nothing unusual in the number of sockets in use.

Code:

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
08:00:01 AM         1       207      0.42      0.26      0.15
09:00:01 AM         0       262      8.09      9.64     10.63
10:00:01 AM         0       282      0.48      0.77      1.99
Average:            1       197      1.67      1.69      1.66
This is interesting: during the event there are no processes waiting for run time, and the number of processes in the process list is not unreasonable. The load average is one of the key indicators that there is a problem. I have seen the iowait get up into the sixties and seventies and we still get reasonable response times, unless the load average starts to rise above 2 or 3.

So I saw high page-in rates and high disk-read rates, and there are a number of indications that paging is going on, but nothing to indicate that this is related to the problem. Maybe I just don't see the implications of these statistics, but I can't find any meaningful correlation between the data and the timing of the event.

You also mentioned Tomcat memory usage. I have long suspected that it may have something to do with the problem. Right now the java task is using 26.3% of memory, and I have "niced" its priority down to 12. I think the next time we get a slow-down I will try stopping Tomcat to see if that has any effect.
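For reference, a rough sketch of the two experiments described above. The renice value and PID are placeholders, and the scalix-tomcat init script name is an assumption about a standard Scalix install; substitute whatever Tomcat init script is actually in use:

Code:

# Lower the priority of the running java (Tomcat) process; replace <pid> with the PID shown in top
renice 12 -p <pid>

# During a slow-down, stop Tomcat to see whether the I/O wait drops, then restart it
/etc/init.d/scalix-tomcat stop
/etc/init.d/scalix-tomcat start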

Posted: Wed Feb 13, 2008 4:36 pm
by jskubal
Sorry, Valerion, I didn't address a couple of your comments.
Valerion wrote: I have seen slowdowns when maintenance apps are run (especially omscan), or when the indexer gets VERY busy. The newer Scalix releases do schedule omscan for off-peak times, though.


Early on in this problem I saw the omscan server running and thought it might be contributing to the problem. I ran an experiment where I monitored system performance while omscan was running. I found that it consumed most of the available CPU time, but it didn't seem to hurt our client responsiveness. I let the service run until a slow-down event started, then stopped the omscan service, but that had no effect. I then left the service off, and a few hours later we had another slow-down event.

Valerion wrote: When it seems to hang, are there any processes in top that seem to take up lots of CPU time?


During an event, I usually see anywhere from 1 to 4 user-related processes (in.imap41d or ual.remote). Occasionally I will see a kjournald task, an omctmon task, or a mime.browse task. In all, there will be 3 to 5 tasks, each using 0.3% to 1.5% of memory, and the CPU% for each one is always less than 1%. The status column on all of the Scalix tasks is "D". Once in a while a system task like kjournald will have a status of "S". About the only task that is ever running (status = "R") is the top command itself.

Posted: Thu Feb 14, 2008 3:56 am
by Valerion
This is very strange. I would suggest having someone local who is knowledgeable about such things take a look at your machine.

Posted: Sat Feb 16, 2008 1:28 am
by mikevl
Hi

Still... running 75 active users on an IDE hard disk is a big ask!

Is all of your spam being removed by the gateway now, or is some of it still being handled by the Scalix server?

I noticed in one of the sar reports that when iowait was high, CPU idle was 0. Something is certainly working hard.

What is in mailq?


Mike v