I agree, the data seems inconsistent. To try to make sense of it I used the "sar -A" command to compare before, during, and after an event to see if I could find any correlations.
The other day the slowdown started around 8:50 AM and lasted for almost an hour. Responsiveness of the system was about the same before and after the event, except that after the event there was more load on the system (more users and more messages per minute).
Here is the output from sar -A:
Code: Select all
proc/s
08:00:01 AM 0.62
09:00:01 AM 0.58
10:00:01 AM 2.35
Average: 0.48
There is a weak inverse correlation with processes created per second, although since the system was running slower I would expect fewer new processes to be created anyway.
Code: Select all
12:00:01 AM cswch/s
08:00:01 AM 307.03
09:00:01 AM 442.54
10:00:01 AM 587.26
Average: 406.33
No clear correlation with context switches per second. (Note that cswch/s measures process context switches, i.e. scheduler activity, not memory swapping; swap activity shows up in pswpin/s and pswpout/s below.) So excessive scheduling churn does not look like the cause.
Code: Select all
12:00:01 AM CPU %user %nice %system %iowait %idle
08:00:01 AM all 3.36 0.04 2.10 12.66 81.84
09:00:01 AM all 3.37 0.06 3.14 93.42 0.00
10:00:01 AM all 6.03 0.21 5.51 32.16 56.09
Average: all 3.09 0.06 4.36 21.87 70.61
Notice that %user, %nice, and %system don't change much from 8 AM to 9 AM, but %iowait jumps from about 13% to over 93% (and %idle drops to zero) during the event. Yet while the overall load is heavier at 10 AM than at 9 AM, the performance of the system is better at 10 AM.
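For reference, the %iowait column comes from the CPU time counters the kernel keeps in /proc/stat. The minimal Python sketch below (my own illustration, not sar output) samples those counters twice and prints the same breakdown, which is handy for watching iowait live while an event is in progress:
Code: Select all
#!/usr/bin/env python
# Minimal sketch: sample /proc/stat twice and report the CPU time breakdown,
# the same counters sar -u summarises (user, nice, system, idle, iowait).
import time

def read_cpu():
    with open("/proc/stat") as f:
        fields = f.readline().split()          # aggregate "cpu" line
    # order: user nice system idle iowait irq softirq ...
    return [int(v) for v in fields[1:8]]

a = read_cpu()
time.sleep(5)                                  # sampling interval in seconds
b = read_cpu()
delta = [y - x for x, y in zip(a, b)]
total = float(sum(delta))
names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]
for name, d in zip(names, delta):
    print("%-8s %6.2f%%" % (name, 100.0 * d / total))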
Code: Select all
12:00:01 AM INTR intr/s
08:00:01 AM sum 1214.47
09:00:01 AM sum 1191.98
10:00:01 AM sum 1315.75
Average: sum 1154.81
I believe this shows interrupts per second. Again there is a weak inverse correlation, which doesn't make much sense unless it is simply a side effect of the system running slower and therefore servicing fewer interrupts per second.
Code: Select all
12:00:01 AM pswpin/s pswpout/s
08:00:01 AM 0.02 0.00
09:00:01 AM 0.03 0.00
10:00:01 AM 0.00 0.00
Average: 0.30 0.47
Swap-ins per second are slightly elevated during the event, but still well below the average, which again would indicate that swapping is not the cause.
Code: Select all
12:00:01 AM frmpg/s bufpg/s campg/s
08:00:01 AM 6.31 3.95 -13.75
09:00:01 AM -12.30 -5.77 18.11
10:00:01 AM 0.16 2.44 -3.08
Average: -0.94 -1.09 0.75
Notice that during the event frmpg/s went negative, meaning an additional 12.3 pages per second were being consumed rather than freed. At the same time buffer pages were released (bufpg/s negative) and cache pages grew (campg/s positive). I'm not sure what this means, but it is the strongest correlation so far.
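To put the magnitude in perspective, those counters are in memory pages, so assuming the usual 4 KB page size the -12.3 frmpg/s works out to only about 50 KB of free memory consumed per second. A quick sketch of the arithmetic (the page size is an assumption; getconf PAGESIZE will confirm it on the actual box):
Code: Select all
# Quick arithmetic: convert sar page-per-second rates into KB/s,
# assuming the common 4 KB page size (confirm with getconf PAGESIZE).
PAGE_KB = 4

rates = {                 # values taken from the 09:00 sample above
    "frmpg/s": -12.30,    # free list shrinking
    "bufpg/s":  -5.77,    # buffer pages released
    "campg/s":  18.11,    # cache pages growing
}
for name, pages_per_sec in sorted(rates):
    pass  # placeholder removed below

for name in ("frmpg/s", "bufpg/s", "campg/s"):
    pages_per_sec = rates[name]
    print("%-8s %8.2f pages/s = %8.1f KB/s" %
          (name, pages_per_sec, pages_per_sec * PAGE_KB))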
Code: Select all
12:00:01 AM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
08:00:01 AM lo 1.25 1.25 132.10 132.10 0.00 0.00 0.00
08:00:01 AM eth0 82.03 139.84 5637.85 193314.43 0.00 0.00 0.00
08:00:01 AM sit0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:00:01 AM lo 11.02 11.02 13834.38 13834.38 0.00 0.00 0.00
09:00:01 AM eth0 21.72 31.04 5234.00 31502.83 0.00 0.00 0.00
09:00:01 AM sit0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:00:01 AM lo 7.58 7.58 21201.44 21201.44 0.00 0.00 0.00
10:00:01 AM eth0 113.38 147.01 45511.81 166212.41 0.00 0.00 0.00
10:00:01 AM sit0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: lo 4.20 4.20 6681.51 6681.51 0.00 0.00 0.00
Average: eth0 37.64 51.42 4596.90 67603.34 0.00 0.00 0.00
Average: sit0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
There is no correlation with network activity.
Code: Select all
12:00:01 AM DEV tps rd_sec/s wr_sec/s
08:00:01 AM dev22-64 86.81 792.27 199.41
09:00:01 AM dev22-64 141.51 2450.55 443.91
10:00:01 AM dev22-64 137.23 1664.80 881.61
Average: dev22-64 70.07 1436.01 274.58
The transfers per second and read sectors per second are high during the event, but writes are lower than afterwards. So there is at least some correlation with disk activity. The interesting thing is that all of this read activity produced very little useful work: no local deliveries were taking place, and client access was unresponsive.
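For scale, rd_sec/s and wr_sec/s are in 512-byte sectors, so the 2450 rd_sec/s during the event is only about 1.2 MB/s of reads. Here is a quick sketch of that conversion using the dev22-64 numbers above:
Code: Select all
# Convert sar's rd_sec/s / wr_sec/s (512-byte sectors) into KB/s and MB/s
# to see how much actual throughput the event represents.
SECTOR_BYTES = 512

samples = [                       # (time, rd_sec/s, wr_sec/s) from dev22-64
    ("08:00",  792.27, 199.41),
    ("09:00", 2450.55, 443.91),   # during the event
    ("10:00", 1664.80, 881.61),
]
for label, rd, wr in samples:
    rd_kb = rd * SECTOR_BYTES / 1024.0
    wr_kb = wr * SECTOR_BYTES / 1024.0
    print("%s  read %7.1f KB/s (%.2f MB/s)   write %7.1f KB/s" %
          (label, rd_kb, rd_kb / 1024.0, wr_kb))
Dividing rd_sec/s by tps for the 9 AM row also gives an average of roughly 17 sectors (under 10 KB) per transfer, which points at lots of small scattered reads rather than large sequential ones.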
Code: Select all
12:00:01 AM pgpgin/s pgpgout/s fault/s majflt/s
08:00:01 AM 399.87 266.19 435.14 0.96
09:00:01 AM 1238.93 953.53 464.32 0.09
10:00:01 AM 837.37 551.36 1108.04 0.04
Average: 840.55 330.31 319.63 0.16
Again, a higher number of page-ins corresponds with the event, but the major faults per second are low, indicating that processes are not having to fault many pages in from disk; the pgpgin activity looks more like ordinary file I/O than demand paging.
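If it helps to watch these rates between sar intervals, the paging and swapping counters behind these columns (pgpgin, pgpgout, pgfault, pgmajfault, plus the pswpin/pswpout figures from earlier) are exposed in /proc/vmstat on 2.6 kernels. A minimal sketch that samples them and prints per-second rates:
Code: Select all
#!/usr/bin/env python
# Minimal sketch: sample /proc/vmstat twice and print per-second rates for
# the paging/swapping counters that sar reports.
import time

WATCH = ["pgpgin", "pgpgout", "pgfault", "pgmajfault", "pswpin", "pswpout"]

def read_vmstat():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            name, value = line.split()
            counters[name] = int(value)
    return counters

interval = 10.0                       # seconds between samples
a = read_vmstat()
time.sleep(interval)
b = read_vmstat()
for name in WATCH:
    if name in a and name in b:
        print("%-12s %10.2f /s" % (name, (b[name] - a[name]) / interval))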
Code: Select all
12:00:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
08:00:01 AM 20532 1502592 98.65 89716 1013896 4001440 95124 2.32 17428
09:00:01 AM 3400 1519724 99.78 51832 944604 4003456 93108 2.27 15876
10:00:01 AM 4908 1518216 99.68 102692 818640 4003748 92816 2.27 15904
Average: 92411 1430713 93.93 208877 745044 4029299 67265 1.64 10276
I think the only thing I can conclude from this is that overall memory utilization is high. But I have noticed that when the server is rebooted it starts with about 750 MB free, and over time the free memory drops to the levels we see here; kbmemfree seems to vary between roughly 3,000 and 25,000 kB. I have not seen any difference in memory utilization during an event.
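It is worth remembering that most of that "used" memory is the kernel's buffer and page cache (the kbbuffers and kbcached columns), which Linux deliberately grows to fill otherwise idle RAM, so a low kbmemfree by itself is not alarming. A small sketch that reads /proc/meminfo and shows truly free memory versus reclaimable cache (standard field names assumed):
Code: Select all
# Small sketch: report free vs. cache memory from /proc/meminfo, to show how
# much of the "used" memory is really reclaimable buffer/page cache.
def meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            parts = line.split()
            values[parts[0].rstrip(":")] = int(parts[1])   # these fields are in kB
    return values

m = meminfo()
free  = m["MemFree"]
cache = m.get("Buffers", 0) + m.get("Cached", 0)
total = m["MemTotal"]
print("total      %8d kB" % total)
print("free       %8d kB" % free)
print("buf+cache  %8d kB" % cache)
print("free + reclaimable cache: %d kB (%.1f%% of total)" %
      (free + cache, 100.0 * (free + cache) / total))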
Code: Select all
12:00:01 AM dentunusd file-sz inode-sz super-sz %super-sz dquot-sz %dquot-sz rtsig-sz %rtsig-sz
08:00:01 AM 4154 11160 5261 0 0.00 0 0.00 0 0.00
09:00:01 AM 1379 14190 4971 0 0.00 0 0.00 0 0.00
10:00:01 AM 4903 14925 6417 0 0.00 0 0.00 0 0.00
Average: 19341 10458 17254 0 0.00 0 0.00 0 0.00
Here the number of unused directory cache entries (dentunusd) drops during the event, but it is not an unreasonable number.
Code: Select all
12:00:01 AM totsck tcpsck udpsck rawsck ip-frag
08:00:01 AM 202 74 10 0 0
09:00:01 AM 314 124 10 0 0
10:00:01 AM 358 145 10 0 0
Average: 180 69 10 0 0
Nothing unusual in the number of sockets in use.
Code: Select all
12:00:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
08:00:01 AM 1 207 0.42 0.26 0.15
09:00:01 AM 0 262 8.09 9.64 10.63
10:00:01 AM 0 282 0.48 0.77 1.99
Average: 1 197 1.67 1.69 1.66
This is interesting: during the event there are no processes waiting for run time, and the number of processes in the process list is not unreasonable, yet the load average is one of the key indicators that there is a problem. (On Linux the load average also counts processes in uninterruptible sleep, typically waiting on disk I/O, not just runnable ones, which would explain a high load average alongside an empty run queue and an idle CPU.) I have seen the iowait time get up into the sixties and seventies and we still get reasonable response times, unless the load average starts to rise above 2 or 3.
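If the load average really is being driven by I/O wait, the processes responsible should show up in the "D" (uninterruptible sleep) state while an event is in progress. A minimal sketch that lists them (assumes the standard Linux /proc/<pid>/stat layout):
Code: Select all
#!/usr/bin/env python
# Minimal sketch: list processes in 'D' (uninterruptible sleep) state, i.e.
# the ones stuck waiting on I/O that inflate the load average.
import os

for pid in os.listdir("/proc"):
    if not pid.isdigit():
        continue
    try:
        with open("/proc/%s/stat" % pid) as f:
            stat = f.read()
    except IOError:                    # process exited while we were scanning
        continue
    # The second field is "(comm)" and may contain spaces; the single-letter
    # state comes right after the closing parenthesis.
    comm = stat[stat.index("(") + 1:stat.rindex(")")]
    state = stat[stat.rindex(")") + 2:].split()[0]
    if state == "D":
        print("%s  %s" % (pid, comm))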
So I saw high page-in rates and high disk read rates, and a number of indications that paging is going on, but nothing that clearly ties it to the problem. Maybe I just don't see the implications of these statistics, but I can't find any meaningful correlations between the data and the timing of the event.
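One way to take the guesswork out of the correlation hunting would be to compute an actual correlation coefficient between the load average and each metric across the sar intervals. A rough sketch using just the handful of samples quoted above (feeding in all the intervals from the sa data files would make it far more meaningful):
Code: Select all
# Rough sketch: Pearson correlation between the 1-minute load average and a
# few of the sar metrics quoted above.  With only three samples this is just
# illustrative; use every interval from the sa files for a real answer.
import math

ldavg1  = [0.42, 8.09, 0.48]                 # 08:00, 09:00, 10:00
metrics = {
    "%iowait":  [12.66, 93.42, 32.16],
    "rd_sec/s": [792.27, 2450.55, 1664.80],
    "cswch/s":  [307.03, 442.54, 587.26],
    "intr/s":   [1214.47, 1191.98, 1315.75],
}

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

for name in sorted(metrics):
    print("%-9s r = %+.2f" % (name, pearson(ldavg1, metrics[name])))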
You also mentioned Tomcat memory usage. I have long suspected that it may have something to do with the problem. Right now the java task is using 26.3% of memory, and I have "niced" its priority down to 12. I think the next time we get a slowdown I will try terminating Tomcat to see if that has any effect.
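Before killing Tomcat it might also be worth logging the java process's virtual and resident sizes over time, to see whether its footprint actually changes around an event. A minimal sketch (it simply assumes the JVM shows up with "java" in its command line; adjust if Tomcat runs under a wrapper), which could be run from cron every few minutes:
Code: Select all
#!/usr/bin/env python
# Minimal sketch: print VmSize/VmRSS for any process whose command line
# contains "java", e.g. the Tomcat JVM, with a timestamp for later comparison.
import os, time

def java_pids():
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % pid) as f:
                cmdline = f.read()
        except IOError:
            continue
        if "java" in cmdline:
            yield pid

stamp = time.strftime("%Y-%m-%d %H:%M:%S")
for pid in java_pids():
    fields = {}
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith(("VmSize", "VmRSS")):
                key, value = line.split(":", 1)
                fields[key] = value.strip()
    print("%s  pid %s  VmSize=%s  VmRSS=%s" %
          (stamp, pid, fields.get("VmSize"), fields.get("VmRSS")))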