Diagnosing Problems with Scalix Web Access
SWA depends on several major subsystems for proper operation. These include the Apache web server, the Tomcat application server running several web applications, Scalix LDAP service and the primary mail store which is accessed via IMAP. When problems are experienced by SWA users that are systemic in nature (that is, not a specific error related to performing one specific operation or referencing one specific message) it can be hard to pin down the source of the trouble. The many interacting subsystems, each with its own set of logs and configuration, present significant problem diagnosis challenges. The purpose of this document is to provide guidance on how to proceed and to collect together all the techniques that have proven useful in collecting diagnostic information from production SWA deployments.
Subsystem Connectivity Requirements
The various subsystems that comprise SWA need to have certain TCP ports open between each other. For example the SWA server needs to have IMAP (port 143) access to every user's message store server. The presence of exceptions like this in the web application logs is a sign that there is a connectivity problem between subsystems (network outages or a mis-configured firewall):
2008-03-14 05:00:56,380 WARN [AddressBook.lookupAttributes:169] Communication failure with LDAP server javax.naming.CommunicationException: server.example.com:389 [Root exception is java.net.ConnectException: Connection timed out]
Problem Diagnosis Methodology
The best approach when presented with a problem reported by SWA users is to begin at the client and work progressively 'inwards' towards the IMAP service, examining service health at each subsystem along the way. Techniques for the analysis of each subsystem are discussed below:
System Resource Usage
Before investigating specific Scalix subsystems, it is always useful to gather an overview of system health and utilization. This should be done for each system that is involved in processing SWA operations (for example both front-end web application machines, and back-end IMAP servers). The Linux 'top' utility displays CPU utilization (and memory usage per-process if the 'M' sort order is selected). iostat displays a view of disk utilization:
[root@goat logs]# iostat -x 1 Linux 2.6.18-53.1.4.el5 (goat.mtbrook.bozemanpass.com) 02/20/2009 avg-cpu: %user %nice %system %iowait %steal %idle 0.24 0.00 1.03 0.05 0.00 98.68 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.02 1.46 0.11 1.44 2.02 23.20 16.26 0.00 1.27 0.54 0.08 dm-0 0.00 0.00 0.11 2.87 1.87 22.96 8.33 0.02 5.24 0.28 0.08 dm-1 0.00 0.00 0.02 0.03 0.15 0.24 8.00 0.00 21.17 0.29 0.00 hdc 0.00 0.00 0.00 0.00 0.00 0.00 15.20 0.00 1.80 1.30 0.00
Pay particular attention to the last column (%util). This shows the proportion of the sample period (in this case one second) that the disk was busy. A value near 100% means that the disk is saturated and performance problems should be expected in that case. iostat is part of the 'sysstat' rpm package.
First check that Apache is running. Use:
Then verify that Apache is serving basic static web pages. Point a browser at :
This should display the default Apache home page, or if Apache has been configured to do so, a redirect to /webmail You can also use the telnet command to verify that Apache is accepting connections on ports 80 and 443:
telnet <server> 80
Next, examine Apache's log files. These are stored at /var/log/httpd:
[root@goat bin]# ls -lt /var/log/httpd total 5268 -rw-r--r-- 1 root root 1173498 Feb 20 00:08 access_log -rw-r--r-- 1 root root 2458 Feb 20 00:06 error_log -rw-r--r-- 1 root root 316 Feb 15 04:02 error_log.1 -rw-r--r-- 1 root root 4015 Feb 8 04:02 error_log.2 -rw-r--r-- 1 root root 4048819 Feb 5 02:53 access_log.1 -rw-r--r-- 1 root root 947 Feb 1 04:02 error_log.3 -rw-r--r-- 1 root root 125984 Jan 31 16:25 access_log.2 -rw-r--r-- 1 root root 739 Jan 28 23:00 access_log.3 -rw-r--r-- 1 root root 316 Jan 25 04:02 error_log.4 -rw-r--r-- 1 root root 739 Jan 14 11:36 access_log.4
If SSL is used there will be a second set of log files with the ssl_ prefix. access_log and ssl_access_log contain a log line for each completed request made by clients. This information can be useful to verify that clients are successfully contacting Apache, and to see what type of requests they are making. error_log and ssl_error_log will generally be empty on a server that has no problems, except for the occasional message like these:
[Sun Feb 15 04:02:04 2009] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations [Thu Feb 19 21:17:49 2009] [error] [client 188.8.131.52] File does not exist: /var/www/html/favicon.ico [Thu Feb 19 21:23:50 2009] [error] [client 184.108.40.206] File does not exist: /var/www/html/favicon.ico [Thu Feb 19 21:46:32 2009] [error] [client ::1] Directory index forbidden by Options directive: /var/www/html/
If there are large numbers of messages, or messages indicating errors have occurred , these may be useful in diagnosing the problem and should be retained for analysis. For example, here are some messages showing that Tomcat was unavailable:
[Wed Feb 04 03:35:37 2009] [error] proxy: AJP: disabled connection for (xxxxxxx) [Wed Feb 04 03:48:18 2009] [error] (111)Connection refused: proxy: AJP: attempt to connect to xx.xx.xx.xx:8009 (xxxxxxx) failed [Wed Feb 04 03:48:18 2009] [error] ap_proxy_connect_backend disabling worker for (xxxxxxx) [Wed Feb 04 03:48:18 2009] [error] proxy: AJP: failed to make connection to backend: xxxxxxx [Wed Feb 04 03:49:59 2009] [error] (111)Connection refused: proxy: AJP: attempt to connect to xx.xx.xx.xx:8009 (xxxxxxx) failed [Wed Feb 04 03:49:59 2009] [error] ap_proxy_connect_backend disabling worker for (xxxxxxx) [Wed Feb 04 03:49:59 2009] [error] proxy: AJP: failed to make connection to backend: xxxxxxx
The most common types of Apache error seen are:
# Failure to contact Tomcat via AJP, meaning that Tomcat is either down or unresponsive. # Connection overlimit errors, meaning that for some reason operations are taking a long time to process, leading to more and more connections accumulating under high load. # Operations errors in Apache 2.0 (RHEL4) AJP proxy code.
It's useful if any error messages can be correlated in time with the specific problem under investigation having occurred. There may be error messages present that have nothing to do with the problem at hand. Check the number of Apache processes running. A large number (hundreds) would be unusual. Use:
ps -ef | grep httpd | wc -l
Also check the number of sockets open on the Apache ports. Use:
netstat --numeric-ports -t -p | grep ":80 "
netstat --numeric-ports -t -p | grep ":443 "
Large numbers of open connections (more than one per active user), or large numbers of sockets in unconnected states (TIME_WAIT etc) might indicate trouble.
First check that Tomcat is running:
[root@goat /]# /etc/init.d/scalix-tomcat status Instance (goat) is not running
You should also see the Tomcat process (its executable is 'java'). Its pid will be useful later:
[root@goat /]# ps -ef | grep java root 13136 1 0 Feb05 pts/3 00:20:43 /usr/java/jre1.5.0_13/bin/java -server -Djava.net.preferIPv4Stack=true -Xms256m -Xmx256m -Dscalix.instance=/var/opt/scalix/gt -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/var/opt/scalix/gt/tomcat/conf/logging.properties -Djava.endorsed.dirs=/var/opt/scalix/gt/tomcat/common/endorsed -classpath /usr/java/jre1.5.0_13/lib/tools.jar:/var/opt/scalix/gt/tomcat/bin/bootstrap.jar:/var/opt/scalix/gt/tomcat/bin/commons-logging-api.jar -Dcatalina.base=/var/opt/scalix/gt/tomcat -Dcatalina.home=/var/opt/scalix/gt/tomcat -Djava.io.tmpdir=/var/opt/scalix/gt/tomcat/temp org.apache.catalina.startup.Bootstrap start
Tomcat should be listening on port 8009 for connections on the IP address associated with the Scalix instance only. Therefore it should be possible to telnet to port 8009 but only when the instance host name or IP address is specified (not localhost). Next verify that Tomcat is serving basic static content from the webmail servlet. Using a browser or other http client, visit this URL:
Now verify that messaging services (if they're used in this deployment) are available by visiting:
Note that you will need to supply a valid mailbox username and password. Tomcat logs errors to the file:
Any errors or Java exception stack traces in catalina.out should be investigated, especially if the same error is repeated often.
Tomcat Memory Usage
The amount of system memory assigned to Tomcat can be displayed with the ps command:
ps -eO vsize | grep java
or by running the 'top' utility and selecting memory sort order (hit the 'M' key). Tomcat should not use significantly more memory than has been configured for JVM heap. This value is visible in the -Xms and -Xmx options from the Tomcat's jvm command line:
[root@goat logs]# ps -eO vsize | grep java 13136 512784 S pts/3 00:21:27 /usr/java/jre1.5.0_13/bin/java -server -Djava.net.preferIPv4Stack=true -Xms256m -Xmx256m ...
If the Java code running inside Tomcat for some reason ever attempts to allocate more memory than the configured maximum heap size, an 'out of memory' exception will be raised. In most cases this is fatal and Tomcat exits. Normally there will be out of memory errors written to catalina.out, but in some cases the lack of available memory prevents the error messages from being written. Therefore if Tomcat appears to silently exit, the most likely cause is an out of memory exception. The best way to diagnose an out of memory problem is to use the jmap utility to dump the Java heap to a file while Tomcat is still running. The dumped file can be analyzed later to determine which object classes are using the most memory. Jmap is part of the JDK. Scalix installs only the JRE and so is necessary to find and install the specific version of JDK that matches the running JRE (for example 1.5.0_13). More information on jmap is available here . This document gives additional useful background on troubleshooting Java runtime problems .
Tomcat File Descriptor Usage
If Java code running inside Tomcat for some reason allocates file descriptors but never closes them, then over time large numbers of open file descriptors accumulate. There is a system-dependent limit on the maximum number of open file descriptors and once that limit is reached, Tomcat and the web applications is hosts will stop working properly. In most cases error messages relating to the inability to open files will be logged. It can be tricky to make the connection between the error messages and the file descriptor limit having been reached because the errors may occur anywhere that a new file descriptor (or socket) is needed. The list of open file descriptors can be displayed with the lsof command:
lsof -p <pid>
In the event of a resource leak, the filenames associated with the open file descriptors often gives a clue as to the source of the problem. lsof has its own RPM package which may need to be installed.
Dumping Thread Stacks
In situations where is has been established that either:
- Tomcat is using excessive CPU time (for example using 100% of at least one core)
- One of the web applications running in Tomcat has locked up, or become unresponsive
then it's useful to dump the JVM thread stacks (see ). Thread stacks are dumped to Tomcat's catalina.out log file when a QUIT signal is sent to the Tomcat process:
kill -QUIT <pid>
The Webmail web application has its own logs. These are found at:
In addition, webmail can write very detailed activity logs. Activity logging is disabled by default for disk-space and performance reasons. It is almost always necessary to enable activity logging in order to diagnose a problem with SWA. SWA activity logs only contain information pertaining to operations that were processed after they were enabled (obviously!). Therefore if users have reported problems and activity logging has subsequently been enabled, it will be necessary to trigger the problematic behavior again in order to gather the related log output. Activity logs are written to:
The Messaging Services web application has its own logs. These are found at:
In addition, messaging services can write very detailed activity logs. Activity logging is disabled by default for disk-space and performance reasons. It is almost always necessary to enable activity logging in order to diagnose a problem with SWA. SWA activity logs only contain information pertaining to operations that were processed after they were enabled (obviously!). Therefore if users have reported problems and activity logging has subsequently been enabled, it will be necessary to trigger the problematic behavior again in order to gather the related log output. Activity logs are written to:
If the evidence gathered from the web applications' logs suggests that errors or lack of service from the IMAP server is the underlying cause of the problem then it will be necessary to gather diagnostic information from the message server. The details of this process are beyond the scope of this document, but it is often useful to enable IMAP server logging. Instructions on how to do this are available here.