Diag/swa-problems

From Scalix Wiki
Jump to: navigation, search

Contents

Diagnosing Problems with Scalix Web Access

Introduction

SWA depends on several major subsystems for proper operation. These include the Apache web server, the Tomcat application server running several web applications, Scalix LDAP service and the primary mail store which is accessed via IMAP. When problems are experienced by SWA users that are systemic in nature (that is, not a specific error related to performing one specific operation or referencing one specific message) it can be hard to pin down the source of the trouble. The many interacting subsystems, each with its own set of logs and configuration, present significant problem diagnosis challenges. The purpose of this document is to provide guidance on how to proceed and to collect together all the techniques that have proven useful in collecting diagnostic information from production SWA deployments.

Architecture

Beginning at the SWA user's browser, the system is architected as follows: The browser fetches static content and javascript code from a web server. Once the browser-resident javascript code is running, it too sends requests to the web server. The web server is the standard Apache 2.0 or 2.2 running on the machine that hosts the SWA service (which we will call the front-end server). Apache's job is to handle client HTTP connections, to perform SSL processing and to send the client requests on to Tomcat. TCP connections made by users' browsers to Apache are on port 80 or 443 (for SSL). Tomcat is essentially a web server, but with the special capability to run Java programs in-process. Tomcat itself is written in Java. We use Apache, rather than Tomcat, to handle the end-user browser connections because Apache has better scaling and security than Tomcat, and because SSL is more efficiently handled by Apache. Communication between Apache and Tomcat is via the AJP protocol which is similar but not identical to HTTP. The use of AJP allows the number of connections made to Tomcat to be kept reasonably low. This is because AJP, unlike HTTP, supports connection re-use for requests from different clients. AJP connections between Apache and Tomcat use port 8009. Tomcat is a web server written in the Java language and supporting the servlet interface. A servlet is a bundle of Java code that accepts HTTP requests and processes them to produce responses. The servlet concept has been expanded over time to include various associated aspects of application deployment such as packaging and the specification of URL paths, leading to the definition of the 'web application'. Although there are many web applications running in Tomcat, two are of primary importance for SWA : the swa server (called 'webmail') and the scalix messaging services, or platform (called 'api'). The webmail servlet is responsible for serving all SWA static content : images, html pages, css files and javascript. It also handles all AJAX (XmlHttpRequest) calls made by the browser-resident javascript. To perform certain operations the webmail servlet will in turn make a request to the messaging services servlet. The use of messaging services is optional. The default configuration is to use messaging services. In smaller deployments the messaging services servlet running on the same machine as webmail will be used (hence running in the same Tomcat instance). However is larger deployments it is possible that multiple messaging services web applications have been deployed, on several machines. Both webmail and messaging services in turn connect to LDAP and IMAP servers to access user information and message store content. All the subsystems : Apache, Tomcat, webmail, messaging services, IMAP and LDAP must be available and working properly in order for SWA to function correctly.

Subsystem Connectivity Requirements

The various subsystems that comprise SWA need to have certain TCP ports open between each other. For example the SWA server needs to have IMAP (port 143) access to every user's message store server. The presence of exceptions like this in the web application logs is a sign that there is a connectivity problem between subsystems (network outages or a mis-configured firewall):

2008-03-14 05:00:56,380  WARN [AddressBook.lookupAttributes:169] Communication
failure with LDAP server
javax.naming.CommunicationException: server.example.com:389 [Root exception is
java.net.ConnectException: Connection timed out]

Problem Diagnosis Methodology

The best approach when presented with a problem reported by SWA users is to begin at the client and work progressively 'inwards' towards the IMAP service, examining service health at each subsystem along the way. Techniques for the analysis of each subsystem are discussed below:

Javascript Client

SWA's browser-resident Javascript code has a diagnostic logging capability. This feature is disabled by default for security and performance reasons. Follow these instructions to enable client diagnostic logging.

System Resource Usage

Before investigating specific Scalix subsystems, it is always useful to gather an overview of system health and utilization. This should be done for each system that is involved in processing SWA operations (for example both front-end web application machines, and back-end IMAP servers). The Linux 'top' utility displays CPU utilization (and memory usage per-process if the 'M' sort order is selected). iostat displays a view of disk utilization:

[root@goat logs]# iostat -x 1
Linux 2.6.18-53.1.4.el5 (goat.mtbrook.bozemanpass.com)  02/20/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.24    0.00    1.03    0.05    0.00   98.68

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.02     1.46  0.11  1.44     2.02    23.20    16.26     0.00    1.27   0.54   0.08
dm-0              0.00     0.00  0.11  2.87     1.87    22.96     8.33     0.02    5.24   0.28   0.08
dm-1              0.00     0.00  0.02  0.03     0.15     0.24     8.00     0.00   21.17   0.29   0.00
hdc               0.00     0.00  0.00  0.00     0.00     0.00    15.20     0.00    1.80   1.30   0.00

Pay particular attention to the last column (%util). This shows the proportion of the sample period (in this case one second) that the disk was busy. A value near 100% means that the disk is saturated and performance problems should be expected in that case. iostat is part of the 'sysstat' rpm package.

Apache

First check that Apache is running. Use:

/etc/init.d/httpd status

Then verify that Apache is serving basic static web pages. Point a browser at :

http://<server>/

This should display the default Apache home page, or if Apache has been configured to do so, a redirect to /webmail You can also use the telnet command to verify that Apache is accepting connections on ports 80 and 443:

telnet <server> 80

Next, examine Apache's log files. These are stored at /var/log/httpd:

[root@goat bin]# ls -lt /var/log/httpd
total 5268
-rw-r--r-- 1 root root 1173498 Feb 20 00:08 access_log
-rw-r--r-- 1 root root    2458 Feb 20 00:06 error_log
-rw-r--r-- 1 root root     316 Feb 15 04:02 error_log.1
-rw-r--r-- 1 root root    4015 Feb  8 04:02 error_log.2
-rw-r--r-- 1 root root 4048819 Feb  5 02:53 access_log.1
-rw-r--r-- 1 root root     947 Feb  1 04:02 error_log.3
-rw-r--r-- 1 root root  125984 Jan 31 16:25 access_log.2
-rw-r--r-- 1 root root     739 Jan 28 23:00 access_log.3
-rw-r--r-- 1 root root     316 Jan 25 04:02 error_log.4
-rw-r--r-- 1 root root     739 Jan 14 11:36 access_log.4

If SSL is used there will be a second set of log files with the ssl_ prefix. access_log and ssl_access_log contain a log line for each completed request made by clients. This information can be useful to verify that clients are successfully contacting Apache, and to see what type of requests they are making. error_log and ssl_error_log will generally be empty on a server that has no problems, except for the occasional message like these:

[Sun Feb 15 04:02:04 2009] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations
[Thu Feb 19 21:17:49 2009] [error] [client 69.145.82.247] File does not exist: /var/www/html/favicon.ico
[Thu Feb 19 21:23:50 2009] [error] [client 69.145.82.247] File does not exist: /var/www/html/favicon.ico
[Thu Feb 19 21:46:32 2009] [error] [client ::1] Directory index forbidden by Options directive: /var/www/html/

If there are large numbers of messages, or messages indicating errors have occurred , these may be useful in diagnosing the problem and should be retained for analysis. For example, here are some messages showing that Tomcat was unavailable:

[Wed Feb 04 03:35:37 2009] [error] proxy: AJP: disabled connection for (xxxxxxx)
[Wed Feb 04 03:48:18 2009] [error] (111)Connection refused: proxy: AJP: attempt to connect to xx.xx.xx.xx:8009 (xxxxxxx) failed
[Wed Feb 04 03:48:18 2009] [error] ap_proxy_connect_backend disabling worker for (xxxxxxx)
[Wed Feb 04 03:48:18 2009] [error] proxy: AJP: failed to make connection to backend: xxxxxxx
[Wed Feb 04 03:49:59 2009] [error] (111)Connection refused: proxy: AJP: attempt to connect to xx.xx.xx.xx:8009 (xxxxxxx) failed
[Wed Feb 04 03:49:59 2009] [error] ap_proxy_connect_backend disabling worker for (xxxxxxx)
[Wed Feb 04 03:49:59 2009] [error] proxy: AJP: failed to make connection to backend: xxxxxxx

The most common types of Apache error seen are:

# Failure to contact Tomcat via AJP, meaning that Tomcat is either down or unresponsive.
# Connection overlimit errors, meaning that for some reason operations are taking a long time to process, leading to more and more connections accumulating under high load.
# Operations errors in Apache 2.0 (RHEL4) AJP proxy code.

It's useful if any error messages can be correlated in time with the specific problem under investigation having occurred. There may be error messages present that have nothing to do with the problem at hand. Check the number of Apache processes running. A large number (hundreds) would be unusual. Use:

ps -ef | grep httpd | wc -l

Also check the number of sockets open on the Apache ports. Use:

netstat --numeric-ports -t -p | grep ":80 "

and

netstat --numeric-ports -t -p | grep ":443 "

Large numbers of open connections (more than one per active user), or large numbers of sockets in unconnected states (TIME_WAIT etc) might indicate trouble.

Tomcat

First check that Tomcat is running:

[root@goat /]# /etc/init.d/scalix-tomcat status
Instance (goat) is not running

You should also see the Tomcat process (its executable is 'java'). Its pid will be useful later:

[root@goat /]# ps -ef | grep java
root     13136     1  0 Feb05 pts/3    00:20:43 /usr/java/jre1.5.0_13/bin/java -server -Djava.net.preferIPv4Stack=true 
-Xms256m -Xmx256m -Dscalix.instance=/var/opt/scalix/gt -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager 
-Djava.util.logging.config.file=/var/opt/scalix/gt/tomcat/conf/logging.properties 
-Djava.endorsed.dirs=/var/opt/scalix/gt/tomcat/common/endorsed -classpath
/usr/java/jre1.5.0_13/lib/tools.jar:/var/opt/scalix/gt/tomcat/bin/bootstrap.jar:/var/opt/scalix/gt/tomcat/bin/commons-logging-api.jar 
-Dcatalina.base=/var/opt/scalix/gt/tomcat -Dcatalina.home=/var/opt/scalix/gt/tomcat 
-Djava.io.tmpdir=/var/opt/scalix/gt/tomcat/temp org.apache.catalina.startup.Bootstrap start

Tomcat should be listening on port 8009 for connections on the IP address associated with the Scalix instance only. Therefore it should be possible to telnet to port 8009 but only when the instance host name or IP address is specified (not localhost). Next verify that Tomcat is serving basic static content from the webmail servlet. Using a browser or other http client, visit this URL:

http://<server>/webmail

Now verify that messaging services (if they're used in this deployment) are available by visiting:

http://<server>/api

Note that you will need to supply a valid mailbox username and password. Tomcat logs errors to the file:

/var/opt/scalix/<instance>/tomcat/logs/catalina.out

Any errors or Java exception stack traces in catalina.out should be investigated, especially if the same error is repeated often.

Tomcat Memory Usage

The amount of system memory assigned to Tomcat can be displayed with the ps command:

ps -eO vsize | grep java

or by running the 'top' utility and selecting memory sort order (hit the 'M' key). Tomcat should not use significantly more memory than has been configured for JVM heap. This value is visible in the -Xms and -Xmx options from the Tomcat's jvm command line:

[root@goat logs]# ps -eO vsize | grep java
13136 512784 S pts/3    00:21:27 /usr/java/jre1.5.0_13/bin/java -server -Djava.net.preferIPv4Stack=true -Xms256m -Xmx256m 
...

If the Java code running inside Tomcat for some reason ever attempts to allocate more memory than the configured maximum heap size, an 'out of memory' exception will be raised. In most cases this is fatal and Tomcat exits. Normally there will be out of memory errors written to catalina.out, but in some cases the lack of available memory prevents the error messages from being written. Therefore if Tomcat appears to silently exit, the most likely cause is an out of memory exception. The best way to diagnose an out of memory problem is to use the jmap utility to dump the Java heap to a file while Tomcat is still running. The dumped file can be analyzed later to determine which object classes are using the most memory. Jmap is part of the JDK. Scalix installs only the JRE and so is necessary to find and install the specific version of JDK that matches the running JRE (for example 1.5.0_13). More information on jmap is available here [1]. This document gives additional useful background on troubleshooting Java runtime problems [2].

Tomcat File Descriptor Usage

If Java code running inside Tomcat for some reason allocates file descriptors but never closes them, then over time large numbers of open file descriptors accumulate. There is a system-dependent limit on the maximum number of open file descriptors and once that limit is reached, Tomcat and the web applications is hosts will stop working properly. In most cases error messages relating to the inability to open files will be logged. It can be tricky to make the connection between the error messages and the file descriptor limit having been reached because the errors may occur anywhere that a new file descriptor (or socket) is needed. The list of open file descriptors can be displayed with the lsof command:

lsof -p <pid> 

In the event of a resource leak, the filenames associated with the open file descriptors often gives a clue as to the source of the problem. lsof has its own RPM package which may need to be installed.

Dumping Thread Stacks

In situations where is has been established that either:

  1. Tomcat is using excessive CPU time (for example using 100% of at least one core)
  2. One of the web applications running in Tomcat has locked up, or become unresponsive

then it's useful to dump the JVM thread stacks (see [3]). Thread stacks are dumped to Tomcat's catalina.out log file when a QUIT signal is sent to the Tomcat process:

kill -QUIT <pid>

Webmail

The Webmail web application has its own logs. These are found at:

/var/opt/scalix/<instance>/tomcat/logs/scalix-swa.log*

In addition, webmail can write very detailed activity logs. Activity logging is disabled by default for disk-space and performance reasons. It is almost always necessary to enable activity logging in order to diagnose a problem with SWA. SWA activity logs only contain information pertaining to operations that were processed after they were enabled (obviously!). Therefore if users have reported problems and activity logging has subsequently been enabled, it will be necessary to trigger the problematic behavior again in order to gather the related log output. Activity logs are written to:

/var/opt/scalix/<instance>/tomcat/logs/scalix-swa-activity.log*

Messaging Services

The Messaging Services web application has its own logs. These are found at:

/var/opt/scalix/<instance>/tomcat/logs/scalix-api.log*

In addition, messaging services can write very detailed activity logs. Activity logging is disabled by default for disk-space and performance reasons. It is almost always necessary to enable activity logging in order to diagnose a problem with SWA. SWA activity logs only contain information pertaining to operations that were processed after they were enabled (obviously!). Therefore if users have reported problems and activity logging has subsequently been enabled, it will be necessary to trigger the problematic behavior again in order to gather the related log output. Activity logs are written to:

/var/opt/scalix/<instance>/tomcat/logs/scalix-api-activity.log*

IMAP

If the evidence gathered from the web applications' logs suggests that errors or lack of service from the IMAP server is the underlying cause of the problem then it will be necessary to gather diagnostic information from the message server. The details of this process are beyond the scope of this document, but it is often useful to enable IMAP server logging. Instructions on how to do this are available here.