Web Site Performance

Dear OSEHRA Members,

A number of members have noted that the web site has been unacceptably slow and that, over the last few days, it seems to be getting slower.

We too have noted this problem and are taking steps to isolate it, identify its cause, and correct it.

As you may be aware, OSEHRA is built upon the Drupal open source content management platform. The platform is maintained and developed by a large international community of users and developers. So we are not only working to resolve the issue with our own internal resources, but are also reaching out to the Drupal open source community to improve your user experience.

We will keep the community informed. We appreciate your continued support and interest in OSEHRA and hope to have this problem resolved soon, so we can get back to the serious business of improving the open EHR.

Thank you in advance for your patience,

Conrad Clyburn
Community Development
OSEHRA (Open Source EHR Agent), Inc.
clyburnc@osehra.org
(571) 858-3205
(301) 404-9128 (cell)

Comments

OSEHRA web site performance issues


Dear OSEHRA Members,

I want to follow up on Conrad Clyburn's posting concerning web site performance. As he indicated, we take your reports of poor performance very seriously and have been working to understand the source of the problem. I want to provide an update on where we are and what steps we have taken.

The OSEHRA.org web site is hosted on a Dell 510 server running the Xen hypervisor and CentOS 5.6. The virtual machine is configured with 4 CPUs, an 80 GB hard drive, 6 GB of swap, and 4 GB of memory. The site itself is built on Apache/2.2.3 and Drupal Commons 6.x-1.7, which is in turn supported by Drupal core 6.22 and MySQL 5.1.58. Membership management is supported by the CiviCRM add-on package.
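For readers who want to compare against a similar stack, the versions above can be read directly off the server; a quick sketch (the commands are standard, and the outputs shown as comments are illustrative):

$ cat /etc/redhat-release    # CentOS release 5.6 (Final)
$ httpd -v                   # Server version: Apache/2.2.3
$ mysql --version            # ... Distrib 5.1.58 ...
$ php -v                     # the PHP build behind Drupal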

The physical server is connected by 1 Gbps Ethernet to the 10 Gbps network backbone at the hosting site. Current Internet connectivity is 2 Gbps to the commercial Internet and 1 Gbps to Internet II. Regular monitoring of server resources and network traffic shows very low utilization and no pattern of network congestion.
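For reference, utilization numbers of this sort can be spot-checked with the sysstat tools; a minimal sketch:

$ sar -u 1 5        # CPU utilization, five one-second samples
$ sar -n DEV 1 5    # per-interface network throughput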

When performance issues were first reported, we took steps to tune the system and add performance enhancements. We added APC (http://pecl.php.net/package/APC), a PHP accelerator cache (http://en.wikipedia.org/wiki/PHP_accelerator), to the production server and noted considerable improvement in page load times. We also adjusted MySQL configuration parameters, increasing the sizes and numbers of buffers, caches, etc., following the recommendations from Red Hat and MySQL for a very large ("huge") system.
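For anyone who wants to verify similar settings on their own server, checks along these lines work (a sketch; the exact variable names vary by MySQL version):

$ php -r 'var_dump(extension_loaded("apc"));'    # true if APC is loaded
$ mysql -u root -p -e "SHOW VARIABLES LIKE '%buffer_size%'"
$ mysql -u root -p -e "SHOW VARIABLES LIKE 'query_cache%'"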

We have run the Acquia Insight site assessment tool (http://www.acquia.com/products-services/acquia-network/cloud-services/insight) and confirmed that our configuration is sound. There are some cache settings we cannot enable, since this is Drupal Commons rather than a static web site, but that was expected. We made a few minor adjustments, and our Insight score is now 99%.

So from the server side, nothing appears to be amiss. One of the members of the OSEHRA team created a test script (timeTst) that lets us measure access times for specific pages from any location on the Internet:

#!/bin/bash
# timeTst: probe a URL once per second and log curl's timing breakdown as CSV.
# Usage: ./timeTst <URL>

echo "timestamp,time_namelookup,time_connect,time_appconnect,time_pretransfer,time_redirect,time_starttransfer,time_total"
while :
do
    # -m 60 caps each request at 60 seconds; -o /dev/null discards the page body.
    result=`curl -m 60 -o /dev/null -s -w "%{time_namelookup},%{time_connect},%{time_appconnect},%{time_pretransfer},%{time_redirect},%{time_starttransfer},%{time_total}" "$1"`
    time=`date --rfc-3339=s`
    echo "$time,$result"
    sleep 1
done

We have run this script from many locations around the country and from a wide range of client platforms. We also compared access to a page on the OSEHRA site with access to a page on a Confluence wiki hosted in the same data center:

$ ./timeTst http://osehra.org/group/development-tools > tests_OSEHRA

$ ./timeTst https://wiki.cancerimagingarchive.net/display/Public/Research+Projects > tests_TCIA

What we see in this data is an occasional timeout (capped at 60 seconds by curl's -m option) that seems to occur while waiting for data from the server. Here is an example:

2011-12-07 16:36:51-05:00,0.317,0.356,0.000,0.356,0.000,0.399,0.518
2011-12-07 16:36:53-05:00,0.005,0.049,0.000,0.049,0.000,0.096,0.226
2011-12-07 16:36:54-05:00,0.004,0.046,0.000,0.046,0.000,0.092,0.265
2011-12-07 16:36:55-05:00,0.004,0.044,0.000,0.044,0.000,0.087,0.244
2011-12-07 16:36:57-05:00,0.005,0.044,0.000,0.044,0.000,0.087,0.248
2011-12-07 16:36:58-05:00,0.004,0.044,0.000,0.044,0.000,0.087,0.251
2011-12-07 16:36:59-05:00,0.005,0.044,0.000,0.044,0.000,0.088,0.249
2011-12-07 16:37:00-05:00,0.005,0.044,0.000,0.044,0.000,0.088,0.249
2011-12-07 16:37:02-05:00,0.004,0.044,0.000,0.044,0.000,0.087,0.245
2011-12-07 16:37:03-05:00,0.004,0.044,0.000,0.044,0.000,0.087,0.247
2011-12-07 16:38:04-05:00,0.005,0.052,0.000,0.052,0.000,0.098,60.000
2011-12-07 16:38:05-05:00,0.005,0.046,0.000,0.046,0.000,0.092,0.258
2011-12-07 16:38:06-05:00,0.004,0.043,0.000,0.043,0.000,0.086,0.244

The row at 16:38:04, where the total time hits the 60-second cap, shows a timeout event. We have tried to match the delayed responses to simultaneous HTTP requests in the server access_log to see if the delays occurred while serving other pages, but found nothing unusual. There are also no corresponding errors in the Apache error_log, the internal Drupal log, /var/log/messages, etc.

Yet this pattern is repeatable when access is from a slow home network in St. Louis or an even slower hotel network in Salt Lake City, while it has never been observed from the hosting site itself, regardless of how many clients are executing the script simultaneously. Similarly, no observer has reported such a timeout when accessing the non-OSEHRA wiki page. What is most interesting is that access time (the rightmost number is the total access time in seconds) is consistently sub-second per page; then a timeout occurs, after which performance returns to the previous pattern. Note that the script requests exactly the same web page each time, so one would expect consistent numbers governed mainly by network access.
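For convenience, a filter along these lines (our addition; it assumes the CSV layout shown above, with time_total in the final field) pulls the timeout rows out of a capture file:

$ awk -F, '$NF >= 60' tests_OSEHRA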

So the slowness is real: it can be reproduced with this script, and it appears to be associated specifically with the OSEHRA site, yet it does not correlate with any measurement we have of server performance. It also seems to be worse on slower client networks, i.e., the frequency of the timeout events increases as the client network slows.

This is where we are at the moment. We are instantiating a test version of the OSEHRA web site at another physical location, on different hardware and different network infrastructure, and we will repeat our tests there to see if we can identify the root cause of the problem. We are also establishing a copy of the site that is not part of our cloud environment, i.e., one that runs a single operating system image on a dedicated server rather than in a virtual machine.

We sincerely apologize for the inconvenience and will gladly accept any insights the community might wish to share on potential causes of the problem.

Fred Prior

OSEHRA (Open Source EHR Agent), Inc.
priorf@osehra.org

 


OSEHRA Web Site Performance Issues Resolved


As you well know, the OSEHRA web site has suffered from performance problems for some time now. These typically manifested as pages that would hang and not refresh. As we reported earlier, a great deal of effort by the OSEHRA team and the community has gone into tracking down the cause of this problem. We are happy to report that we have found the cause and believe we have a solution in place.
 
There is a bug in the iptables implementation in the Linux kernel (Red Hat versions 5 and 6 and derivatives such as the CentOS version used on the virtual machines that host the OSEHRA site). Apparently the developers were trying to optimize the TCP window calculation (how long to wait for an acknowledgement packet) but introduced a bug that causes conversations to be prematurely terminated when network latencies are high. The problem may be exacerbated by an interaction with the Cisco firewall modules we use to protect our servers. We have an interim workaround in place that seems effective, and we are communicating with Cisco and Red Hat about a more appropriate, permanent solution. There is almost no documentation on this bug; it does not appear to have been reported to Red Hat.
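For the technically curious: the kernel's strict TCP window tracking in the connection-tracking (conntrack) code can be relaxed with a sysctl. We note this as a commonly documented mitigation for out-of-window packet drops, not necessarily the exact workaround we deployed:

# RHEL/CentOS 5 (ip_conntrack module)
$ sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
# RHEL/CentOS 6 (nf_conntrack module)
$ sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1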
 
As an aside, I mentioned that the problem was exacerbated on high-latency networks. A major source of the data used to track down the bug came from a packet sniffer run on a high-latency network, thanks to a local coffee shop. We also received important data from research labs at Harvard and Washington University that had experienced similar problems when moving large blocks of data.
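For anyone who wants to gather similar evidence, a capture along these lines is enough to see the premature terminations in a tool such as Wireshark (a sketch; the interface name and filter are illustrative):

$ tcpdump -i eth0 -s 0 -w osehra.pcap host osehra.org and tcp port 80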
 
We thank the community for your patience and for your active participation in finding and solving the problem.
 
