Integration Server Sinking (Becoming Unresponsive) Multiple Times

Venkata_Siva_Subrahmanyam_Chavali · October 3, 2013, 5:14am

Hi,

I am working in a webMethods environment with only two Integration Servers in a cluster. Node 1 handles most of the load, while only particular applications run on Node 2.

The traffic on Node 2 is lower than on Node 1. However, we have an issue where the IS becomes unresponsive at times and transactions take a very long time to process. At times, transit times of several minutes have been observed.

Most of our interactions are with DB2 servers. When the server sinks, we observe that the number of system threads rises to a high level (still within the maximum limit, and even within the threshold limit) and memory utilization climbs to 95% or more (it is usually 80% to 90%).

The server logs show “Broken pipe error” and “SQL30108N. A connection failed but has been re-established.” but nothing else.

We had to restart the server to make it respond properly, and I feel that restarting a production Integration Server every time should not be the solution.

Can someone suggest how I can troubleshoot this issue? (While the issue is occurring, the server is completely unresponsive, so no action can be taken through the IS Admin console.)

Thanks
SS

Tong_Wang · October 3, 2013, 10:41pm

Based on what you described, it seems the slowness is in the DB calls. Work with your DBA to find out which DB activity the system spends the most time waiting on.
It may be as simple as adding an index.

Venkata_Siva_Subrahmanyam_Chavali · October 4, 2013, 3:43am

Hi Tong Wang,

Thanks for your reply.

In fact, I have checked the logs for my code that makes the DB calls. During the time frame when the server becomes unresponsive, some services took between 45 seconds and 7 minutes to execute while talking to the DB (these services also make LDAP calls to validate user authentication and authorization). Under normal conditions, these services take 90 to 200 milliseconds.

In the clustered environment, one server works fine, but the other one keeps sinking for the reason above.

So I wonder whether there is something else we need to look at. Please let me know your view based on the above observations.

Thanks
SS

Tong_Wang · October 4, 2013, 3:49pm

If you can add some logging, you can find out whether the LDAP call or the DB call is slow. My guess is still the DB.
You have to get your DBA involved; there is no other way to find the root cause.
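
A rough sketch of what that logging could look like, assuming the LDAP and DB calls can each be wrapped from Java code (for example inside a Java service). The class name `CallTimer`, the labels, and the 1-second threshold are made up for illustration, and `System.out` stands in for whatever logger the server actually uses:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class CallTimer {

    // Hypothetical threshold: the thread reports 90-200 ms as normal,
    // so anything at or above 1 second is flagged as slow here.
    public static final long SLOW_MS = 1000;

    // Builds one log line per call so LDAP and DB timings can be compared.
    public static String format(String label, long elapsedMs) {
        String level = elapsedMs >= SLOW_MS ? "SLOW" : "ok";
        return level + " " + label + " elapsedMs=" + elapsedMs;
    }

    // Runs the given call, logs how long it took (even if it throws),
    // and returns the call's result unchanged.
    public static <T> T timed(String label, Callable<T> call) throws Exception {
        long start = System.nanoTime();
        try {
            return call.call();
        } finally {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println(format(label, elapsedMs));
        }
    }

    public static void main(String[] args) throws Exception {
        // The sleeps are placeholders for the real LDAP lookup and DB2 query.
        String user = timed("ldap.lookup", () -> { Thread.sleep(50); return "user"; });
        Integer rows = timed("db2.query", () -> { Thread.sleep(120); return Integer.valueOf(3); });
        System.out.println(user + " " + rows);
    }
}
```

With one line per call, the log entries from the time frame of the incident should show whether the minutes are being spent in the LDAP step or the DB step, which gives the DBA something concrete to start from.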