Servers constantly drop out of the cluster

Hi there,

For the last 2-3 weeks, one IS server keeps leaving one of our 3-server clusters (IS version 9.9). I can only assume that some service is generating a lot of load and causing long GC cycles. Do you have any idea what the best way is to track down the services that cause the most load? I know that I can enable GC logging, but in our experience these logs have never helped us identify the culprit efficiently (the last time we had this kind of issue, the cause was an unzipping Java service that allocated a lot of unnecessary memory, and it was extremely hard to find).

Best regards,
n23

This may seem a bit of an unusual approach, but perhaps one way to solve this is to stop using IS clustering. Is there a specific function being leveraged that requires its use? In my experience it is mostly not necessary (indeed, there are a couple of old threads on the forums about avoiding IS clustering). There are only a couple of specific cases that really need it.

Regarding the “I can only assume…” – beware of assumptions. Be driven by data/logs if possible. We’ve all experienced cases where we chase something (based upon guesses) and then find out the root issue was something else – and the error message was telling us the issue all along but we didn’t trust it/pay attention.

The stats.log may give clues about when resources are being exhausted or close to it. Use the data in those files to create a chart in Excel. That can help identify “server is in trouble” time frames, which can lead to investigating possible culprits in the audit logs.
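As a rough alternative to charting by hand, the same "server is in trouble" time frames can be picked out with a few lines of Python. This is only a sketch: it assumes the stats.log data has already been exported to CSV, and the column names `timestamp`, `used_mem` and `total_mem` as well as the sample rows are made up for illustration, not the actual stats.log layout.

```python
import csv
import io


def flag_busy_samples(csv_text, threshold=0.9):
    """Return (timestamp, ratio) for rows whose used/total memory ratio
    is at or above the threshold.

    Column names are hypothetical; adapt them to your own export.
    """
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = float(row["total_mem"])
        if total <= 0:
            continue  # skip broken rows instead of dividing by zero
        ratio = float(row["used_mem"]) / total
        if ratio >= threshold:
            flagged.append((row["timestamp"], round(ratio, 3)))
    return flagged


# Made-up sample rows, only to show the expected input shape.
SAMPLE = """timestamp,used_mem,total_mem
2020-01-01 10:00:00,4000,10000
2020-01-01 10:01:00,9500,10000
2020-01-01 10:02:00,9900,10000
"""

if __name__ == "__main__":
    for ts, ratio in flag_busy_samples(SAMPLE):
        print(ts, ratio)
```

The flagged timestamps are then the windows to look up in the audit logs.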

If you’re certain that the root cause is one or more services, you can use profilers such as VisualVM to sift through heap dumps, pick out unusually high memory consumption or memory leaks, and check the CPU consumption of your services.

There was an IS package named RichStatistics shared on the forums (unofficial, so do not use it on Production unless you know what you are doing), which you can use to identify services with long execution times and drill down, as long as your service auditing principles are good. MWS can be used, but RichStatistics is easier and more visual. Caveat: it consumes resources heavily if you’re evaluating a medium-to-large service execution history.

However, I’d suggest that you investigate the platform/application first. Is this particular IS unduly overloaded because your Load Balancer isn’t working? Is there unreliable bandwidth/latency that’s causing your node to pop out of the cluster? Keep in mind that your IS nodes may have been sized a while ago, but your volumes have increased since. Perhaps you need to re-size them, but don’t do this without evidence.

@reamon hits the nail on the head - do you really need Clustering? He has also suggested stats.log which is a good window into your IS. There are too many cogs involved in a performance issue, so follow the process of elimination, but base your decisions on concrete evidence.

P.S. Consider an upgrade, since the bigger the gap between your version and the latest, the more risk you’ll bear when you eventually upgrade.

Hi,

When we had this issue on a different cluster we also thought that the problem was with the IS sizing but everything was caused by a crappy service. After replacing it the servers started running smoothly. This is why I think that there is a bad implementation causing this.

n23

Hi N 23,
While the best approach would certainly be a support ticket to find the root cause, a few more details would help us help you here.
What kind of clustering are you talking about? I assume you mean a TSA-based stateful cluster.
In that case, several logs are useful for narrowing down your problem.
On the IS side: server.log, stats.log, gc.log (if you have it) and wrapper.log.
The first goal should be to understand and avoid the root cause of the full GCs.
On the TSA side we should also look at the logs, to rule out that the problem sits there.

GDPR notice: While gc.log, stats.log and wrapper.log have by nature no personal information, please make sure in case you share a server.log that it was “sanitized”.

The questions from above are for sure also valid: Do you really need a stateful cluster?

Hi,

I will for sure also open a support ticket but I wanted to know if there are simple and fast ways for finding for example that a service is messing up. I also tried to find the RichStatistics package to test on the dev environment (as I also noticed that dev is running very slow) but I could not find it anymore.
Yes, we are using a TSA based cluster, can’t really answer why we are using this since I was not involved when the environments got set up. What is the most efficient alternative to this for instance if you want to distribute load on 3 identical servers?

n23

Put a load balancer in front of the 3 IS nodes (e.g. with a “least connection” policy).
Whether it needs to be configured as “sticky” again depends on whether you need a stateful or a stateless cluster.
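To illustrate what a “least connection” policy does, here is a minimal Python sketch. It is not how any particular load balancer product implements it; the class name and the node names `is1`, `is2`, `is3` are hypothetical.

```python
class LeastConnectionsBalancer:
    """Toy model of a least-connection policy: each new request goes to
    the node that currently has the fewest open connections."""

    def __init__(self, nodes):
        # Open connection count per node, in the order the nodes were given.
        self.active = {node: 0 for node in nodes}

    def acquire(self):
        # min() returns the first node with the lowest count, so ties are
        # broken by the original node order.
        node = min(self.active, key=self.active.get)
        self.active[node] += 1
        return node

    def release(self, node):
        # Called when a request finishes; never drop below zero.
        if self.active[node] > 0:
            self.active[node] -= 1


if __name__ == "__main__":
    lb = LeastConnectionsBalancer(["is1", "is2", "is3"])  # hypothetical names
    first = [lb.acquire() for _ in range(3)]
    lb.release("is2")
    print(first, lb.acquire())
```

A node that is stuck on slow requests keeps its connections open longer and therefore receives fewer new ones, which is why this policy tends to suit uneven workloads better than plain round robin.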

Hi,

This depends on whether you have a single integration with high load that needs to be distributed over several instances, or multiple integrations that can be separated onto different instances while sharing the Messaging System and the MWS for transport and monitoring.
In the second scenario there is no need for clustering or a load balancer; you can control the load on the IS instances simply by putting the two high-load parts on different instances.

Regards.
Holger

If it turns out to be a set of services that are the culprits, then Load Balancing, choice of algorithms, throwing resources at them, JVM tuning, etc., may not yield the best results. At that point, you’ll have to redesign and redevelop your resource-guzzling interfaces - this is where you’ll maximize the returns, from my experience.

I see that RichStatistics was rebranded and commercialized (post), so you can use Optimize or other external tools if you have the licenses. VisualVM is an open-source option.

KM

Thanks for the replies!
I will give it a try with VisualVM. Is there any tutorial for using it with webMethods (finding the services which cause performance issues)? I assume that it’s safe to connect remotely to the prod JVMs, right?

n23

VisualVM operates on the JVM of the IS, so it’s not specific to wM. Any profiler will consume resources, so try it out on a lower environment first and gauge if it’s acceptable, especially since your IS is already being overloaded.

jvisualvm and jconsole ship with the JVM, so you will find them both under your IS installation here - /rootFolder/softwareag/jvm/jvm/bin. When you run them from within the same installation, they should connect automatically as long as the JMX port is not blocked for security reasons.

There is online content available on how to use them and troubleshoot.

KM

Hi there,

To give you an update, we managed to connect to the DEV IS via the JMX port and noticed that we constantly have >80% CPU utilization. In the Sampler when going to hot spots I can see that we got the nativeSAP_CMLISTEN using 80% of CPU. Does this mean that there is an issue with one of our SAP listeners?

Br,
n23

Your second picture looks like the JVM is short on memory. Why does it permanently have > 10 GB of used heap?
What does your JVM gc.log look like? Do you see any full GC entries in there?
Being short on memory naturally means high CPU load caused by the JVM itself as it tries to free up memory.
The “active” threads you see in such a case are the ones asking for that memory.
This does not necessarily mean they are the source of the problem; they may simply be active and the victims of the situation.
If you take a heap dump and analyze it with a suitable tool such as MAT, you can find your big memory consumers.
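To check quickly whether the gc.log contains long full GC pauses, something like the following Python sketch may help. The exact line format depends on the JVM version and logging flags, so the two patterns below (classic `[Full GC ... secs]` lines and unified-logging `Pause Full ... ms` lines) are assumptions to adjust against your own file, and the sample lines are made up.

```python
import re

# Assumed formats; adapt the patterns to what your gc.log really contains.
JDK8_FULL = re.compile(r"\[Full GC.*?(\d+\.\d+) secs\]")
UNIFIED_FULL = re.compile(r"Pause Full.*?(\d+\.\d+)ms")


def full_gc_pauses(lines, min_seconds=1.0):
    """Return (pause_seconds, line) for every full GC entry whose pause
    is at least min_seconds."""
    pauses = []
    for line in lines:
        match = JDK8_FULL.search(line)
        if match:
            seconds = float(match.group(1))
        else:
            match = UNIFIED_FULL.search(line)
            if not match:
                continue  # not a full GC line in either assumed format
            seconds = float(match.group(1)) / 1000.0
        if seconds >= min_seconds:
            pauses.append((seconds, line.strip()))
    return pauses


# Made-up lines that only illustrate the two assumed formats.
SAMPLE_LINES = [
    "12.345: [GC (Allocation Failure) 5120K->3000K(10240K), 0.0123456 secs]",
    "98.765: [Full GC (Ergonomics) 9000K->8500K(10240K), 2.3456789 secs]",
    "[120.5s][info][gc] GC(7) Pause Full (Allocation Failure) 900M->850M(1024M) 1500.250ms",
]

if __name__ == "__main__":
    for seconds, line in full_gc_pauses(SAMPLE_LINES):
        print(f"{seconds:.3f}s  {line}")
```

Frequent full GCs that free very little memory would fit the "short on memory" picture described above, and their timestamps can be compared with the moments a node left the cluster.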

Did you manage to solve the issue? Can you share the resolution?

KM