webMethods Integration Server: Performance Issue

Hello All,

We recently upgraded from Integration Server 9.12 to version 10.5. There were no issues for the first two months, but then we gradually started seeing high memory and CPU utilization.

The logs show many long GC pauses, which in turn lead to failures of both put and remove operations against the cache:

2021-08-19 18:25:24 CDT [ISS.0033.0155E] Could not create local session 5be05b70386b4ac28b49a3be4afc338e from the cached session values. get timed out.
2021-08-19 18:27:23 CDT [ISS.0036.0009E] Ping Failed to server: server04:7777 Exception: [ISC.0064.9306] Connection was closed during read
2021-08-19 18:33:11 CDT [ISS.0033.0154E] Could not save session 7a627008faa048919f84c7d9fa598587 to the session cache. put timed out.
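
To quantify those pauses, GC logging can be enabled on each IS JVM. This is a sketch only: the wrapper property indexes and paths below are placeholders you must adapt to your install, and the flags assume the Java 8 JVM bundled with IS 10.5 (on Java 9+ the unified `-Xlog:gc*` syntax replaces them).

```
# Hypothetical additions to profiles/IS_default/configuration/custom_wrapper.conf
# (use the next free wrapper.java.additional.N indexes in your file; paths are placeholders)
wrapper.java.additional.20=-XX:+PrintGCDetails
wrapper.java.additional.21=-XX:+PrintGCDateStamps
wrapper.java.additional.22=-Xloggc:/path/to/logs/gc.log
wrapper.java.additional.23=-XX:+UseGCLogFileRotation
wrapper.java.additional.24=-XX:NumberOfGCLogFiles=5
wrapper.java.additional.25=-XX:GCLogFileSize=20M
wrapper.java.additional.26=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.27=-XX:HeapDumpPath=/path/to/dumps
```

The heap-dump flags cost nothing until an OutOfMemoryError actually occurs, and the resulting dump is usually the fastest way to identify a leak in a memory-analysis tool.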

We have already made a few tuning changes on our end:

  • Terracotta server memory was increased from 2 GB to 4 GB on 4th Aug.
  • The DBA increased the login timeout from 60 seconds to 90 seconds for the webMethods-to-ERP connections on 5th Aug.
  • Finally, VM memory (at the OS level) was increased from 25 GB to 40 GB on all 4 nodes on 9th Aug.

The issue is seen about 90% of the time on one VM (Node03) and about 10% on another (Node04).

Each VM hosts one A2A Integration Server (for internal applications) and one B2B Integration Server (for external partner connections via Trading Networks).
On Node03, either the A2A or the B2B Integration Server hits the same problem every other day, or twice a week: more IS-to-Terracotta errors such as "put timed out" or "get timed out", a few network errors, degraded performance, and eventually a hung state.

We captured diagnostic logs during that period and shared them with Software AG; tech support confirmed this is not an IS issue and did not recommend any tuning.

The same type of issue has been discussed in earlier threads, but none of them ended with a definite answer on how it was resolved.

Thanks,
Barry

Performance troubleshooting is about investigating and narrowing down the weak points before arbitrarily throwing resources at the problem, although you probably did that because of business constraints. Since the system ran fine for two months post-upgrade, focus on the features and changes you have introduced since the issue started.

I have linked a post below; use those tools to narrow down the issue. Since you see frequent GCs, the next obvious places to look are your configuration and code (possible memory leaks). Enable auditing for your parent and key services, and examine them for unusual resource consumption or response times.
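
As a starting point for the GC angle, a small scan of the GC log can tell you how often pauses exceed your cache timeouts. The sketch below is my own illustration, not a webMethods tool: it assumes the Java 8 `-XX:+PrintGCDetails` log format, where each GC line ends with the total pause time as "…, N.NNNNNNN secs]".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Minimal sketch: scan Java 8 GC log lines and collect pauses longer
 * than a threshold, so you can correlate long pauses with the
 * "put timed out" / "get timed out" errors in the IS server log.
 */
public class GcPauseScan {
    // Matches a pause timing such as "4.5678901 secs]" at the end of a GC log line.
    private static final Pattern PAUSE = Pattern.compile("([0-9]+\\.[0-9]+) secs\\]");

    public static List<Double> longPauses(List<String> lines, double thresholdSecs) {
        List<Double> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            // A line can report several timings; the last one is the total pause.
            double last = -1;
            while (m.find()) {
                last = Double.parseDouble(m.group(1));
            }
            if (last >= thresholdSecs) {
                result.add(last);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "2021-08-19T18:25:10.123-0500: 12345.678: [GC (Allocation Failure) 123456K->98765K(262144K), 0.0456789 secs]",
            "2021-08-19T18:25:24.001-0500: 12359.556: [Full GC (Ergonomics) 98765K->45678K(262144K), 4.5678901 secs]"
        );
        // Flag anything above 1 second; only the Full GC line qualifies here.
        System.out.println(longPauses(sample, 1.0)); // prints [4.5678901]
    }
}
```

If long pauses cluster around the timestamps of the cache errors, the timeouts are a symptom of GC, and the leak hunt (heap dump plus service auditing) is the right next step rather than more Terracotta tuning.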

As I said, there are too many cogs to investigate them all at once, so proceed by the rule of elimination.

KM