IS servers leaving cluster on a daily basis

Dear colleagues,

We are encountering a very strange issue in our production environment. We have a cluster with 2 servers (please see the settings below). Every day we need to restart both servers so that they can rejoin the cluster. Can you please help us find out what might be causing this?

IS server:

Product webMethods Integration Server
Version 9.9.0.0
Updates IS_9.9_Core_Fix23
TNS_9.9_Fix5
IS_9.9_SPM_Fix4
Build Number 102

Terracotta config:

<tc:tc-config xmlns:tc="http://www.terracotta.org/config">

     <!-- Tolerant timeout settings taken from: http://www.terracotta.org/documentation/high-availability.html -->
     <!-- l2 to l1 is Server timing out (and ejecting) the Client  -->
     <property name="l2.healthcheck.l1.ping.enabled" value="true" />
     <property name="l2.healthcheck.l1.ping.idletime" value="5000" />
     <property name="l2.healthcheck.l1.ping.interval" value="1000" />
     <property name="l2.healthcheck.l1.ping.probes" value="3" />
     <property name="l2.healthcheck.l1.socketConnect" value="true" />
     <property name="l2.healthcheck.l1.socketConnectTimeout" value="5" />
     <property name="l2.healthcheck.l1.socketConnectCount" value="10" />

     <!-- Client reconnection properties -->
     <property name="l2.l1reconnect.enabled" value="true" />
     <property name="l2.l1reconnect.timeout.millis" value="2000" />
server-data server-logs 9510 9520 9530 server-data server-logs 9510 9520 9530 %(com.softwareag.tc.client.logs.directory)

Terracotta client log:

2018-06-02 04:09:31,547 [RemoteTransactionManager Flusher] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[3683]: Ignoring RemoteTransactionManagerTask because status State[ REJOIN_IN_PROGRESS ]
2018-06-02 04:09:37,564 [RemoteTransactionManager Flusher] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[3683]: Ignoring RemoteTransactionManagerTask because status State[ REJOIN_IN_PROGRESS ]
2018-06-02 04:09:43,669 [RemoteTransactionManager Flusher] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[3683]: Ignoring RemoteTransactionManagerTask because status State[ REJOIN_IN_PROGRESS ]
2018-06-02 04:09:43,838 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:38993)] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(-1.ffffffffffffffffffffffffffffffff.03506bc8-8a62-4248-abb7-212431fe288b-163ba0da4a7.USER): CLOSE EVENT : com.tc.net.core.TCConnectionImpl@1558944424: connected: false, closed: true local=10.12.141.33:33102 remote=10.12.141.33:9510 connect=[Sat Jun 02 04:09:31 CEST 2018] idle=12442ms [0 read, 0 write]. STATUS : SYN_SENT
2018-06-02 04:09:43,838 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:38993)] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(-1.ffffffffffffffffffffffffffffffff.03506bc8-8a62-4248-abb7-212431fe288b-163ba0da4a7.USER): closing down connection - com.tc.net.core.TCConnectionImpl@1558944424: connected: false, closed: true local=10.12.141.33:33102 remote=10.12.141.33:9510 connect=[Sat Jun 02 04:09:31 CEST 2018] idle=12442ms [0 read, 0 write]
2018-06-02 04:09:43,838 [L1_L2:TCComm Main Selector Thread_W (listen 0:0:0:0:0:0:0:0:38993)] INFO com.tc.net.core.TCConnection - error writing to channel java.nio.channels.SocketChannel[closed]: null
2018-06-02 04:09:55,707 [RemoteTransactionManager Flusher] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[3683]: Ignoring RemoteTransactionManagerTask because status State[ REJOIN_IN_PROGRESS ]
2018-06-02 04:09:55,709 [TC Memory Monitor] WARN tc.operator.event - NODE : ClientID[3683] Subsystem: MEMORY_MANAGER EventType: MEMORY_LONGGC Message: Detected long GC>8,000ms. GC count:2. GC Time:11,753ms. Frequent long GC cycles cause severe performance degradation.
2018-06-02 04:10:00,760 [L1_L2:TCComm Main Selector Thread_W (listen 0:0:0:0:0:0:0:0:38993)] INFO com.tc.net.core.TCConnectionManager - error event on connection com.tc.net.core.TCConnectionImpl@1558944424: connected: false, closed: true local=10.12.141.33:33102 remote=10.12.141.33:9510 connect=[Sat Jun 02 04:09:31 CEST 2018] idle=29364ms [0 read, 0 write]: null
2018-06-02 04:10:00,772 [RemoteTransactionManager Flusher] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[3683]: Ignoring RemoteTransactionManagerTask because status State[ REJOIN_IN_PROGRESS ]
2018-06-02 04:10:00,894 [Rejoin Worker] WARN com.tc.platform.rejoin.RejoinManagerImpl - Error during channel open
java.net.ConnectException: Connection refused

Best regards,
Oliver
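
For reference, the l2.healthcheck.l1.* values in the config above can be turned into a rough worst-case detection window. The sketch below uses the formula as I recall it from the Terracotta high-availability documentation (idletime plus socketConnectCount times the sum of the ping phase and the socket-connect phase); treat the formula and the function name as assumptions to check against the documentation for your Terracotta version. It only does the arithmetic and does not say which mechanism actually ejected the client.

```python
def max_healthcheck_window_ms(idletime, interval, probes,
                              socket_connect_timeout, socket_connect_count):
    """Approximate time (ms) before the server gives up on a silent client.

    Assumed formula (verify against the Terracotta HA documentation):
      idletime + socketConnectCount *
          (interval * probes + socketConnectTimeout * interval)
    """
    ping_phase = interval * probes
    connect_phase = socket_connect_timeout * interval
    return idletime + socket_connect_count * (ping_phase + connect_phase)


# Values copied from the posted tc-config.
window_ms = max_healthcheck_window_ms(
    idletime=5000, interval=1000, probes=3,
    socket_connect_timeout=5, socket_connect_count=10,
)

reconnect_timeout_ms = 2000   # l2.l1reconnect.timeout.millis from the config
observed_gc_pause_ms = 11753  # "GC Time:11,753ms" from the client log

print("healthcheck window (ms):", window_ms)
print("GC pause exceeds reconnect timeout:",
      observed_gc_pause_ms > reconnect_timeout_ms)
```

With the posted values the healthcheck window is far longer than the logged GC pause, while the reconnect timeout is much shorter than it, so the reconnect window may be the more interesting setting to look at.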

Not sure if you want to pursue this line of thinking, but perhaps one approach is to eliminate the use of IS clustering? Which features of IS clustering is your environment actually using?

Hello,

We need at least 2 servers in a cluster to handle the load, which is significant in the production environment.
Quick question: I can see a lot of entries like this in the tc server log: " ClientID[6178] Subsystem: MEMORY_MANAGER EventType: MEMORY_LONGGC Message: Detected long GC>8,000ms. GC count:4. GC Time:15,761ms. Frequent long GC cycles cause severe performance degradation"

What does the 8,000ms mean exactly? Is it the maximum allowed time for GC?
If so, is there any setting on the Terracotta side to increase this threshold?

Thank you in advance,
Oliver
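
To see how often these long-GC events occur and how long the pauses are, the log can be scanned for MEMORY_LONGGC lines. This is a minimal sketch that assumes the line layout shown in the excerpts above; the function name and the regular expression are my own, and the sample line is copied from the posted client log. It does not answer what the 8,000ms threshold means, it only collects the events so they can be lined up against the times the servers leave the cluster.

```python
import re

# Matches operator-event lines such as the one quoted in the client log.
LONG_GC = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"ClientID\[(?P<client>\d+)\].*EventType: MEMORY_LONGGC.*"
    r"GC count:(?P<count>\d+)\. GC Time:(?P<gc_ms>[\d,]+)ms"
)


def long_gc_events(lines):
    """Return (timestamp, client_id, gc_time_ms) for each long-GC line."""
    events = []
    for line in lines:
        match = LONG_GC.search(line)
        if match:
            gc_ms = int(match.group("gc_ms").replace(",", ""))
            events.append((match.group("ts"), int(match.group("client")), gc_ms))
    return events


# Sample line taken from the client log posted earlier in this thread.
sample = [
    "2018-06-02 04:09:55,709 [TC Memory Monitor] WARN tc.operator.event - "
    "NODE : ClientID[3683] Subsystem: MEMORY_MANAGER EventType: MEMORY_LONGGC "
    "Message: Detected long GC>8,000ms. GC count:2. GC Time:11,753ms. "
    "Frequent long GC cycles cause severe performance degradation.",
    "2018-06-02 04:10:00,894 [Rejoin Worker] WARN "
    "com.tc.platform.rejoin.RejoinManagerImpl - Error during channel open",
]

for ts, client, gc_ms in long_gc_events(sample):
    print(ts, "client", client, "GC pause", gc_ms, "ms")
```

To run it against a real file, pass an open file object instead of the sample list, for example long_gc_events(open(path)) where path points at your own Terracotta client or server log.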

You don’t need an IS cluster for that. You just need a load balancer in front of them.

This old, old thread may be helpful.

http://tech.forums.softwareag.com/techjforum/posts/list/40113.page

Note that the posts about scheduled tasks needing IS clustering have been overcome by events – those no longer need IS clustering and instead use a shared DB to manage “run on any single instance” tasks.