Among the settings used to configure the connections between Ehcache 2 (also known as "clients" or "L1") and Terracotta servers (also known as "TC Servers" or "BigMemory servers" or "L2"), the health check settings are somewhat, sometimes, misunderstood ... To clear things up, let's dive into those settings and see how they affect the system !
For this article, we'll need :
- a BigMemoryMax kit (4.x), that you can get from the Terracotta.org website
- a sample Ehcache client example (we'll use one from the code samples included in the kit)
- a debugger to pause threads (using an IDE such as Eclipse or IntelliJ to run the sample client will be sufficient)
- to run the client and the server on 2 different hosts (I've been using my laptop and a Docker daemon running in a Minikube VM)
The basics: start with the documentation
The documentation explains how to configure the health checker and even gives few examples
A bit further: default properties in the tc.jar archive
On the other hand, if you open up the tc.jar archive (available under BIGMEMORY_LOCATION/server/lib/tc.jar
) and read the com/tc/properties/tc.properties
file, you'll also find mentions of the health checker configuration and its default values.
The configurations settings and their meanings
Since there are 3 sets of values (one for L1 -> L2 healthcheck, another for L2 -> L1 and another for L2 -> L2 passive sync), we can choose 1 set, and infer the meaning for the 2 other sets.
Let's investigate L2 -> L1 healthcheck.
In the tc.properties file, we have the following comments and default values:
###########################################################################################
# Section : HealthChecker { server(l2)->client(l1), server(l2)->server(l2) (HA), client(l1)->server(l2) }
# ping.enabled - If true, healthchecker is enabled.
# ping.idletime - Connection idletime (in milliseconds), after which healthchecker
# starts its ping test.
# ping.interval - The interval (in milliseconds) between healthchecker sending ping
# messages to the peer node which doesn't reply to its previous msgs.
# ping.probes - Total number of ping messages to be sent to the peer node before
# concluding the peer is dead.
# socketConnect - If true, apart from above ping-probe cycle, healthcheker does extra
# check like socket connect (to detect long GC) to see if the peer has
# any traces of life left
# socketConnectCount - Max number of successful socket connect that healthcheker
# can trust. Beyond which, no socket connects will be
# attempted and peer node is tagged as dead.
# socketConnectTimeout - Socket timeout (integer, in number of ping.interval) when
# connecting to the peer node. On timeout, healthchecker
# concludes peer node as dead irrespective of previous
# successful socket connects.
###########################################################################################
# L2 -> L1 :
# These settings will detect a network disconnect (like a cable pull) in 10 seconds and
# will allow a 40 second GC in the L1
l2.healthcheck.l1.ping.enabled = true
l2.healthcheck.l1.ping.idletime = 5000
l2.healthcheck.l1.ping.interval = 1000
l2.healthcheck.l1.ping.probes = 3
l2.healthcheck.l1.socketConnect = true
l2.healthcheck.l1.socketConnectTimeout = 5
l2.healthcheck.l1.socketConnectCount = 10
I did not include the checktime settings that deal with time difference between hosts, because they're fairly trivial to understand.
Nothing to add to those comments, they match what the documentation explains; but let's investigate the sentence "These settings will detect a network disconnect (like a cable pull) in 10 seconds and will allow a 40 second GC in the L1" ; how did those 10 and 40 seconds got calculated ?
The answer by the demo
For this demo, we'll need to start the Terracotta Server on another host, I've chosen to run it in a Docker container running in a VM (that will allow me to have "clean" state without effort and will also allow me to simulate network failures easily).
After I determined the IP address of the host running the server (192.168.99.100
), I needed to create a Terracotta configuration (stored under ~/.tc/tc-config-docker.xml
for example) :
<?xml version="1.0" encoding="UTF-8"?>
<!-- All content copyright Terracotta, Inc., unless otherwise indicated. All rights reserved. -->
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-9.1.xsd">
<servers>
<mirror-group group-name="OnlyStripe">
<server host="192.168.99.100" name="OnlyServer">
<data>/root/server-%h-1/StripeOne_ActiveServer1_server-data</data>
<logs>/root/server-%h-1/StripeOne_ActiveServer1_server-logs</logs>
<data-backup>/root/server-%h-1/StripeOne_ActiveServer1_server-backups</data-backup>
<tsa-port>9510</tsa-port>
<tsa-group-port>9530</tsa-group-port>
<offheap>
<enabled>true</enabled>
<maxDataSize>512m</maxDataSize>
</offheap>
</server>
</mirror-group>
<update-check>
<enabled>false</enabled>
</update-check>
<garbage-collection>
<enabled>false</enabled>
</garbage-collection>
<restartable enabled="true"/>
</servers>
<clients>
<logs>/root/client-logs</logs>
</clients>
</tc:tc-config>
Pretty basic 1 stripe, 1 server configuration; only the IP address here is interesting.
We'll also need debug logs, but since they can be overwhelming, we'll target one single package (that contains all the healthcheck code); in this case the location and the name of this property file is very important (~/.tc.dev.log4j.properties
)
log4j.logger.com.tc.net.protocol.transport=DEBUG, stdout
log4j.logger.com.tc.net.protocol.transport.WireProtocolAdaptorImpl=OFF
In a nutshell: we want debug logs for all classes in the package named com.tc.net.protocol.transport
but not from the class WireProtocolAdaptorImpl
(unless you want to be flooded with TCP level messages !)
Let's run the server!
docker run -d --name=tc --network=host \
-v ~/.tc.dev.log4j.properties:/root/.tc.dev.log4j.properties \
-v ~/.tc/tc-config-docker.xml:/tc-config.xml \
-v BIGMEMORY_LOCATION/:/bmm openjdk:8-stretch \
/bmm/server/bin/start-tc-server.sh \
-f /tc-config.xml
To watch the logs:
docker logs -f tc
To stop it, we'll use:
docker rm -f tc
But before stopping, we need a client example, and hopefully enough the one provided by the kit (under BIGMEMORY_LOCATION/code-samples/src/main/java/com/bigmemory/samples/configfile/ConfigFileBasedBigMemoryConfiguration.java
) is almost perfect for our example; we'll just load it up in an IDE, set the classpath properly and add some "put" statements and some "sleep" too in a while loop !
public class ConfigFileBasedBigMemoryConfiguration {
/**
* Run a test with BigMemory, configured with an xml configuration file
*
* @throws java.io.IOException
*/
public static void main(String[] args) throws IOException {
System.out.println("**** Retrieve config from xml ****");
CacheManager manager = CacheManager.newInstance(ConfigFileBasedBigMemoryConfiguration.class.getResource("/xml/ehcache.xml"));
try {
Cache bigMemory = manager.getCache("bigMemory");
//bigMemory is now ready.
//Successfully retrieve from the file
System.out .println("**** Successfully created - Config file based **** ");
for (int i = 0; i < 2000; i++) {
bigMemory.put(new Element("key" + i, "value" + i));
System.out.println("put key : " + i);
Thread.sleep(1000);
}
} finally {
if (manager != null) manager.shutdown();
}
}
}
Since this sample is loading an XML configuration file, let's have a look at it, and edit it to change the server url, with the IP address of the host that will be running our tc server :
<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://ehcache.org/ehcache.xsd"
name="config">
<cache name="bigMemory"
maxBytesLocalHeap="128M"
copyOnRead="true" eternal="true">
<terracotta/>
</cache>
<terracottaConfig url="192.168.99.100:9510" rejoin="false"/>
</ehcache>
Simulating a long pause (Garbage Collection) in the client
To simulate a long GC (or a any event that almost freezes the client), we'll run the client code in debug mode and stop all threads at a breakpoint the loop.
Let's run the client "normally" for few seconds , then let's pause it and look at the server logs :
2019-06-07 14:11:40,779 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:40,780 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:11:41,781 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:11:41,781 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:42,782 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:11:42,782 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:43,782 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:11:43,783 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:11:43,784 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:11:43,784 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:11:43,789 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:11:43,789 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 1
2019-06-07 14:11:43,789 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:11:44,785 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:44,785 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:11:45,786 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:11:45,786 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:46,786 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:11:46,786 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:47,787 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:11:47,787 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:11:47,787 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:11:47,788 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:11:47,791 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:11:47,791 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 2
2019-06-07 14:11:47,792 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:11:48,789 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:48,789 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:11:49,790 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:11:49,790 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:50,791 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:11:50,791 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:51,792 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:11:51,792 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:11:51,793 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:11:51,793 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:11:51,795 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:11:51,795 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 3
2019-06-07 14:11:51,795 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:11:52,794 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:52,795 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:11:53,796 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:11:53,796 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:54,798 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:11:54,798 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:55,799 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:11:55,799 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:11:55,801 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:11:55,801 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:11:55,805 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:11:55,806 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 4
2019-06-07 14:11:55,806 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:11:56,802 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:56,802 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:11:57,803 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:11:57,803 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:58,804 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:11:58,804 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:11:59,805 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:11:59,806 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:11:59,806 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:11:59,811 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:11:59,816 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:11:59,816 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 5
2019-06-07 14:11:59,817 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:12:00,812 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:00,813 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:12:01,814 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:12:01,814 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:02,816 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:12:02,816 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:03,818 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:12:03,818 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:12:03,818 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:12:03,819 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:12:03,820 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:12:03,821 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 6
2019-06-07 14:12:03,822 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:12:04,819 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:04,820 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:12:05,821 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:12:05,821 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:06,822 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:12:06,823 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:07,823 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:12:07,824 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:12:07,824 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:12:07,824 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:12:07,826 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:12:07,826 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 7
2019-06-07 14:12:07,826 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:12:08,825 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:08,825 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:12:09,826 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:12:09,826 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:10,827 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:12:10,827 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:11,830 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:12:11,830 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:12:11,830 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:12:11,830 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:12:11,834 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:12:11,836 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 8
2019-06-07 14:12:11,836 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:12:12,831 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:12,832 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:12:13,833 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:12:13,833 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:14,834 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:12:14,834 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:15,835 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:12:15,835 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:12:15,836 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:12:15,836 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:12:15,838 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:12:15,838 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 9
2019-06-07 14:12:15,838 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ ALIVE ]
2019-06-07 14:12:16,836 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:16,837 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ ALIVE ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 14:12:17,838 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 1 times (max allowed:3).
2019-06-07 14:12:17,838 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:18,839 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 2 times (max allowed:3).
2019-06-07 14:12:18,839 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:57068
2019-06-07 14:12:19,840 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:57068 for 3 times (max allowed:3).
2019-06-07 14:12:19,840 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 14:12:19,840 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:57068(callbackport:57067)
2019-06-07 14:12:19,841 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 14:12:19,842 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:57068(callbackport:57067) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_IDLE ]
2019-06-07 14:12:19,842 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 might be in Long GC. Ping-probe cycles completed since last reply : 10. But its too long. No more retries
2019-06-07 14:12:19,842 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:57068 : State[ SOCKET_CONNECT ] ===> State[ DEAD ]
2019-06-07 14:12:20,841 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:57068 is DEAD
2019-06-07 14:12:20,841 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: TSA Server - Declared connection dead ConnectionID(0.80eb7f8012ff4a6895bc9d7a00af78d0.f4038219-2045-477b-97a4-d167deb95338-16b3246f6a4.USER)[] idle time 45535ms
2019-06-07 14:12:20,841 [HealthChecker] DEBUG com.tc.net.protocol.transport.ServerMessageTransport - Changing from ESTABLISHED to DISCONNECTED
We can see that the server, after noticing the client was idle for more than 5 seconds (l2.healthcheck.l1.ping.idletime = 5000
), started probing it at 14:11:40,779, 3 times (l2.healthcheck.l1.ping.probes = 3
) each separated by 1 second (l2.healthcheck.l1.ping.interval = 1000
); 3 seconds passed by.
But since it got no reply, it tried to connect a socket (l2.healthcheck.l1.socketConnect = true
) at 14:11:43,784 and it was successful doing so; so it retried probing (3 seconds) after 1 second at 14:11:44,785 - since l2.healthcheck.l1.socketConnectCount = 10
, it tried 10 times in total ! (10 * [3+1]) are 40 seconds before considering the client fatally not responsive anymore at 14:12:19,842
All in all, as stated in the documentation:
Max Time = (ping.idletime) + socketConnectCount * [(ping.interval * ping.probes) + ping.interval]
Ok so we could prove the "will allow a 40 second GC in the L1" statement, but what about "will detect a network disconnect (like a cable pull) in 10 seconds" ?
Simulating a network disconnect
For this one, we'll start fresh: new Terracotta server run, new client run (not in debug mode this time).
After some seconds running, let's bring down the Terracotta Server network interface (in my case the VM network interface) :
sudo ifconfig vboxnet0 down
Let's wait a minute or two before putting it back up :
sudo ifconfig vboxnet0 up
and now let's have a look at the server logs :
2019-06-07 16:06:01,511 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:51412
2019-06-07 16:06:01,512 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:51412 : State[ START ] ===> State[ AWAIT_PINGREPLY ]
2019-06-07 16:06:02,512 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:51412 for 1 times (max allowed:3).
2019-06-07 16:06:02,513 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:51412
2019-06-07 16:06:03,513 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:51412 for 2 times (max allowed:3).
2019-06-07 16:06:03,514 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Sending PING Probe to IDLE 192.168.99.1:51412
2019-06-07 16:06:04,515 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - PING_REPLY not received from 192.168.99.1:51412 for 3 times (max allowed:3).
2019-06-07 16:06:04,515 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:51412 : State[ AWAIT_PINGREPLY ] ===> State[ SOCKET_CONNECT ]
2019-06-07 16:06:04,516 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect triggered for 192.168.99.1:51412(callbackport:51411)
2019-06-07 16:06:04,517 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:51412(callbackport:51411) : State[ SOCKETCONNECT_IDLE ] ===> State[ SOCKETCONNECT_IN_PROGRESS ]
2019-06-07 16:06:05,517 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect to 192.168.99.1:51412(callbackport:51411) listener port in progress.
2019-06-07 16:06:06,518 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect to 192.168.99.1:51412(callbackport:51411) listener port in progress.
2019-06-07 16:06:07,519 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect to 192.168.99.1:51412(callbackport:51411) listener port in progress.
2019-06-07 16:06:08,520 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect to 192.168.99.1:51412(callbackport:51411) listener port in progress.
2019-06-07 16:06:09,521 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect to 192.168.99.1:51412(callbackport:51411) listener port in progress.
2019-06-07 16:06:10,522 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect to 192.168.99.1:51412(callbackport:51411) taking long time. probably not reachable.
2019-06-07 16:06:10,522 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Socket Connect Context state change for 192.168.99.1:51412(callbackport:51411) : State[ SOCKETCONNECT_IN_PROGRESS ] ===> State[ SOCKETCONNECT_FAIL ]
2019-06-07 16:06:10,522 [HealthChecker] DEBUG com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - Context state change for 192.168.99.1:51412 : State[ SOCKET_CONNECT ] ===> State[ DEAD ]
2019-06-07 16:06:10,522 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TSA Server - 192.168.99.1:51412 is DEAD
2019-06-07 16:06:10,524 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: TSA Server - Declared connection dead ConnectionID(0.b36771ddb18644f6abbfb460445c496f.8f02e3e2-8a88-49a4-a44c-8898b3c6fa93-16b32aeb23b.USER)[] idle time 14693ms
2019-06-07 16:06:10,524 [HealthChecker] DEBUG com.tc.net.protocol.transport.ServerMessageTransport - Changing from ESTABLISHED to DISCONNECTED
When the server found out the client was idle, it started probing at 16:06:01,511, and same as before, decided to connect a socket after 3 tries and 1 second at 16:06:05,517; but this time the socket could not connect after 5 seconds (l2.healthcheck.l1.socketConnectTimeout = 5
* l2.healthcheck.l1.ping.interval = 1000
); so the server noticed it was taking too long at 16:06:10,522 and considered the connection dead.
So we have a total of 9 seconds to consider the network disconnect fatal; pretty close to 10 !
Max Time = (ping.idletime) + 1 * [(ping.interval * ping.probes) + ping.interval] + (socketConnectTimeout * ping.interval)
I hope that this investigation could you help you understand more about how the communication between Ehcache 2 and the Terracotta server is health checked ! (you'll notice the official documentation added the ping.idletime
in its calculation; it's valid too !).
Thanks to my colleagues Mojtaba for the idea and Myron for his help !