webMethods 8 cluster

Hi

I am having issues setting up a wM 8 cluster. I have followed the guide and confirmed the multicast IP test. Everything looks good: I can create a cluster on either machine, but neither server can join the cluster while the other server is up.

Has anyone here seen this error at startup, and can you offer some help?

2012-02-03 15:07:42 CET [ISS.0033.0168C] Cluster Node Name: QAWMLOADIS3.dev.site.
2012-02-03 15:07:42 CET [WmSharedCacheSC.config.0595I] Reading cache configuration: distributedCache
2012-02-03 15:08:13 CET [WmSharedCacheSC.impl.0200E] Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-02-03 15:08:13 CET [ISS.0033.0161E] CLUSTERING DISABLED: Could not create or join distributed clustering cache “ClusterMembers”. This server is NOT a member of a cluster.
2012-02-03 15:08:13 CET [ISS.0033.0104I] Could not add server to the cluster. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-02-03 15:08:13 CET [ISS.0033.0103W] Could not open the session cache. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-02-03 15:08:13 CET [ISS.0025.0008I] State Manager started

It seems it can’t reach the distributed cache. In the clustering guide, check the section “What Happens When Integration Server Cannot Connect to the Distributed Cache?”

Also make sure your network allows multicast on the port you specified.

Which IS 8.x version are you seeing this on? 8.2 SP2?

Hi, thanks for the replies. I have run the Update Manager, so my guess is 8.2 SP2; the admin page says 8.2.2.

I have run the multicast test between the servers and it turns out fine. Example of the test output:

Configuring multicast socket…
Starting listener…
Fri Feb 03 14:58:29 CET 2012: Sent packet 1.
Fri Feb 03 14:58:29 CET 2012: Received test packet 1 from self (sent 10ms ago).
Fri Feb 03 14:58:30 CET 2012: Received test packet 11 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
Fri Feb 03 14:58:31 CET 2012: Sent packet 2.
Fri Feb 03 14:58:31 CET 2012: Received test packet 2 from self
Fri Feb 03 14:58:32 CET 2012: Received test packet 12 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
Fri Feb 03 14:58:33 CET 2012: Sent packet 3.
Fri Feb 03 14:58:33 CET 2012: Received test packet 3 from self
Fri Feb 03 14:58:34 CET 2012: Received test packet 13 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
Fri Feb 03 14:58:35 CET 2012: Sent packet 4.
Fri Feb 03 14:58:35 CET 2012: Received test packet 4 from self
Fri Feb 03 14:58:36 CET 2012: Received test packet 14 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
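
For anyone who wants to repeat this kind of check without the vendor tool, here is a rough Python sketch of what a multicast test like the one above does. The group address, port and TTL are taken from the output above; the packet layout and all function names are made up for illustration and are not the format the Coherence utility uses.

```python
import socket
import struct
import time

GROUP = "232.206.12.108"  # group address from the test output above
PORT = 24547              # port from the test output above
TTL = 4                   # ttl from the test output above

# Hypothetical packet layout: 4-byte sequence number + 8-byte send time (ms).
PACKET_FORMAT = "!IQ"
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)


def build_packet(seq: int) -> bytes:
    """Encode a test packet carrying a sequence number and the send time."""
    return struct.pack(PACKET_FORMAT, seq, int(time.time() * 1000))


def parse_packet(data: bytes) -> tuple:
    """Decode a test packet into (sequence number, send time in ms)."""
    return struct.unpack(PACKET_FORMAT, data)


def open_multicast_socket(group: str = GROUP, port: int = PORT, ttl: int = TTL) -> socket.socket:
    """Create a UDP socket that has joined the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Join the group on the default interface (0.0.0.0 lets the OS choose).
    membership = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.settimeout(2.0)
    return sock


def run_test(packets: int = 5) -> None:
    """Send numbered packets and print whatever arrives from any group member."""
    sock = open_multicast_socket()
    try:
        for seq in range(1, packets + 1):
            sock.sendto(build_packet(seq), (GROUP, PORT))
            print(f"Sent packet {seq}.")
            deadline = time.time() + 2.0
            while time.time() < deadline:
                try:
                    data, (sender_ip, _) = sock.recvfrom(64)
                except socket.timeout:
                    break
                if len(data) != PACKET_SIZE:
                    continue  # not one of our test packets
                got_seq, sent_ms = parse_packet(data)
                age = int(time.time() * 1000) - sent_ms
                print(f"Received test packet {got_seq} from {sender_ip} (sent {age}ms ago).")
    finally:
        sock.close()
```

Running `run_test()` on both servers at the same time should show packets from the other host on each side. Note that the output above only shows one direction (this host receiving from QAWMLOADIS4), so it is worth confirming the reverse direction as well.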

The following scenarios were tested:

Server 1 in cluster - works fine
Server 2 tries to join - server 2 fails and disables clustering

Then I repeat it the other way round by creating the cluster on server 2, which gives a cluster that server 1 can’t join.

I am licensed for clustering and distributed cache, so that shouldn’t cause problems.

OK, 8.2.2 then.

I take it you chose Coherence and not Terracotta during setup?

Also, did you check with the network folks, and has SAG support pointed you to any clustering fix?

Hi

Yes, I am using Coherence! SAG support is involved, but they haven’t been able to fix it.

Hello,

I have been trying to configure IS clustering on wM 8.2 SP2 (8.2.2.0) and am facing the same issue. Can you please let me know the status of the above issue? I will take the necessary action based on that.

Thanks,
Nanda.

What issue are you having with the cluster setup? Can you elaborate on the specifics?

Which OS is this on?

HTH,
RMG

Helge,

Any update/resolution from SAG support for you?

Hello rmg,

Please find a brief description of the issue below:

We have two physical servers, each hosting a single IS. These two ISs use the same IS Core, Process Engine and Internal JDBC pools. All the relevant clustering parameters are configured identically on both servers.

When clustering is enabled on these servers, the Admin page of either IS shows that the cluster has been created successfully with two ISs in it. However, after some time (maybe 30-45 minutes), the following error appears in one of the server logs (say server 2):


2012-04-23 11:13:05 BST [WmSharedCacheSC.impl.0121E] Failed to insert cache entry at key ‘10.246.160.77:9101’; reason: ‘Failed to start Service “Cluster” (ServiceState=SERVICE_STOPPED, STATE_ANNOUNCE)’
2012-04-23 11:15:36 BST [WmSharedCacheSC.impl.0121E] Failed to insert cache entry at key ‘10.246.160.77:9101’; reason: ‘Failed to start Service “Cluster” (ServiceState=SERVICE_STOPPED, STATE_ANNOUNCE)’


During this time, server 2 is unresponsive as well (the Admin page doesn’t load on the server that is throwing these errors).

Once this faulty server is restarted, the following errors are seen during startup (while loading the distributed cache and while loading the WmPRT package):


2012-04-23 11:16:56 BST [ISS.0025.0021I] ACL Manager started
2012-04-23 11:16:57 BST [ISS.0033.0168C] Cluster Node Name: server2.
2012-04-23 11:16:57 BST [WmSharedCacheSC.config.0595I] Reading cache configuration: distributedCache
2012-04-23 11:17:28 BST [WmSharedCacheSC.impl.0200E] Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-04-23 11:17:28 BST [ISS.0033.0161E] CLUSTERING DISABLED: Could not create or join distributed clustering cache “ClusterMembers”. This server is NOT a member of a cluster.
2012-04-23 11:17:28 BST [ISS.0033.0104I] Could not add server to the cluster. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-04-23 11:17:28 BST [ISS.0033.0103W] Could not open the session cache. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-04-23 11:17:28 BST [ISS.0025.0008I] State Manager started
2012-04-23 11:17:28 BST [ISS.0025.0010I] Service Manager started
2012-04-23 11:17:28 BST [ISS.0025.0020I] Validation Processor started
2012-04-23 11:17:28 BST [ISS.0025.0022I] Statistics Processor started
2012-04-23 11:17:28 BST [ISS.0025.0018I] Invoke Manager started
2012-04-23 11:17:28 BST [ISS.0025.0012I] Cache Manager started
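
One thing that stands out in the log above is the gap between "Reading cache configuration" and the first failure. The following Python sketch measures that gap from log lines. The regular expression is an assumption based only on the line layout seen in this thread (timestamp, time zone, bracketed message code ending in a severity letter), and the function names are hypothetical. The time zone is ignored, which is only safe when all lines come from the same server.

```python
import re
from datetime import datetime

# Timestamp, time zone, [message code + one severity letter], message text.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) \[([\w.]+\d)([A-Z])\] (.*)$"
)


def parse_line(line):
    """Return a dict for a log header line, or None for a continuation line."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, _zone, code, severity, message = match.groups()
    return {
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "code": code,
        "severity": severity,
        "message": message,
    }


def cache_start_delay(lines):
    """Seconds from 'Reading cache configuration' to the first cache failure."""
    started = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry["code"] == "WmSharedCacheSC.config.0595":
            started = entry["time"]
        elif entry["code"] == "WmSharedCacheSC.impl.0200" and started is not None:
            return (entry["time"] - started).total_seconds()
    return None


# Lines copied from the startup log above (message text shortened).
SAMPLE = [
    "2012-04-23 11:16:57 BST [WmSharedCacheSC.config.0595I] Reading cache configuration: distributedCache",
    "2012-04-23 11:17:28 BST [WmSharedCacheSC.impl.0200E] Failed to create/get distributed cache",
    "MemberSet=ServiceMemberSet(",
]

print(cache_start_delay(SAMPLE))  # 31.0
```

The first log in this thread shows the same 31 seconds (15:07:42 to 15:08:13), which fits the "Timeout during service start" in the exception: the node waits for the cluster service to start and then gives up.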


The other thing I have tried is to reconfigure the cluster on both servers and then restart. However, the cluster then runs for only a few minutes before it starts throwing the errors I mentioned earlier.

wM version: wM 8.2 SP2 (8.2.2.0)
OS: Red Hat Enterprise Linux Server release 5.7

Thanks again,
Nanda.

I do see some fixes to the Coherence config file listed on Empower for the previous 8.x releases. Did you check them, and maybe get SAG support involved for this fix?

Also, please check the KB article below (KB #: 1728115). It is for TNQuery but describes a similar error:

The cause of this is the distributed caching, and the following fix should resolve the issue. Open the “IntegrationServer\config\Caching\Tangosol-coherence-override.xml” file and override the multicast address settings as described below: <multicast-listener> <address>224.0.0.0</address> </multicast-listener>. Then restart the Integration Server.

HTH,
RMG

Please also involve your network team regarding the timeout, in case multicast IP or port related issues are causing it. I think SAG is best placed to release a fix for this issue.

Hello RMG,

Thanks for all your help with this issue. SAG has recommended that we install IS Core Fix1 (mandatory) and/or IS Core Fix2 (optional) as part of the resolution. However, the readmes for both fixes make no specific mention of clustering issues.

These have been installed and clustering seems to be working fine (running fine for approx. 20 hours).

Thanks,
Nanda.

Hey, did you solve this? I’m having the same problem :(

Do we need to start any server or something, or do we just need to configure the cluster in IS Admin under Clustering?

They are pointing at the same Broker and they are an exact copy of each other. I don’t really get it…

Luis,

It seems the above core fixes (for 8.2.2) resolved the clustering issue.

Nanda, are there any more issues you have noticed since you installed the fixes?

HTH,
RMg

Thing is, I’m using 8.2.2 SP2; supposedly those fixes are already included!

I have to check it.

A firewall issue can also cause this problem. A firewall being active on one system and inactive on the other is one such case. Did you run the multicast test and the datagram test?

-Senthil
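
As a rough illustration of what a datagram (unicast UDP) check between two hosts involves, here is a minimal Python sketch. It is not the vendor's datagram test; the function names, the echo behaviour and the payload are all assumptions made for the example. One side waits for a datagram and echoes it back, and the other side reports whether the echo arrived in time, which is the sort of round trip a one-sided firewall would break.

```python
import socket
import threading


def echo_once(sock):
    """Wait for one datagram and send it straight back to the sender."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def probe(host, port, payload=b"ping", timeout=2.0):
    """Return True if a datagram sent to host:port is echoed back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, (host, port))
        try:
            data, _ = sock.recvfrom(1024)
        except (socket.timeout, ConnectionError):
            return False
        return data == payload


def loopback_demo():
    """Run the responder and the probe in one process over 127.0.0.1."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    port = server.getsockname()[1]
    responder = threading.Thread(target=echo_once, args=(server,), daemon=True)
    responder.start()
    ok = probe("127.0.0.1", port)
    responder.join(timeout=2.0)
    server.close()
    return ok
```

For a real check, `echo_once` would run on one server (bound to that server's address and a chosen port) and `probe` on the other, then the roles would be swapped so both directions are covered.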

How do I perform such testing? Nothing is included in the docs. I’ve noticed I don’t have a license for distributed cache, although I have one for clustering. Maybe that’s the problem?

[Attachment: license.JPG]

I haven’t seen any issues after the installation of IS Core Fix1. The SAG support team refused to accept the issue without this Core fix on the server and the issue was not seen once this fix was installed.

IS Core Fix1 is “not” included in 8.2 SP2 (8.2.2). You need to download it specifically using the Update Manager. It contains fixes related to Deployer, Tomcat and other components.