webMethods 8 cluster

helge.heggli · February 3, 2012, 7:44pm

Hi

I am having issues setting up a wM 8 cluster. I have followed the guide and confirmed the multicast IP test. Everything looks good: I can create a cluster on both machines, but neither can join the cluster while the other server is up.

Has anyone here seen this error at startup and can provide some help?

2012-02-03 15:07:42 CET [ISS.0033.0168C] Cluster Node Name: QAWMLOADIS3.dev.site.
2012-02-03 15:07:42 CET [WmSharedCacheSC.config.0595I] Reading cache configuration: distributedCache
2012-02-03 15:08:13 CET [WmSharedCacheSC.impl.0200E] Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-02-03 15:08:13 CET [ISS.0033.0161E] CLUSTERING DISABLED: Could not create or join distributed clustering cache “ClusterMembers”. This server is NOT a member of a cluster.
2012-02-03 15:08:13 CET [ISS.0033.0104I] Could not add server to the cluster. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-02-03 15:08:13 CET [ISS.0033.0103W] Could not open the session cache. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-02-03 15:08:13 CET [ISS.0025.0008I] State Manager started

Tong_Wang · February 3, 2012, 9:21pm

It seems it can’t reach the distributed cache. In the clustering guide, check the section:
“What Happens When Integration Server Cannot Connect to the Distributed Cache?”

Also make sure your network allows multicast on the port you specified.

gupta_r.17495 · February 3, 2012, 10:08pm

Which IS 8.x version are you seeing this on? 8.2 SP2?

helge.heggli · February 6, 2012, 2:58pm

Hi, thanks for the replies. I have run the Update Manager, so my guess is 8.2 SP2; the admin page says 8.2.2.

I have run the multicast test between the servers and it turns out fine. Example of the test output:

Configuring multicast socket…
Starting listener…
Fri Feb 03 14:58:29 CET 2012: Sent packet 1.
Fri Feb 03 14:58:29 CET 2012: Received test packet 1 from self (sent 10ms ago).
Fri Feb 03 14:58:30 CET 2012: Received test packet 11 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
Fri Feb 03 14:58:31 CET 2012: Sent packet 2.
Fri Feb 03 14:58:31 CET 2012: Received test packet 2 from self
Fri Feb 03 14:58:32 CET 2012: Received test packet 12 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
Fri Feb 03 14:58:33 CET 2012: Sent packet 3.
Fri Feb 03 14:58:33 CET 2012: Received test packet 3 from self
Fri Feb 03 14:58:34 CET 2012: Received test packet 13 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
Fri Feb 03 14:58:35 CET 2012: Sent packet 4.
Fri Feb 03 14:58:35 CET 2012: Received test packet 4 from self
Fri Feb 03 14:58:36 CET 2012: Received test packet 14 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4.
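
A quick way to double-check output like the above is to count how many test packets arrived from each remote peer, ignoring the "from self" lines. The following is only a sketch: the regular expression is inferred from the lines pasted in this post, and the function name `summarize_peers` is made up for illustration. Note that a passing run on one host only shows that this host hears the other; the same check would need to be repeated in the opposite direction.

```python
import re
from collections import Counter

# Pattern inferred from the pasted lines, e.g.
# "... Received test packet 11 from ip=QAWMLOADIS4/192.168.180.212, group=/232.206.12.108:24547, ttl=4."
PEER_LINE = re.compile(
    r"Received test packet (?P<seq>\d+) from ip=(?P<host>[^/\s]*)/(?P<ip>[\d.]+), "
    r"group=/(?P<group>[\d.]+):(?P<port>\d+), ttl=(?P<ttl>\d+)"
)


def summarize_peers(output: str) -> Counter:
    """Count test packets received from each remote peer (hypothetical helper)."""
    # Heuristic: rejoin lines that a narrow console wrapped inside an IP address.
    joined = re.sub(r"(?<=\d)\n(?=\d)", "", output)
    peers = Counter()
    for match in PEER_LINE.finditer(joined):
        key = (match["host"], match["ip"], match["group"], int(match["port"]))
        peers[key] += 1
    return peers


sample = """Fri Feb 03 14:58:29 CET 2012: Sent packet 1.
Fri Feb 03 14:58:29 CET 2012: Received test packet 1 from self (sent 10ms ago).
Fri Feb 03 14:58:30 CET 2012: Received test packet 11 from ip=QAWMLOADIS4/192.16
8.180.212, group=/232.206.12.108:24547, ttl=4.
Fri Feb 03 14:58:32 CET 2012: Received test packet 12 from ip=QAWMLOADIS4/192.16
8.180.212, group=/232.206.12.108:24547, ttl=4.
"""

for (host, ip, group, port), count in summarize_peers(sample).items():
    print(f"{host} ({ip}) on {group}:{port}: {count} packets")
```

An empty result on either host would point at one-way multicast, which a firewall or switch setting can cause even when the other direction works.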

The following scenarios have been tested:

Server 1 in cluster - works fine
Server 2 tries to join - server 2 fails and disables clustering

Then I replicate by creating the cluster on server 2, which creates a cluster that server 1 can’t join.

I am licensed for cluster and distributed cache, so that shouldn’t create problems.

gupta_r.17495 · February 7, 2012, 12:09am

OK, 8.2.2 then.

I believe you chose Coherence and not Terracotta during setup?

Also, did you check with your network folks, and has SAG support given any pointers to a cluster fix?

helge.heggli · February 13, 2012, 1:41pm

Hi

Yes, I am using Coherence! SAG support is involved, but they haven’t been able to fix it.

Nanda_Reddy · April 23, 2012, 4:15pm

Hello,

I have been trying to configure IS clustering on wM 8.2 SP2 (8.2.2.0) and am facing the same issues. Can you please let me know the status of the above issue? I will take the necessary action based on that.

Thanks,
Nanda.

gupta_r.17495 · April 23, 2012, 7:34pm

What issue are you having with the cluster setup? Can you elaborate on the specifics?

Which OS is this on?

HTH,
RMG

gupta_r.17495 · April 23, 2012, 7:37pm

Helge,

Any update/resolution from SAG support for you?

Nanda_Reddy · April 23, 2012, 9:11pm

Hello rmg,

Please find brief description of the issue below:

We have two physical servers, each hosting a single IS. These two ISs use the same IS Core, Process Engine and Internal JDBC pools. All the relevant clustering parameters are configured identically on both servers.

When clustering is enabled on these servers, the Admin page of either IS shows that the cluster has been created successfully with two ISs in it. However, after some time (maybe 30-45 minutes), the following error appears in one of the server logs (say server 2):

2012-04-23 11:13:05 BST [WmSharedCacheSC.impl.0121E] Failed to insert cache entry at key ‘10.246.160.77:9101’; reason: ‘Failed to start Service “Cluster” (ServiceState=SERVICE_STOPPED, STATE_ANNOUNCE)’
2012-04-23 11:15:36 BST [WmSharedCacheSC.impl.0121E] Failed to insert cache entry at key ‘10.246.160.77:9101’; reason: ‘Failed to start Service “Cluster” (ServiceState=SERVICE_STOPPED, STATE_ANNOUNCE)’

During this time, server 2 is also unresponsive (the Admin page doesn’t load on the server that is throwing these errors).

Once this faulty server is restarted, the following errors are seen during startup (while loading the distributed cache and while loading the WmPRT package):

2012-04-23 11:16:56 BST [ISS.0025.0021I] ACL Manager started
2012-04-23 11:16:57 BST [ISS.0033.0168C] Cluster Node Name: server2.
2012-04-23 11:16:57 BST [WmSharedCacheSC.config.0595I] Reading cache configuration: distributedCache
2012-04-23 11:17:28 BST [WmSharedCacheSC.impl.0200E] Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-04-23 11:17:28 BST [ISS.0033.0161E] CLUSTERING DISABLED: Could not create or join distributed clustering cache “ClusterMembers”. This server is NOT a member of a cluster.
2012-04-23 11:17:28 BST [ISS.0033.0104I] Could not add server to the cluster. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-04-23 11:17:28 BST [ISS.0033.0103W] Could not open the session cache. com.webMethods.sc.caching.CachingException: Failed to create/get distributed cache ‘ClusterMembers’; reason: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
MemberSet=ServiceMemberSet(
OldestMember=n/a
ActualMemberSet=MemberSet(Size=0, BitSetCount=0
)
MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
)
); – check configuration
2012-04-23 11:17:28 BST [ISS.0025.0008I] State Manager started
2012-04-23 11:17:28 BST [ISS.0025.0010I] Service Manager started
2012-04-23 11:17:28 BST [ISS.0025.0020I] Validation Processor started
2012-04-23 11:17:28 BST [ISS.0025.0022I] Statistics Processor started
2012-04-23 11:17:28 BST [ISS.0025.0018I] Invoke Manager started
2012-04-23 11:17:28 BST [ISS.0025.0012I] Cache Manager started
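
One detail worth pulling out of logs like this is how long the server waited before giving up on the cluster. The sketch below measures the gap between the "Reading cache configuration" entry and the "CLUSTERING DISABLED" entry. The message codes are taken from the log above; the function name and the regular expression are illustrative assumptions about the line layout, not part of any product tooling.

```python
import re
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH:MM:SS TZ [MESSAGE.CODE] text"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+ \[([\w.]+)\]")

CACHE_CONFIG_READ = "WmSharedCacheSC.config.0595I"  # "Reading cache configuration"
CLUSTERING_DISABLED = "ISS.0033.0161E"              # "CLUSTERING DISABLED"


def join_timeout_seconds(log_text):
    """Seconds between reading the cache config and clustering being disabled, or None."""
    started = None
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if not match:
            continue  # continuation line of a multi-line message
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        code = match.group(2)
        if code == CACHE_CONFIG_READ:
            started = stamp
        elif code == CLUSTERING_DISABLED and started is not None:
            return (stamp - started).total_seconds()
    return None


sample_log = """2012-04-23 11:16:57 BST [WmSharedCacheSC.config.0595I] Reading cache configuration: distributedCache
2012-04-23 11:17:28 BST [WmSharedCacheSC.impl.0200E] Failed to create/get distributed cache
MemberSet=ServiceMemberSet(
2012-04-23 11:17:28 BST [ISS.0033.0161E] CLUSTERING DISABLED: Could not create or join distributed clustering cache
"""

print(join_timeout_seconds(sample_log))  # prints 31.0
```

The startup log in the first post of this thread shows the same 31-second gap (15:07:42 to 15:08:13), which would be consistent with the node waiting out a join timeout rather than being refused immediately.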

The other thing I have tried is to reconfigure the cluster on both servers and then restart. However, the cluster then runs for only a few minutes before it starts throwing the errors mentioned earlier.

wM version: wM 8.2 SP2 (8.2.2.0)
OS: Red Hat Enterprise Linux Server release 5.7

Thanks again,
Nanda.

gupta_r.17495 · April 23, 2012, 9:25pm

I do see some fixes to the Coherence config file listed on Empower for the previous 8.x releases. Did you check them, and maybe get SAG support involved for this fix as well?

Also please check the below (KB #: 1728115). It is for TNQuery, but the error is similar:

The cause of this is the distributed caching, and the following fix should resolve the issue. Open the “IntegrationServer\config\Caching\Tangosol-coherence-override.xml” file and override the multicast address settings as described below: 224.0.0.0 </multicast-listener> Restart the Integration Server.
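
The XML tags of the KB snippet did not survive in the text above; only the value 224.0.0.0 and a closing multicast-listener tag remain. To compare what each node actually has configured, a small script can print the multicast-listener settings from the override file. This is a sketch under assumptions: the sample document follows the usual Coherence operational override layout (coherence / cluster-config / multicast-listener / address), which may differ from what the KB article contained, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def multicast_settings(override_xml: str) -> dict:
    """Return the child elements of the first multicast-listener as a dict."""
    root = ET.fromstring(override_xml)
    for element in root.iter():
        if local_name(element.tag) == "multicast-listener":
            return {local_name(child.tag): (child.text or "").strip() for child in element}
    return {}  # no multicast-listener override present


# Assumed shape of an override file; only the address value comes from the post.
sample = """<coherence>
  <cluster-config>
    <multicast-listener>
      <address>224.0.0.0</address>
    </multicast-listener>
  </cluster-config>
</coherence>"""

print(multicast_settings(sample))  # prints {'address': '224.0.0.0'}
```

Running it against the file from both servers and comparing the two results is a cheap way to rule out a mismatch before restarting anything.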

HTH,
RMG

Nanda_Reddy · April 23, 2012, 10:54pm

Hello rmg,

Thanks for the above.

The issue has been raised with Software AG and I’m waiting to hear from them. In the meantime, I will try to see if the above fix resolves the issue.

Apart from this, my user account with Empower is under modification (due to organization changes), so I couldn’t take a look.

Thanks,
Nanda.


gupta_r.17495 · April 23, 2012, 11:02pm

Please also get your network team involved, since multicast IP or port related issues may be causing the timeout. I think SAG is best placed to release a fix for this issue.

Nanda_Reddy · April 30, 2012, 4:14pm

Hello RMG,

Thanks for all your help with this issue. SAG has recommended that we install IS Core Fix1 (mandatory) and/or IS Core Fix2 (optional) as part of the resolution. However, the readmes for these fixes do not specifically mention any clustering issues.

These have been installed and the clustering seems to be working fine (it has been running for approx. 20 hours).

Thanks,
Nanda.

Luis_Mesquita · May 16, 2012, 11:17pm

Hey, did you solve this? I’m having the same problem.

Do we need to start any server or something, or do we just need to configure the cluster on the IS Admin Clustering page?

They are pointing at the same Broker and they are an exact copy, so I don’t really get it…

gupta_r.17495 · May 16, 2012, 11:57pm

Luis,

It seems the above core fixes (for 8.2.2) resolved the clustering issue.

Nanda, have you noticed any more issues since you installed the fixes?

HTH,
RMg

Luis_Mesquita · May 17, 2012, 8:42am

Thing is, I’m using 8.2.2 SP2; supposedly those fixes are already included!

I have to check it.

Senthilkumar_G · May 17, 2012, 11:06am

A firewall issue can also cause this problem. The firewall being active on one system and inactive on the other is one such case… Did you execute the multicast test and the datagram test?

-Senthil

Luis_Mesquita · May 17, 2012, 3:22pm

How do I perform such testing? Nothing is included in the docs… I’ve noticed I don’t have a license for distributed cache, although I have one for clustering. Maybe that’s the problem?
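
For anyone without access to the vendor test utilities, a basic multicast reachability check can be sketched with plain UDP sockets. This is not the tool used earlier in the thread, only a minimal stand-in; the function names and the probe format are made up for illustration, and the group and port in the usage note are simply the values from the test output posted above.

```python
import socket
import struct


def make_probe(sender: str, seq: int) -> bytes:
    """Encode a probe; the sender name must not contain spaces."""
    return f"probe {sender} {seq}".encode("ascii")


def parse_probe(payload: bytes):
    """Decode a probe back into (sender, sequence number)."""
    kind, sender, seq = payload.decode("ascii").split(" ")
    if kind != "probe":
        raise ValueError("not a probe packet")
    return sender, int(seq)


def send_probe(group: str, port: int, payload: bytes, ttl: int = 4) -> None:
    """Send one datagram to the multicast group."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))


def listen(group: str, port: int, timeout: float = 10.0) -> None:
    """Join the group and print probes until none arrives within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        # ip_mreq structure: group address + local interface (0.0.0.0 = OS default)
        membership = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        sock.settimeout(timeout)
        while True:
            try:
                data, (ip, _port) = sock.recvfrom(1024)
            except socket.timeout:
                print("no probe received within timeout")
                return
            print(f"received {parse_probe(data)} from {ip}")
```

A possible way to use it: call `listen("232.206.12.108", 24547)` on one server and `send_probe("232.206.12.108", 24547, make_probe("server1", 1))` on the other, then swap roles. If packets arrive in only one direction, a firewall that is active on just one of the hosts is a likely suspect.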


Nanda_Reddy · May 17, 2012, 4:38pm

I haven’t seen any issues after the installation of IS Core Fix1. The SAG support team refused to accept the issue without this Core fix on the server and the issue was not seen once this fix was installed.

IS Core Fix1 is “not” included in 8.2 SP2 (8.2.2). You would need to download it specifically using Update Manager. It has some fixes related to Deployer, Tomcat and other components.