I have never been a fan of the IS software cluster. The fact that it is called a cluster implies some functionality that is just not there, namely failover, except in the rare case in which you have written a custom Java client that uses the TContext class instead of the Context class.
Load balancing when the workload of IS comes mostly from HTTP/S traffic is a no-brainer using a hardware load balancer or content switch. However, balancing the workload when it arrives in the form of messages retrieved from legacy message-oriented middleware systems or JMS queues/topics is a bit more challenging.
What have you found to be the right approach in those situations?
For legacy MOM facilities, as with Broker, there isn’t dynamic load balancing available AFAIK. We’re limited in our choices to essentially statically segregating traffic in some way (e.g. order management via this queue/broker, financials on this other one, etc.) to spread the work. Failover is generally provided by OS clustering and shared storage that is mounted only by the active instance.
Perhaps others have experienced or are aware of other approaches?
Ah, I was looking at it from the wrong perspective. What I was describing was LB and failover for the MOM itself, not for the IS instances connecting to it. My bad. There is indeed a way to distribute the work among multiple IS instances when interacting with MOM facilities.
We did this only with MQ Series. It may work similarly with other facilities but I don’t know that for sure.
One can configure multiple IS instances to connect to the same queue/queue manager. Only one of the IS instances will get a message at a time, much like what is done in the interaction with Broker. Thus, one can achieve failover and some degree of load balancing on the IS side when interacting with MQ.
We had 6 IS instances all connected to a single MQ manager. We could take down any of the 6 for any reason and not lose anything. MQ manager, however, was a single point of failure, as at the time there was no facility to cluster that (other than OS cluster and shared storage as mentioned earlier).
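As a rough illustration of the arrangement described above (several instances reading from one queue, each message going to exactly one of them), here is a minimal in-memory sketch. It is not MQ or IS code: the in-process queue stands in for the MQ queue, and `run_consumers` and the instance names are made up for the example.

```python
import queue
import threading


def run_consumers(messages, instance_names):
    """Hand each message to exactly one of several competing consumers.

    Returns a dict mapping instance name -> list of messages it handled.
    """
    shared = queue.Queue()  # stands in for the single shared MQ queue
    for message in messages:
        shared.put(message)

    handled = {name: [] for name in instance_names}

    def consume(name):
        # Each "instance" pulls until the queue is empty. A message that one
        # consumer has taken is never seen by the others.
        while True:
            try:
                message = shared.get_nowait()
            except queue.Empty:
                return
            handled[name].append(message)

    threads = [threading.Thread(target=consume, args=(n,)) for n in instance_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Running it with fewer instance names models taking an instance down: the remaining consumers simply drain the same queue. The queue itself is still the single point of failure, as in the post above.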
But isn’t webMethods stating the ‘PRT’ needs to be in a cluster to ensure documents are not processed multiple times?
I think I understand what you are saying, you trust the multiple IS instances to play with the shared queue nicely…
But what happens if an IS instance picks up an event, doesn’t ack, and crashes… isn’t another IS instance going to pick up that same event? Then if/when the first IS restarts, it will process the event, creating a duplicate.
Or am I totally missing something?
I, like Mark, am not happy with the webMethods use of ‘cluster’ as an option for the IS instances - if anything it should be ‘cluster-like’, and that would still be wrong…
We have an IS ‘cluster’ implemented, and are trying to architect in all the missing features… Breaking the cluster would help resolve some issues, but it raises concerns regarding above…
“…the ‘PRT’ needs to be in a cluster to ensure documents are not processed multiple times?”
I don’t recall seeing anything along those lines in the docs. Is there something you’ve seen that you could point me to?
The scenario you describe “shouldn’t” occur in either an IS cluster or non-clustered environment. This is because the Broker manages how it hands out the events. The Broker has no notion about IS clustering. Doesn’t know about it and doesn’t care. If an IS instance picks up a doc, doesn’t ack and crashes, the Broker sees that the connection is lost. It is then free to hand that doc to another IS instance.
The original instance won’t see that doc again (at least not from the Broker) since it was processed by another instance. When the original instance is restored, it will do nothing with the original doc. For the PRT, since it doesn’t automatically recover activity that was in flight (it instead relies on a document coming from the Broker) the risk of dupes is reduced.
Any type of clustering actually increases the chance of duplicate processing (even the wM clustering guide points this out). In the scenario you describe, the risk of dupes comes when the first instance successfully completes everything except sending the ack to the Broker before crashing. In this case, two instances will appear to have processed the doc. That’s why additional dupe checking is sometimes warranted, depending on the specific integration.
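To make the "additional dupe checking" idea concrete, here is a minimal sketch of an idempotent handler. It assumes every document carries a unique ID, and it uses an in-memory set where a real deployment would need a store shared by all instances. `DedupingHandler` and its fields are hypothetical names, not part of any webMethods API.

```python
class DedupingHandler:
    """Apply each document at most once, keyed by its document ID."""

    def __init__(self):
        # Assumption: in a real setup this would be a durable store visible
        # to every IS instance (e.g. a database table), not a local set.
        self.seen_ids = set()
        self.applied = []

    def handle(self, doc_id, payload):
        """Return True if the payload was applied, False if it was a duplicate."""
        if doc_id in self.seen_ids:
            return False
        self.applied.append(payload)  # the actual business work
        self.seen_ids.add(doc_id)
        return True


def deliver_with_lost_ack(handler, doc_id, payload):
    """Simulate: first instance finishes the work but crashes before the ack,
    so the broker hands the same document out a second time."""
    first = handler.handle(doc_id, payload)   # work done, ack never sent
    second = handler.handle(doc_id, payload)  # redelivery to another instance
    return first, second
```

With the check in place the redelivered document is recognized and skipped, so the work is applied once even though two deliveries happened.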
I’m okay with the use of the term “cluster.” It provides some facilities–we just have to make sure we understand what those facilities do and don’t do.
I have to admit, I have never seen my PRT statement documented anywhere… Your question got me thinking, and I guess somewhere, somehow, I was left with the impression that was a true statement. Guess I will have to follow up with wM to be sure.
Your comments about not clustering IS instances are really starting to interest me.
But one of your comments in your original post needs some clarification: ‘It is needed when multiple IS instances use the same Broker client prefix to connect to a single queue and the trigger has join conditions. This is also a rare case.’
So the key point above is the join condition in the trigger? Otherwise, multiple IS instances can use the same Broker client prefix?
Correct. Multiple IS instances using the same Broker client prefix can coexist just dandy without using IS clustering. IS clustering becomes necessary if any triggers use a join. This is because the first document of a join is persisted to the repository. The subsequent document(s) in the join may get routed to any of the IS instances that use the same Broker client prefix. The receiving IS instance needs access to the prior document(s), which are stored in the repo, so all the IS instances need to use the same repo, which can only be done using IS clustering. (Whew, that was lengthy.)
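The join problem can be shown with a toy model. This is only a sketch of the reasoning above, not of how IS actually stores join state: a plain dict stands in for the repository, and `Instance`, `receive`, and the document types "A" and "B" are invented for the example.

```python
class Instance:
    """A toy IS instance whose trigger joins document types A and B."""

    def __init__(self, name, repo):
        self.name = name
        # repo maps a join key -> set of document types seen so far.
        # Whether instances share this dict is the whole point of the sketch.
        self.repo = repo

    def receive(self, join_key, doc_type, required=frozenset({"A", "B"})):
        """Record a document; return True when the join is satisfied."""
        seen = self.repo.setdefault(join_key, set())
        seen.add(doc_type)
        if required <= seen:
            del self.repo[join_key]  # join complete, clean up
            return True
        return False
```

If `is1` and `is2` each get their own dict, document A landing on `is1` and document B landing on `is2` never complete the join. If both are constructed with the same dict, the second document completes it regardless of which instance receives it.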
I’m mainly interested in the question of failover. In the white paper on high availability provided by webMethods, it does look clear to me that IS clustering alone provides an inadequate solution, since it “does not provide built-in failover for inbound requests from partners without webMethods IS.” This would be all of our partners. Relying on 3rd party failover/load balancing devices alone, however, seems to create the opposite problem: no failover for outbound services. The paper also notes that “services might accidentally get executed more than once for partners using guaranteed delivery” but I’m not sure why that is. So it seems like clustering may be necessary, if not sufficient, for failover. Since I’ve been tasked with developing a failover solution for our IS configuration, I’d be interested in hearing more opinions on this. Thanks,
IS clustering is very specific about what it provides for failover–it is provided only to clients that implement Context and TContext classes, which are used for “inbound” processing (page 14 of the clustering guide). There is no other direct support provided.
Even in the Context/TContext case, the servers don’t really do anything–it’s up to the client to notice that the IS instance it was talking to disappeared and to try again with another instance. This sort of “notice the failure and try again” approach is exactly what generic HTTP/FTP clients can do with an LB cluster, so there really is no need to use Context/TContext–just use HTTP/FTP with some retry logic.
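A minimal sketch of that "notice the failure and try again" logic on the client side follows. The transport is passed in as a function so the example stays self-contained; `send_with_failover`, the `send` callable, and the endpoint URLs in the usage note are all assumptions for illustration, not a real IS client API.

```python
def send_with_failover(endpoints, send, attempts_per_endpoint=1):
    """Try each endpoint in order until one accepts the request.

    `send` is any callable taking an endpoint and returning a response;
    it is expected to raise ConnectionError when the instance is unreachable.
    """
    last_error = None
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return send(endpoint)
            except ConnectionError as exc:
                last_error = exc  # remember the failure, move on
    raise ConnectionError("all endpoints failed: %r" % (endpoints,)) from last_error
```

For example, `send_with_failover(["http://is1.example/invoke", "http://is2.example/invoke"], my_http_post)` would fall through to the second instance if the first refuses the connection. Note that a retry after a lost response can deliver the same document twice, so the receiving service may still need a duplicate check.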
Failover for outbound services can be done in a couple of ways. It depends upon how your processes are initiated, how they function, and what components you’re using.
If your outbound activities are kicked off via a scheduled task, then you get a degree of failover since the scheduler is IS cluster aware. The process must be designed such that if an IS fails in the middle of processing, the next run will either pick up where it left off or, more simply, just run the whole process again. Use of TN can help in this regard. The checkpoint/restart services can also help (chapter 6 of the clustering guide).
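The "pick up where it left off" design can be sketched generically. This is not the webMethods checkpoint/restart service API; `run_batch` is a made-up name, and the `checkpoint` dict stands in for whatever durable, shared storage the real process would use.

```python
def run_batch(items, process, checkpoint):
    """Process items in order, recording progress after each one.

    `checkpoint` is a dict holding the index of the next unprocessed item
    under the key "next". Assumption: in practice this state lives in
    durable storage that survives a crash and is visible to the instance
    that picks up the next scheduled run.
    """
    start = checkpoint.get("next", 0)
    for index in range(start, len(items)):
        process(items[index])
        checkpoint["next"] = index + 1  # only advance after success
```

If `process` fails partway through, the next scheduled run calls `run_batch` with the same checkpoint and resumes at the item that failed. The item in flight at the time of the crash may be processed twice, so `process` should tolerate a repeat.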
Hope this helps. With a bit more detail about your processes we may be able to provide additional general guidance.
Has anyone come across a software based open-source load balancer?
Wanted to see if there is one out there. That way we could, as wM recommends, use a separate load balancer instead of the one available with IS 6.5.2.
Cookies are used for session management. Session management is needed if a given interaction between the client and server requires multiple exchanges. Most often, however, the client and server interaction is “here’s a doc” from the client, “got it” from the server. That is a single exchange in which neither session management nor affinity is necessary (a stateless service).
For cases where there are multiple interactions in a single session, cookies or Context/TContext classes can be used. When using cookies, you’re right that the client must retain the cookie and pass it back; otherwise a new (redundant) session will be started on the server.
I’ve seen affinity needed when using https/ssl for the ssl handshake. In this case affinity is based on IP, not cookies.
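For readers unfamiliar with IP-based affinity, here is one simple way a balancer could implement it: hash the client address and use the result to pick a backend, so the same client always lands on the same instance. This is only a sketch of the idea; real load balancers have their own algorithms, and `pick_backend` and the backend names are hypothetical.

```python
import hashlib


def pick_backend(client_ip, backends):
    """Map a client IP to one backend, always the same one for a given IP."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and wrap it onto the list.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

The mapping is stable only while the backend list stays the same; adding or removing an instance reshuffles some clients, which is acceptable for stateless services but breaks sessions in progress.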
I hadn’t seen anything along those lines in the docs. The behavior of multiple clients connecting to the same queue and order being retained has been around in Broker well before IS ever became a Broker client. Can you elaborate on the “strange things?”
Yes, and if you’re using a hardware LB (and no IS cluster) you must take care to configure it correctly, so that requests carrying a cookie from one server do not end up at the other. This may be done by configuring cookie-based affinity on the LB, or just by using HTTP/1.1 features: putting many requests in one HTTP connection. For the latter you’ll need a steady flow of requests.
“Strange things” consist of:
locking the queue for a long time (sometimes the queue was locked for an hour or two, and then, with no intervention, it became unlocked and started working)
core-dumps of Broker Server
But in my opinion, serial triggers are rarely (or even never) needed. They might be a sign of bad design, as this is serialization at a very high level: on one hand it radically impacts performance, and on the other it leaves many unguarded ways of breaking the serialization (e.g. invoking via SOAP or from Developer). I always prefer fine-grained serialization on the critical resource (e.g. rows in the database), so that there shouldn’t be any other way to break the needed serialization.
 - One thing I’ve never checked (as I don’t use serial triggers :-)): does resubmitting a service from WmMonitor that was invoked by a serial trigger break the serialization (as there may be two threads invoking the service)?
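The difference between serializing everything and serializing per resource can be sketched with one lock per key. This is an in-process analogy only: across several IS instances the lock would have to live in the shared resource itself (for example a database row lock), and `KeyedLocks` and `apply_updates` are names invented for this example.

```python
import threading
from collections import defaultdict


class KeyedLocks:
    """One lock per resource key, so unrelated resources never block each other."""

    def __init__(self):
        self._guard = threading.Lock()            # protects the lock table
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, key):
        with self._guard:
            return self._locks[key]


def apply_updates(updates, balances, locks):
    """Apply (account, amount) updates concurrently, serialized per account."""

    def work(account, amount):
        # Only updates to the same account wait for one another.
        with locks.lock_for(account):
            balances[account] = balances[account] + amount

    threads = [threading.Thread(target=work, args=u) for u in updates]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balances
```

Because the lock is taken inside the code that touches the resource, every caller goes through it no matter how the service was invoked, which is the property the post above is after.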
Thanks for starting a great discussion thread. Personally, I am not happy with how webMethods has implemented clustering.
A few weeks ago, one of the servers (P1) in our production cluster choked. All connections between this server and the RI server were lost. The other server (P2) was not doing so well either. It had only 3 of its 10 connections left. We decided to restart P1, establish connections with the RI server, add it back into the LB, drain P2, restart P2, establish connections with the RI server and add that back into the LB.
However, after we re-started P1, we were not able to add the connections to the RI server. After several tries, the solution was to remove the clustering between the machines. After we did that, we were able to create the reverse invoke connections between P1 and RI.
Moreover, we were told that we had to restart BOTH the servers in order to re-establish the cluster connections!
Priceless discussion thread, in my view. You were absolutely right about a set of non-clustered IS instances operating with the same client prefix. We recently encountered an issue with PRT documents. In our scenario, when a process execution moved from one logical server to another, the status control document and the Process Transition document (which together comprise the join in the trigger) were picked up by different servers. The model simply hung with the state marked as ‘STARTED’. To me, it looks like we definitely have to enable webMethods clustering and make the servers share the same repository.
Many thanks for the detailed information given in this thread.
Glad the thread has been helpful. That’s what wMUsers is all about!
Thanks for the scoop on the PRT activity. I did not know that there was a join condition for normal PRT activity. This would jibe with what Ray Z. was saying about the PRT needing an IS cluster. Have you seen anything in the docs about this? Would love to see that to further my understanding.