Hardware Clustering/LB using BigIP

I know there are already plenty of topics (I am still in the process of reading them all) but I would like to get some more technical experience with hardware clustering and webMethods 6.1.

For now, we have available:

  • 1 to many IS 6.1 SP2 (proxy) as frontend, and 1 to many IS 6.1 SP2 (executing) as backend. Backend is using Scheduler, Monitor, Broker and PRT.
  • 1 Broker 6.1 SP3
  • 1 LTM1500 (F5, formerly known as BigIP) hardware load balancer
    All of this runs on a Red Hat Linux 3 box.

We want to implement high availability that eliminates single points of failure, except for:

  • database (considered HA)
  • NFS file system (considered HA as well)

How should we implement the HA option, given the tools available, to support:

  • LB of frontend (optional but recommended)
  • LB of backend (required)
  • LB of Broker (I know only OS clustering is supported, but how do we implement it?)

Questions about this:

  • what are the pros/cons of using NFS shares to share data between clustered IS instances?
  • how to keep scheduled services synchronized between the IS in the cluster?
  • does hardware clustering impact the functional behavior of PRT/BPMs in Monitor? More generally, what Broker impacts are associated with clustering?
  • should we share DB or file repo between load balanced IS? (in the backend)
  • any known impacts you ran into, that we could learn from before experiencing the same issues?

If you have any evolved scripts for configuring the BIG-IP LB, please share them with us as well. For instance, how do you make sure the IS is “healthy” (at a higher ISO layer than 3)? Is it feasible to implement a routine check that verifies IS is up, Broker is up, the file system is available, etc., to identify the readiest IS available to take incoming requests?
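One way to approach that "readiest IS" question is an external monitor that aggregates several layer-4/layer-7 probes into a single pass/fail the load balancer can poll. Here is a minimal sketch, assuming hypothetical host names, ports, and mount paths (none of these come from the thread):

```python
import os
import socket

def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def fs_check(path):
    """Return True if the path (e.g. the shared NFS mount) is writable."""
    return os.access(path, os.W_OK)

def health_status(checks):
    """Run all named checks; the node is healthy only if every one passes.

    `checks` maps a label to a zero-argument callable returning bool.
    Returns (healthy, failures) so the monitor can log what broke.
    """
    failures = [name for name, check in checks.items() if not check()]
    return (not failures, failures)

# Example wiring -- all host names, ports, and paths are placeholders:
checks = {
    "is_http": lambda: tcp_check("backend-is", 5555),   # IS primary port
    "broker":  lambda: tcp_check("broker-host", 6849),  # Broker port
    "nfs":     lambda: fs_check("/mnt/shared"),         # shared mount
}
```

A script like this can back an external monitor on the BIG-IP, so an IS that is up at the TCP level but has lost its Broker or file system is still pulled out of the pool.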

Thanks for sharing your experiences!

Side note: F5 wasn’t formerly known as BigIP. BIG-IP is a product from the company named F5 Networks.

My recommendations:

Use BIG-IP LB for both front-end and back-end IS groups. Be aware that you may need to configure the BIG-IP for some sort of server affinity depending on your integrations.

Use OS cluster (active/passive) for Broker Server. It’s not likely that you’ll need multiple load-balanced Brokers to handle your volume, though I may be wrong. Don’t try to LB Broker Server. See this discussion.

For sharing data on an NFS mount, be sure that you have a mechanism in place for file-locking.
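As a sketch of such a mechanism (not something described in this thread): POSIX record locks (`fcntl`) are forwarded to the NFS lock daemon, whereas plain `flock()` locks historically are not, so `fcntl`-style locking is the safer choice on an NFS mount. Unix-only, and advisory, so every cooperating instance must use it:

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def locked_file(path):
    """Hold an exclusive POSIX record lock on `path` while the block runs.

    Advisory only: it protects the file solely against other processes
    that also acquire the lock before touching the file.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the lock is granted
        yield fd
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)

# Usage: only one cooperating process at a time enters the critical section.
lock_path = os.path.join(tempfile.gettempdir(), "demo.lock")
with locked_file(lock_path) as fd:
    os.write(fd, b"owned\n")
```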

For scheduled tasks, you’ll either need to use IS clustering (which I usually don’t recommend) or some other mechanism (e.g. AutoSys) for running jobs that are supposed to run on just one instance at a time.

For the DB and the repo (which implies you’re using IS clustering), you should most certainly share them amongst the IS instances.

Each IS instance should have the exact same configuration (except for properties related to cluster and scheduled tasks) and packages loaded. They should have the same adapters pointing to the same resources.

Scheduled tasks and polling notifications need special attention. If you’re using IS clustering (not its load balancing), then you’ll be okay. If you’re not, then for scheduled tasks that should have just one running at a time in the cluster, you’ll need to establish one of the instances as the scheduler IS and suspend the “cluster” tasks on all the others. The same goes for polling notifications.

Thanks Rob. Scheduled tasks are indeed an issue, as most of our integration is batch-based, usually kicked off by the IS scheduler. Even though I am not happy with its level of user-friendliness (i.e. please give me a way to list all executed/failed jobs between 4am and 9am), I don’t see a window in the short term to replace it with another third-party tool. Not to mention that this tool would need to be highly available too.

  1. Is there any alternative for sharing job status between IS instances other than IS clustering?

  2. I understand that we can use IS clustering (the webMethods clustering solution) with our BIG-IP load balancer in front. Is there a way in IS to specify NOT to attempt load balancing?

  3. How does PRT react when 2 non-“IS clustered” instances share the same Broker (using the same prefix)? Isn’t there a risk that a BPM instance is started on one IS but continued on the other (causing both halves to fail)? Or is this prevented simply by sharing the ProcessAudit schema between both instances? I believe I read something about a PRT issue when JOIN conditions are involved in an IS trigger, which seems to be the case when Modeler generates a trigger for a given BPM. Can you elaborate on this?

  4. Based on your experience, how long did implementing clustering (including hardware clustering) take on the projects you’ve worked on?

Thanks for sharing your own concrete experiences with us!

Not anything out of the box that I’m aware of. You could of course roll your own facilities to keep track of scheduled task execution.

Yup. Just don’t set up the load balancing config portion of IS clustering and you’re good to go.

I remember the discussion about subscription JOINs in the PRT but I don’t think we concluded anything concrete. It would be good to confirm the behavior with wM before relying on anything in the forums.

It depends on the skills and background of the OS system admin more so than anything. With someone who has set it up before, it can take just a couple of hours, if even that long, to set up a Broker Server cluster (assuming hardware is available). It’s not all that involved a process.

Given our described needs (basically failover of IS, including support for outgoing transactions), would you recommend using IS clustering (even if you usually recommend against it)? What are the main concerns and issues we should be ready to face when using IS clustering (again, for failover, not LB)?

We will have mostly BPM processes (Monitor/PRT) running on the IS, so interacting with Broker, and most of these processes will be kicked off by scheduled tasks. Is there any other way to implement failover of the IS that you would recommend, ideally involving no (or almost no) downtime?

IS clustering can help with scheduled tasks but other than that IS clustering doesn’t help with outbound failover. That’s something you’ll need to address yourself. One approach is to use DB tables as an outbound tracking queue which multiple IS instances process.

If you use IS clustering, make sure you cover performance and failover of the repository server. It’s a key component and early on was a good source of cluster misbehavior.

Integrations using the PRT do not need IS clustering. The IS instances can share the same PRT tables. This thread has a discussion about whether or not the PRT uses JOINs for normal operation, which would mean IS clustering would need to be used. The consensus seems to be that it does but there isn’t clear documentation either way.

The IS interaction with Broker is completely unaffected by the use or non-use of IS clustering. JOINs are purely an IS notion; the Broker has no clue about them.

The best high-availability approach I’ve seen is what I’ve described earlier. Clients that interact with IS must use a “positive-ack or retry” approach. Meaning, when a client submits a document to IS it must wait for a positive response of at least a “got it” indication. If it doesn’t get one, it should retry. Your integrations on IS should behave the same way when talking to other systems.
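The "positive-ack or retry" client behavior described above can be sketched as a small wrapper; the `send` callable and document names below are placeholders for whatever transport the client uses. Note the consequence: delivery becomes at-least-once, so the receiving side must de-duplicate on a document ID.

```python
import time

def submit_with_ack(send, doc, retries=3, backoff=0.0):
    """Submit `doc` via `send` until a positive 'got it' ack comes back.

    `send` is any callable returning True on an acknowledged receipt.
    Transport errors are treated the same as a missing ack: retry.
    Returns True once acked, False if all attempts are exhausted.
    """
    for attempt in range(retries):
        try:
            if send(doc):
                return True
        except Exception:
            pass  # no ack; fall through to the retry
        time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
    return False
```

With this in place, a failed IS instance just means the client's next retry lands on another pool member via the BIG-IP.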