Network Loadbalancer and wM...

Has anyone tried to ‘improve’ availability of brokers and/or IS instances via use of a network (not IS) loadbalancer??

Basically, have either two brokers or two IS instances configured identically and have a network loadbalancer in front, directing inbound traffic to the ‘up’ instance.

I am looking to do both, have brokers and IS instances ‘behind’ a network loadbalancer…

What issues did you encounter? How did you overcome them? Would you do it again? :wink:

Thanks!
Ray

Sorry… Forgot this topic was part of this thread: http://wmusers.com/forum/showthread.php?t=8240

Somehow we took a turn in the above thread into a db discussion early on… :o

Yes, I’ve had a couple of projects where an external LB device was used to LB IS instances. In one case, 3DNS and BigIP was used to balance traffic between two data centers, each with 3 IS instances (6 total). Another case used the Cisco Content Services Switches to LB 2 instances. Both worked just dandy, providing the ability to take instances off-line without disrupting service. Tuning the “stickiness” of sessions was needed for both. Would recommend this approach always.

Connections to the Broker can’t be effectively load-balanced. When an IS connects, it keeps the connection. OS active/passive failover is the only real option here, though I think 6.5 has introduced some more facilities.

Rob… Cool!

What kind of pipe was available between the data centers? :eek:

Your customers must have big bucks! :wink:

I didn’t think there was an option with the brokers, I have been thinking about ‘how’ on and off for a few years… but 6.5 maybe, eh? Guess I have to do some reading… :mad:

Thanks!

Ray

The first one is a Fortune 50 company (top-half of that actually). Not sure about the connectivity between the DCs. This was an IS/TN installation (for this project anyway). No Broker. Used Oracle for the TN DB. HA attempted for that but not achieved (an attempt at RAC but they couldn’t get it to work right to be able to take a DB node down without interrupting service–but it’s been a couple of years now so I think they may have it now).

The Broker is usually pretty solid so using OS HA for that has been effective. For scaling, it’s always been the case that you need to segment the solution space–e.g. supply chain there, finance over there, etc. Even when software LB support is added to Broker (again, I think 6.5 has some facilities) it’s probably prudent to segment things anyway to help keep the risk of dupes down. As always, architecting the infrastructure is a series of trade-offs.

Rob,

Can you direct me to the 6.5 Docs that might have new info on Broker capabilities?

I scanned the indexes of:
webMethods_Installation_Guide_6.5_and_6.5.0.5_and_6.5.1.pdf
webMethods_Broker_Administrators_Guide_6.5.pdf

And nothing seems to jump out about new features… I couldn’t find an ‘HA’ guide on Advantage. I’m guessing the info is in that one…

Ray

I looked in the readmes and didn’t find anything either. I think it must be something that’s coming soon.

The HA guide is only available via an engagement with wM Professional Services. They started doing that some time back. Guess they got bitten a few times by people doing it on their own but it isn’t overly difficult. Check with your account rep or TAM and they should be able to set you up.

Rob,

Can you elaborate on tuning the “stickiness” of sessions? Is this only necessary when you are using stateful sessions in a conversation? If you are doing stateless request/reply, would you need to use session stickiness?

-Adam

In one case, we needed to set session affinity to allow the SSL handshake time to complete: connections from a single IP were routed to the same load-balanced instance for a period of time regardless of traffic.

In the other case, it was indeed for stateful sessions where multiple connections for a single IS session were made. For single interactions, such as a synchronous request/reply, stickiness would not be required.

The key here is to have a test bed in which you can verify behavior of the LB device and that it’s configured as needed.
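
To make the “period of time” affinity concrete, here is a minimal sketch of a round-robin balancer that pins a source IP to one instance for a configurable window. It is only an illustration, not how BigIP or the CSS implement it: the class name, the backend names, and the choice to refresh the window on every connection are all assumptions.

```python
import time
from typing import Dict, List, Optional, Tuple


class StickyBalancer:
    """Round-robin balancer with time-limited source-IP affinity (illustrative only)."""

    def __init__(self, backends: List[str], sticky_seconds: float) -> None:
        self.backends = backends
        self.sticky_seconds = sticky_seconds
        self._next = 0
        # client IP -> (backend, time at which the affinity entry expires)
        self._affinity: Dict[str, Tuple[str, float]] = {}

    def route(self, client_ip: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        entry = self._affinity.get(client_ip)
        if entry is not None and now < entry[1]:
            backend = entry[0]  # still inside the sticky window: reuse the backend
        else:
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
        # Assumption: the window restarts on every connection; real devices vary.
        self._affinity[client_ip] = (backend, now + self.sticky_seconds)
        return backend
```

With `sticky_seconds=0` this degrades to plain round-robin, which is all a single synchronous request/reply needs.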

Issue is when you finish your summary and realize between your customers and your Management, you have to make one unhappy… :sad:

I usually try not to refer to internal groups as “customers.” That tends to set up a master/slave (or payer/payee if you prefer) relationship when in fact it is a team effort that should focus on the customers of the company. When approached that way, usually the members of the team recognize the overall benefit and are oft more willing to accept “local” pain for the greater good.

But that’s another topic for another thread perhaps…

We run IS load balanced behind a Foundry Load Balancer (previously behind a Cisco CSS)… sticky is off, and we don’t use cookies - all of our clients are configured so that if they don’t get an ACK, they resend the document, so we don’t send the ACK until we have the document stored (would be a different world of hurt if our inbound connections were send and forget). We have three servers so that in case we need to take out one, we still have a pair of servers running. Our environment consists of three independent IS 6.0.1 servers; aside from sharing a SQL Server Cluster, the servers have no knowledge of each other. It is nice to not have to become unavailable to apply a system patch or even to replace an entire server - it is also nice to be able to sneak a new server into the “Load Balanced Farm” without having to change configurations on the other servers.
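
The “no ACK until stored, resend on no ACK” contract can be sketched roughly as below. This is a toy model, not our IS code: the `Receiver` class, the in-memory dict standing in for the shared database, and the use of a document ID to make resends harmless are all assumptions for illustration.

```python
from typing import Callable, Dict


class Receiver:
    """Acknowledges a document only after it has been stored."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}  # stands in for the shared database
        self.fail_next = 0               # simulate this many storage failures

    def handle(self, doc_id: str, body: str) -> bool:
        if self.fail_next > 0:
            self.fail_next -= 1
            return False                 # not stored, so no ACK is sent
        # Keying on doc_id (an assumption) makes a resend of a stored document harmless.
        self.store[doc_id] = body
        return True                      # ACK only after the store succeeded


def send_with_retry(send: Callable[[str, str], bool], doc_id: str, body: str,
                    max_attempts: int = 5) -> int:
    """Resend until an ACK arrives; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if send(doc_id, body):
            return attempt
    raise RuntimeError(f"no ACK for {doc_id} after {max_attempts} attempts")
```

Because the sender keeps retrying, it does not matter which of the three servers picks up the resend, which is why stickiness can stay off.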

Each server today is Windows, with 2 CPUs and 3.5GB of RAM. They are all live and hosted in the same subnet… it should also be noted, our IS server is nearly 100% of our application - receive/complete/route/respond/etc…

If it weren’t for a plague of network device failures between our IS servers and the DB cluster (even redundant network environments can fail terribly)… that is, if we subtract all of our downtime caused by the device failure and network issues between our db server and IS, we would be operating at an average of 5 9’s (yes, that’s 99.999%) for the past three years… as it is we’re still operating at an average of 99.8% over the past three years. This includes an entire subnet migration, IS upgrades from 4.0.1 to 6.0.1, OS upgrades, server replacements (including a db migration to new hardware).
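
For a sense of scale, the gap between those two figures is easy to work out. The helper below is just arithmetic (the function name is made up, and it ignores leap years):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


print(downtime_hours_per_year(99.8))         # about 17.5 hours per year
print(downtime_hours_per_year(99.999) * 60)  # about 5.3 minutes per year
```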

Since we recently moved from the Cisco CSS to the Foundry ServerIron Load Balancers - we’re looking at leveraging some of the Layer 7 features in the Foundry, such as the ability for the LB to read content from an HTML page and balance traffic based on the content of that page - e.g. we could have a dynamic page display HTML of the resource utilization of our three servers - if one of them were particularly stressed, we could direct traffic to the other two servers (weight the traffic). With the Cisco (and how we currently run) - healthchecks are performed at Layer 4 (port 443 listening? yes?? server is up, send connection there… no?? don’t send traffic to that Real IP) and balancing is based on active TCP/IP connections to the Real IP.
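
The current port-level healthcheck plus least-connections balancing boils down to something like the sketch below. It is not the CSS or ServerIron logic, just the idea: the function names, the 2-second timeout, and the IPs in the usage note are assumptions.

```python
import socket
from typing import Callable, Dict, Optional


def tcp_port_is_listening(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_backend(active_connections: Dict[str, int],
                 is_up: Callable[[str], bool] = tcp_port_is_listening) -> Optional[str]:
    """Choose the healthy Real IP with the fewest active connections."""
    healthy = [ip for ip in active_connections if is_up(ip)]
    if not healthy:
        return None  # nothing is listening; no Real IP gets traffic
    return min(healthy, key=lambda ip: active_connections[ip])
```

Calling `pick_backend({"10.0.0.1": 12, "10.0.0.2": 3})` would probe port 443 on each address and return the healthy one with fewer connections. A content-based check would swap `is_up` for something that fetches and parses the status page.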

We also have a “stand-by” Disaster Recovery environment in another data center hosted in another region - it is a perfect mirror image of our primary server farm, running hot - except the DB server is in standby mode (no Oracle RAC, !!yet!!). Though the configuration for the Global Load Balancing is set up, we have not been able to test it live. In theory, if our primary site becomes unavailable (port 443 not listening anymore) - the Foundry Global Load Balancer will route 100% of new connections to the hot-stand-by site - we’d have to activate the DB server, then follow up with database correlation between the two sites. Unfortunately, there are NUMEROUS J2SE and J2EE based “dispatchers” out there which cache the DNS independently of the system DNS… thus, if our Virtual IP address changes, those systems would need to restart (there are MANY MANY more marketsites affected by this than you want to know - sigh if only they’d set the JVM flag which disables local DNS cache or AT LEAST set it to flush the cache at a regular interval). Restarts are the enemy - especially if a client/partner/customer needs to restart due to a change you make. tsk tsk.

We’re looking at throwing Broker or another vendor JMS queue in to our environment to help reduce our public layer from the database dependency - if we did we would probably try to have a globally load balanced Broker cluster, as of today I haven’t discovered any clear documentation of the benefits or issues this may produce (the benefits seem clear, it is the issues that are hard to discover). Honestly, we haven’t really decided which is a better fit for our messaging strategy - WebLogic JMS or webMethods Broker - so we have to decide on that first.

Broker cannot be load balanced.

Can you elaborate on using Broker “to help reduce our public layer from the database dependency?” What do you mean, exactly?

I had a conversation with a WM Broker person a few weeks back in which he indicated that software-based HA for Broker was under discussion, but very, very difficult to pull off.

I did not get the impression that it was in the “coming soon” category despite having been mentioned in a Product Roadmap presentation at IW 2004.

Mark

Initially our solution was developed for E2E, over time the bulk of our integrations have shifted to marketsites - or, our partners have upgraded their “dispatch” or “integration” systems to newer products which have more flexibility - and don’t need a synch response.

As I mentioned - today our IS server is 100% of the application… it receives, completes, routes, and responds. Unfortunately, the architecture of our solution is nearly 5 years old and some of the legacy integrations depend on a synchronous response which includes referenced data. Trying to maintain a standard across all of our integrations - that model was adopted even for integrations which did not need synch responses. For example, today we may receive a document, validate the document based on content within that document (which may require referencing partner rules from a reference database), produce a tracking ID (by interacting with our reference database or a backend service), and then send the ACK back with status 200. If our backend systems are down, our db server unavailable or experiencing latency, we have issues.

We’re re-architecting such that at least 99% of our integrations are async - so it will hopefully flow something along this pattern: Via IS we receive the document, authenticate it, store it to a message bus, and send an ACK that we have the document. We can then transform the messages to our internal format and perform whatever processing is necessary to complete them. Once completed, we can send the response back to the partner.
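
As a rough sketch of that pattern: the front end only authenticates, enqueues, and ACKs, and the back end drains the queue whenever it is available. This is not our design, only the shape of it. The in-process `queue.Queue` stands in for Broker or a JMS queue, and the partner list and function names are hypothetical.

```python
import queue
from typing import List

message_bus = queue.Queue()     # stand-in for Broker or a JMS queue
KNOWN_PARTNERS = {"partner-a"}  # hypothetical partner list


def receive(partner: str, document: str) -> int:
    """Front end: authenticate, enqueue, ACK. Touches no database or back end."""
    if partner not in KNOWN_PARTNERS:
        return 401
    message_bus.put({"partner": partner, "document": document})
    return 200  # the ACK means "we have the document", not "it is processed"


def drain(responses: List[str]) -> None:
    """Back end: transform and complete queued documents when it is available."""
    while True:
        try:
            msg = message_bus.get_nowait()
        except queue.Empty:
            return
        responses.append(f"processed {msg['document']} for {msg['partner']}")
```

The point is that `receive` can keep returning 200 while the database or back end is down; the documents simply wait on the bus until `drain` runs.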

Can you elaborate on how a broker cluster would know that it is load balanced? Or, what about load balancing would cause issues for a clustered Broker?

Message order, guaranteed delivery, and just-once delivery would be compromised in a Broker cluster environment.

Generally, with the exception of custom Broker clients that have short execution lifecycles, connections to the Broker are persistent, not transient (like HTTP). When IS connects to the Broker, it makes the connection and keeps it. So even if you put an LB device in there, traffic will not be LB’d. Published docs from a given IS will always go to the same Broker. Subscribing triggers will always pull from the same Broker. Broker was never designed to be LB’d.
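
A toy simulation shows why the LB device has nothing to do once the connection is up. Round-robin stands in for whatever algorithm the device uses, and the broker names are made up.

```python
from collections import Counter
from itertools import cycle
from typing import List


def per_request_connections(backends: List[str], messages: int) -> Counter:
    """HTTP-style: each message opens a new connection, so the LB chooses each time."""
    chooser = cycle(backends)  # round-robin stands in for the LB decision
    return Counter(next(chooser) for _ in range(messages))


def persistent_connection(backends: List[str], messages: int) -> Counter:
    """Broker-style: the LB chooses once at connect time; every message reuses it."""
    chooser = cycle(backends)
    connected_to = next(chooser)
    return Counter(connected_to for _ in range(messages))
```

With two Brokers and ten messages, the first function spreads them five and five, while the second sends all ten to whichever Broker the connection landed on.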

Let me put this Load Balanced Broker discussion on hold for one more day… I’m meeting with webMethods folks tomorrow and will present this scenario.

Thanks!