Service Availability - 'parallel' production environment

Hi All…

I am seeking some advice… Not sure this is even possible, but thought I would open it up to the ‘world’…

OK, I have an IS cluster in an Active/Active configuration running on the same hosts as the Brokers which are running in Active/Passive standard HA (yep, similar to the recent thread that Mark covered so well)…

My issue, like many I am sure, is overall service availability…

So I am looking to redesign our architecture. We have been bitten by some nasty IS bugs, and the question has been raised about parallel production environment availability…

I have to think this is an issue for all the bankers out there… :wink:

With the IS one-broker limit and the requirement that all clustered IS instances access the same broker… how can a parallel production environment be implemented?

I have thought maybe network load balancers might help, but then the broker gateway requirements pop up (keeping the two environments ‘linked’ would cause a gateway loop).

The only idea along the same lines is a separate parallel environment behind a network load balancer, but how then would you keep a network ‘hiccup’ from causing part of a ‘process’ or ‘conversation’ to jump to the other ‘environment’… not to mention the two clustered IS setups would be using different db instances - and all the issues related to ‘half’ conversations.

Of course, both environments could access the same db instance, but that would mean the db is the single point of failure for everything - which is how our current environment is set up, and we got hit last year by that weak point (a painful 18hr outage waiting for the db to be reconstructed)…


I hope this gives you an idea of what I am trying to get my brain around…

Any ideas, insight, suggestions, good jokes :lol: to keep me going?


Great topic Ray. This is something I’ve got on the plate for this year as well. I couldn’t come up with a great way to do this that would still be supported by webMethods. I would love to hear from others as well if they have tackled this.

Here is where I’m leaning right now. I’m only going to load balance client-initiated transactions. In other words, HTTP requests or SOAP over HTTP. Things that are getting pulled in via adapters etc. will stay in their current architecture.

I’m going to load balance against two separate instances of the IS server and the Broker server. They will each have their own database. The load balancer supports sticky clients, so I can do request/reply without worrying about it going to two different servers. That of course still doesn’t completely solve the problem of a transaction getting entered twice, especially on a network blip, as you say, on the return reply. The user or application might hit submit again even though the transaction may have completed.
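A minimal sketch of one way to guard against that double-submit, assuming the client can send a unique transaction ID with each request. Everything here is invented for illustration - the table name, the IDs - and sqlite3 just stands in for whatever the real backend database is:

```python
# Sketch of a duplicate-submission guard keyed on a client-supplied
# transaction ID. sqlite3 is a stand-in for the real backend DB;
# the table and column names are made up for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_txns (txn_id TEXT PRIMARY KEY)")
conn.commit()

def submit_once(txn_id, handler):
    """Run handler only if this txn_id has not been seen before."""
    try:
        # The PRIMARY KEY constraint makes the check-and-record atomic:
        # a second insert of the same ID raises IntegrityError.
        conn.execute("INSERT INTO processed_txns (txn_id) VALUES (?)", (txn_id,))
        conn.commit()
    except sqlite3.IntegrityError:
        return "duplicate"          # client resubmitted; skip processing
    return handler(txn_id)

result1 = submit_once("TXN-1001", lambda t: "processed")
result2 = submit_once("TXN-1001", lambda t: "processed")  # retry after a blip
```

The catch for the parallel-environment case is that this only works if both IS instances check the *same* record of seen IDs, which is exactly the shared-db question again.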

We tend to lean towards our client requests being async, which makes this easier. Just drop it into the Integration layer and forget about it. The request/reply presents more challenges, as you already know. That’s the part I’m still working on. :slight_smile:

Hopefully others that have addressed this will respond.

Okay for the joke - What do you call a boomerang that doesn’t come back?

What do you call it if it doesn’t come back? I’m really curious!


I believe it is a ‘stick’ … :wink:


What db are your IS instances using?

I ask because I was recently told about Oracle ‘logical standby’ instances… (available starting with 9i)

My understanding: one could create multiple ‘logical standby’ db instances of a ‘primary’ instance. The benefit of a logical standby instance is that it is an accessible db instance (forgive me for not knowing the correct terms, I am not a dba), meaning you can run reports from it and make changes…
… again, my understanding is you could also access it via an app and have changes updated back to the primary.

I’m wondering if anyone has used this ‘feature’ of Oracle and can comment…

My thinking is to create the duplicate production environment, and have the db instances set up in a logical standby mode… hence each environment has its own db, but the dbs are sync’d…

Another solution is Oracle RAC… which, if implemented correctly, would increase availability over ‘old’ HA setups - as I understand RAC (but RAC could be very expensive).

My concern regarding the logical standby config would be the ‘sync time’ of the db instances… I think the integrations would need some kind of ‘duplication’ checking in place on the ‘non-wM’ components (dbs and others)…
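For what it’s worth, one rough way to keep an eye on that ‘sync time’ is a heartbeat row: write a timestamp on the primary and compare it to what the standby has applied. This is only a sketch - the two sqlite3 in-memory databases stand in for the Oracle primary/standby pair, and all the names are invented:

```python
# Sketch of measuring replication lag with a heartbeat timestamp.
# sqlite3 stands in for the Oracle primary and logical standby.
import sqlite3
import time

primary = sqlite3.connect(":memory:")
standby = sqlite3.connect(":memory:")
for db in (primary, standby):
    db.execute("CREATE TABLE heartbeat (ts REAL)")
    db.execute("INSERT INTO heartbeat (ts) VALUES (0)")
    db.commit()

def write_heartbeat():
    primary.execute("UPDATE heartbeat SET ts = ?", (time.time(),))
    primary.commit()

def apply_to_standby():
    # Stand-in for the replication machinery shipping the change across.
    ts = primary.execute("SELECT ts FROM heartbeat").fetchone()[0]
    standby.execute("UPDATE heartbeat SET ts = ?", (ts,))
    standby.commit()

def replication_lag():
    p = primary.execute("SELECT ts FROM heartbeat").fetchone()[0]
    s = standby.execute("SELECT ts FROM heartbeat").fetchone()[0]
    return p - s   # seconds the standby trails the primary

write_heartbeat()
apply_to_standby()        # standby is caught up at this point
time.sleep(0.05)
write_heartbeat()         # newest beat not yet applied to the standby
lag = replication_lag()   # roughly the 0.05s gap
```

In a real setup you’d alert when the lag crosses whatever window the business can tolerate for ‘half’ conversations.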


The joke that is. We are using Oracle, but we are just moving to 9i. I haven’t really read up on the logical standby database concept. It makes sense though; it kind of reminds me of multi-master replication with LDAP or even Lotus Notes.

But would you have two active, separate IS instances up and responding to requests? I’m wondering about repository-stored configuration items that are specific to an instance, specifically thinking about all of the webm******* tables. If the instances (dbs) were being synched, wouldn’t that cause some issues? Or would the secondary IS instance be down until needed? I guess it would have to be; otherwise you wouldn’t be able to hook it up to all your source or target systems.

Could you have an inactive IS instance pointing to your standby database and then just bring it alive via Veritas once you determine you need to? If the database is active, then the IS startup would be the only wait time in that case.

I’m just thinking out loud here by the way. Anybody actually tried this feel free to chime in.


My first thought was both production environments would be up at the same time - in order to increase ‘availability’ - but you raise good points…

Also, I just did some ‘google-ing’ of Oracle logical standby… it seems it doesn’t allow updating the primary (unless I am missing something), and not all datatypes are supported. So maybe this won’t work. :frowning:

I’ve also sent a question to wM support about logical standby and the datatypes… so I’ll pass on any info I get there.

So as it stands… I might be able to do a logical standby, but any wM components ‘hooked up’ would have to be disabled until needed.

But along the lines of increasing ‘uptime’… any experience with non-clustered IS adapters? Duplicate (identical) ‘adapter’ instances? Or would these have to be clustered?


I was thinking (still out loud) maybe using disk-level replication to sort of achieve the logical standby. If you used SAN storage for both your Oracle side and your webMethods side, you could replicate at the SAN level to another site (logical or physical). This type of replication is generally pretty fast. The secondary site would be down, but Veritas could handle bringing it up when failure was detected with the primary. It’s not immediate, but it would be sort of fast.

My biggest concern would be: what constitutes a failure? If it is more than hardware (which plain old Veritas clustering can handle), then wouldn’t having a replicated parallel environment - either through disk-level replication or through Oracle’s mechanisms - just replicate the problem over to the other environment? Not sure about this; I guess it would depend on what caused the failure in the database.

I certainly understand the issue with the database on the back end. We haven’t had major failures, but a lot of minor ones, especially with patching. It bites having the Integration layer down while the database is being patched or whatever. Oracle RAC is probably the Oracle answer for this, but after reading the literature on it, I’m a bit worried about the increased complexity it would bring to our environment. Kind of cool though with the grid stuff. :slight_smile:


I have been having ongoing discussions with folks and we have determined even a RAC implementation will not resolve the issue of db being the single point of pain.

Just as you called out, patching the db, even in RAC, would require a db outage.

With that in mind, it appears our remaining option is an Oracle Grid. I haven’t done much investigation yet, but my understanding is the Grid would allow the db to be down for a patch install, while the Grid architecture handles “standby” db activities so that the app is unaware of an outage.

Of course, if RAC is too complex an implementation - out of the ballpark - a Grid has to be on another planet! But a Grid should be able to be implemented on ‘cheaper’ hardware than RAC, I would think…

I am starting to look into this option… Which will spawn a post on the IS board about any Grid use…



Regarding file-level replication… yes, this was discussed too, including the point you raise. In order to have the ‘copy’ as close to production as possible, you basically have to do dual writes - but this would cause any ‘bad’ data to be written to both locations, hence providing no benefit.

And as you say, everything depends upon the nature of the ‘issue’ or failure. But most times, the non-wM apps are not the issue… it is the data that causes the issue.



Can you expand a little on your load balancer and broker implementation? Or did I misunderstand your idea? :confused:

Will you have two IS instances load balanced to one broker? Or two IS instances AND two broker instances, each with load balancers in front?


Hey Ray,
I’m planning on two IS instances, two brokers, and a separate backend Oracle instance for each. The load balancer will sit in front of the IS instances. The IS instances will not be tied together except that they will perform the same set of functionality.

I won’t be doing any internally initiated transactions on these instances, i.e., JDBC Notifications, Listeners, etc. This will all be external traffic via HTTP. I’m using sticky sessions on the load balancer.

There is a small chance of duplicate transactions because an IS instance could crash before acking back to the client. But this is pretty rare based on our transaction types and speed.


OK… so you will have two dedicated IS/broker pairs behind a load balancer…

LB -> IS/Broker <-> data

I’m guessing your environment doesn’t use broker gateways then? :frowning:

Also, your brokers will be on dedicated hosts (two brokermons?), correct?

Do you have B2B playing in any of your environment?

I have split ‘b2b’ and ‘adapter’ functions (wanted to make it ‘easier’ to find issues)… so I have IS instances doing b2b only or adapter only.

b2b-IS <-> broker <-> adapter-IS <-> data
as well as
b2b-IS <-> adapter-IS <-> data as needed

But I also have a large EAI environment from old days, and my brokers have gateways (hub-spoke config) to exchange events…


That’s right LB -> IS/Broker <-> data . They are on physical dedicated servers.

We are doing B2B as well as broker gateways, but not as part of this particular solution. This architecture is really only designed with client HTTP requests in mind and fairly simple integration requirements; otherwise problems with transactions, state management, etc. crop up.

The other instances (that are not going to be load balanced) handle the more complex stuff. My IS’s are generally split out by major integration function and project. They do not get much more fine-grained than that. I do break up the integration architecture within each IS into Adapter functions, Canonical functions, etc.

Hi Guys,

We have almost the same scenario: 4 instances of IS and Broker.

Network load balancer > IS1 | BROKER > Oracle 9
Network load balancer > IS2 | BROKER > Oracle 9
Network load balancer > IS3 | BROKER > Oracle 9
Network load balancer > IS4 | BROKER > Oracle 9

You should implement a duplicate-check service that checks all DB instances for unique data based on the customer transaction. Syncing data between instances is not an easy/fast task, even using Oracle RAC. This is the only way I see. You will have some performance degradation because of these DB checks across all instances, but I think this will be acceptable. Of course, this depends on business needs. Needs testing.
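To make the idea concrete, here is a rough sketch of such a duplicate check: query each environment’s database in turn before accepting a transaction. sqlite3 stands in for the four Oracle instances, and the table/column names are invented for this example:

```python
# Sketch of a duplicate-check service that looks in every environment's
# database before accepting a transaction. sqlite3 stands in for the
# four Oracle instances; "orders"/"order_no" are made-up names.
import sqlite3

# One connection per parallel environment (IS1..IS4 above).
instances = [sqlite3.connect(":memory:") for _ in range(4)]
for db in instances:
    db.execute("CREATE TABLE orders (order_no TEXT)")
    db.commit()

def is_duplicate(order_no):
    """Return True if any instance has already recorded this key."""
    for db in instances:
        row = db.execute(
            "SELECT 1 FROM orders WHERE order_no = ?", (order_no,)
        ).fetchone()
        if row:
            return True
    return False

def accept(order_no, db):
    """Record the transaction on one instance unless it exists anywhere."""
    if is_duplicate(order_no):
        return False
    db.execute("INSERT INTO orders (order_no) VALUES (?)", (order_no,))
    db.commit()
    return True

first = accept("ORD-42", instances[0])
second = accept("ORD-42", instances[2])  # landed on a different instance
```

In practice you would check against a business key the customer transaction already carries (order number, document ID, etc.), and you’d have to decide what happens when one instance is unreachable - which is the performance and availability trade-off mentioned above.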

Best Regards,
Krasimir Zlatev

Are these 4 independent db instances? And you’ve implemented this ‘duplicate check service’ to catch dups in data?


Sorry, got busy… but then who isn’t these days…

Have you implemented your ‘mirror’ environment? How is it working?


I’ve implemented it in test. No issues yet but it is a fairly simple set of integrations.

I’m wondering how you are doing on your database availability issue? Have you made any architecture decisions yet?

Hello Mark,

Could you post some more details what you have implemented? I am very interested in this.

Best Regards,
Krasimir Zlatev

There is really not much to it. Two separate environments (IS server and Broker) on two different physical servers with separate backend databases. The servers are fronted by a Cisco content switch. The code is duplicated on both servers. These are all stateless transactions initiated by external client(s) via HTTP (web services).