Each server today is Windows, with 2 CPUs and 3.5 GB of RAM. They are all live and hosted in the same subnet… it should also be noted that our IS servers are nearly 100% of our application - receive/complete/route/respond/etc…
If it weren’t for a plague of network device failures between our IS servers and the DB cluster (even redundant network environments can fail terribly)… that is, if we subtract all of the downtime caused by device failures and network issues between our DB server and IS, we would be operating at an average of five 9’s (yes, that’s 99.999%) for the past three years. As it is, we’re still averaging 99.8% over the past three years. This includes an entire subnet migration, IS upgrades from 4.0.1 to 6.0.1, OS upgrades, and server replacements (including a DB migration to new hardware).
Since we recently moved from the Cisco CSS to the Foundry ServerIron load balancers, we’re looking at leveraging some of the Layer 7 features in the Foundry, such as the ability for the load balancer to read content from an HTML page and balance traffic based on the content of that page - e.g. we could have a dynamic page display the resource utilization of our three servers; if one of them were particularly stressed, we could weight traffic toward the other two servers. With the Cisco (and how we currently run), health checks are performed at Layer 4 (is port 443 listening? Yes? The server is up, send connections there. No? Don’t send traffic to that Real IP) and balancing is based on active TCP/IP connections to the Real IP.
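The dynamic status page idea above can be sketched very simply. This is a hedged illustration only - the `StatusPage` class, the 80% threshold, and the `STATUS=OK`/`STATUS=BUSY` tokens are assumptions of mine, not Foundry configuration syntax; the real content rule on the ServerIron would match whatever marker you choose to emit:

```java
// Minimal sketch of a dynamic page an L7 load balancer could fetch and parse.
// Hypothetical names/thresholds; the LB content rule would match the token
// in the response body and weight traffic away from a BUSY server.
public class StatusPage {

    // Render a tiny HTML page whose body carries the server's health token.
    static String render(double cpuUtilization) {
        String token = (cpuUtilization < 0.80) ? "OK" : "BUSY";
        return "<html><body>STATUS=" + token + "</body></html>";
    }

    public static void main(String[] args) {
        System.out.println(render(0.35)); // lightly loaded -> STATUS=OK
        System.out.println(render(0.95)); // stressed -> STATUS=BUSY
    }
}
```

In practice the page would be generated per-server from live utilization numbers; the point is that the health decision moves from "is the port open?" to "what does the server say about itself?".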
We also have a “stand-by” Disaster Recovery environment in another data center, hosted in another region - it is a perfect mirror image of our primary server farm, running hot - except the DB server is in standby mode (no Oracle RAC, !!yet!!). Though the configuration for Global Load Balancing is set up, we have not been able to test it live. In theory, if our primary site becomes unavailable (port 443 no longer listening), the Foundry Global Load Balancer will route 100% of new connections to the hot stand-by site - we’d have to activate the DB server, then follow up with database reconciliation between the two sites. Unfortunately, there are NUMEROUS J2SE- and J2EE-based “dispatchers” out there which cache DNS independently of the system DNS… thus, if our Virtual IP address changes, those systems would need to restart (there are MANY MANY more marketsites affected by this than you want to know - sigh, if only they’d set the JVM property which disables the local DNS cache, or AT LEAST set it to flush the cache at a regular interval). Restarts are the enemy - especially if a client/partner/customer needs to restart due to a change you make. tsk tsk.
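For what it’s worth, the JVM-side fix for those dispatchers is small. The `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties are real JVM settings (they can also go in `java.security`); the 60/10-second values below are just example choices, and the properties must be set before the JVM does its first lookup:

```java
import java.security.Security;

public class DnsCacheFix {
    public static void main(String[] args) {
        // Cache successful DNS lookups for 60 seconds instead of forever,
        // so a changed Virtual IP is picked up without restarting the client.
        Security.setProperty("networkaddress.cache.ttl", "60");

        // Don't hang on to failed lookups for long either.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}
```

The same effect can come from editing `java.security` in the JRE, which is often easier to push to partners than a code change - but either way it beats a restart.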
We’re looking at throwing Broker or another vendor’s JMS queue into our environment to help decouple our public layer from the database dependency - if we did, we would probably try to have a globally load-balanced Broker cluster; as of today I haven’t discovered any clear documentation of the benefits or issues this may produce (the benefits seem clear, it is the issues that are hard to discover). Honestly, we haven’t really decided which is a better fit for our messaging strategy - WebLogic JMS or webMethods Broker - so we have to decide on that first.
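The decoupling we’re after can be shown with an in-process analogy - not Broker or WebLogic JMS code, just a `BlockingQueue` standing in for whatever queue product we pick, with hypothetical names throughout:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueDecoupling {
    // Stand-in for a JMS destination: the public-facing layer enqueues work
    // and returns immediately, with no DB round-trip on the request path.
    static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by the public layer (e.g. the IS receive service).
    static void acceptRequest(String doc) throws InterruptedException {
        queue.put(doc);
    }

    public static void main(String[] args) throws Exception {
        acceptRequest("order-1001");
        acceptRequest("order-1002");
        // The consumer side (the eventual DB writer / Broker subscriber)
        // drains at its own pace - if the DB is briefly unreachable,
        // documents wait in the queue instead of failing at the edge.
        System.out.println(queue.take());
        System.out.println(queue.take());
    }
}
```

A real deployment would substitute a durable, clustered queue for the in-memory one - which is exactly where the undocumented "issues" (global load balancing of the cluster, ordering, failover semantics) come in.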