IT outages: how can they be avoided?

By Ilan Hirschowitz 30 July 2024

What happened in simple terms?

Friday, July 19th was a bad day worldwide, and the week after wasn’t much better. A bug in a newly rolled-out update to a critical security component caused havoc around the world. Not since the doomsday predictions of Y2K (the Year 2000 bug), which was expected to have widespread dire consequences, has a single event hit banks, airlines, and other major institutions with such severe system outages. At least with Y2K, companies were anticipating the risk; I recall reaching out to our Australian colleagues at midnight just to confirm that all was okay. This one took the world completely by surprise.

An affected computer shows a blue screen with error code 0x50 or 0x7E and then goes into a restart loop, which leaves the Windows machine unusable. A single point of failure (SPOF) like this can become any company’s biggest nightmare.

Companies normally have automatic updates turned on, so the bug was unwittingly propagated to all of their servers that Friday.

Who was affected?

Microsoft estimated that about 8.5 million Windows devices, less than one percent of all Windows machines worldwide, were impacted by the bad update. This figure sounds small, but many more connected applications were affected and became non-functional, meaning that many more end users could not work. The airline industry was possibly the most visible victim, since some airlines reported that upward of half of their IT systems are Windows-based.

Being robust, our customers’ Adabas & Natural mainframe applications were not directly affected by the outage. Still, those systems may have been unreachable wherever the endpoints connecting to the mainframe were down.

Why is it so difficult to fix?

The fix requires manual intervention by an IT specialist, because the recovery involves several steps that an end user would find difficult to perform: starting Windows in safe mode and then deleting the offending file. Because the affected Windows machines are offline, the IT specialist needs to physically sit in front of every machine to perform the recovery, a time-consuming task.
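The "delete the offending file" step can be sketched in a few lines of Python. This is only an illustration: the helper `remove_matching_files`, the folder, and the file pattern are hypothetical, and the real file name and location must come from the vendor's remediation guidance. The sketch defaults to a dry run so that the matches can be reviewed before anything is removed.

```python
import tempfile
from pathlib import Path


def remove_matching_files(directory: Path, pattern: str, dry_run: bool = True) -> list[Path]:
    """Find files in `directory` matching `pattern`; delete them unless dry_run."""
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


# Demonstration in a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "bad-update-001.sys").write_text("faulty")
    (folder / "good-driver.sys").write_text("fine")

    removed = remove_matching_files(folder, "bad-update-*.sys", dry_run=False)
    remaining = sorted(p.name for p in folder.iterdir())

print([p.name for p in removed], remaining)
```

Even with a script like this, someone still has to get each machine into safe mode first, which is why the recovery took so long.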

You may ask, how can one avoid this?

One way is to keep encrypted point-in-time (PIT) backups stored in a secure vault.

For both malware attacks and inadvertent software bugs, PIT backups allow you to revert to a state before the attack or bug.
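As a rough sketch of the idea, choosing a restore point comes down to picking the newest backup taken before the incident. The `Backup` class, the labels, and the timestamps below are made up for illustration and do not reflect any particular backup product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Backup:
    """A point-in-time backup; both fields are illustrative."""
    label: str
    taken_at: datetime


def latest_backup_before(backups: list[Backup], incident: datetime) -> Optional[Backup]:
    """Return the newest backup taken strictly before the incident, or None."""
    candidates = [b for b in backups if b.taken_at < incident]
    return max(candidates, key=lambda b: b.taken_at, default=None)


backups = [
    Backup("wed-night", datetime(2024, 7, 17, 23, 0)),
    Backup("thu-night", datetime(2024, 7, 18, 23, 0)),
    Backup("fri-night", datetime(2024, 7, 19, 23, 0)),
]

# Hypothetical time at which the bad update arrived.
incident = datetime(2024, 7, 19, 4, 0)

restore_point = latest_backup_before(backups, incident)
print(restore_point.label)  # thu-night
```

The Friday-night backup is skipped because it was taken after the incident and would already contain the faulty update.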

A cloud-based solution could also help: once the underlying problem is fixed, new servers can be provisioned and spun up quickly, even while the old ones remain unavailable.

How can a robust cluster solution protect your mission-critical databases?

If you rely on an Adabas database, you have another way to recover from an outage. Software AG’s Adabas Cluster for Linux solution provides a transparent failover mechanism for any Adabas database outage under Linux. When it comes to patching, each node is taken out of the cluster and patched separately, without affecting the health of the remaining cluster nodes. If one node gets a bad update, the others continue to function as normal. This is also critical in preventing common single-point-of-failure scenarios. The same holds for our mainframe Adabas Cluster Services, which takes advantage of a z/OS® Sysplex cluster.
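The node-by-node patching pattern can be illustrated with a small simulation. This is a generic sketch, not the Adabas Cluster tooling: `rolling_patch`, `apply_patch`, `is_healthy`, and the node names are all hypothetical stand-ins.

```python
from typing import Callable


def rolling_patch(
    nodes: list[str],
    apply_patch: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Patch nodes one at a time and stop at the first failed health check.

    Returns (patched_ok, quarantined). Nodes that were never reached stay
    untouched and keep serving on the old version.
    """
    patched_ok: list[str] = []
    quarantined: list[str] = []
    for node in nodes:
        apply_patch(node)  # the node is assumed to be out of the cluster here
        if is_healthy(node):
            patched_ok.append(node)  # safe to rejoin the cluster
        else:
            quarantined.append(node)  # keep it out and halt the rollout
            break
    return patched_ok, quarantined


# Simulated three-node cluster receiving a faulty patch.
state = {"node1": "v1", "node2": "v1", "node3": "v1"}


def apply_patch(node: str) -> None:
    state[node] = "v2-bad"


def is_healthy(node: str) -> bool:
    return state[node] != "v2-bad"


ok, bad = rolling_patch(list(state), apply_patch, is_healthy)
print(ok, bad)  # [] ['node1']
print(state)    # node2 and node3 are still on v1
```

In this simulation the faulty patch only ever reaches the first node; the rollout stops there and the remaining two nodes carry on with the old version.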

A Sysplex-based cluster has several advantages, such as high availability, robust disaster recovery, scalability, and efficient isolation of systems in the Sysplex from one another, which provides a secure environment for different applications.

Finally, clustering also addresses the patching and SPOF issues.

Hopefully, some simple preventative measures such as these will allow DBAs and IT Admins to sleep better at night!

Learn more about Adabas Cluster for Linux from Software AG.
