Broker Server storage corruption - Mitigation and Recovery

In a default installation, Broker Server storage is divided into two separate storage sessions: config and data.

Config storage contains metadata such as brokers, client groups, document types, clients, territory, cluster, and gateway definitions. Unless modified, a set of 3 files makes up configuration storage: BrokerConfig.qs, BrokerConfig.qs.log, and BrokerConfig.qs.stor.

Runtime storage contains guaranteed messages, client queues, and queue statistics. Unless modified, the following set of 3 files makes up runtime storage: BrokerData.qs, BrokerData.qs.log, and BrokerData.qs.stor.

Broker Server storage corruption is a high-impact event because you risk losing messages as well as configuration. If configuration is lost and the brokers are part of a territory/cluster, recovery is even more complex. It is therefore important to avoid storage corruption as far as possible, and to have a recovery plan ready for speedy recovery.

Avoiding storage corruption

The key measure to minimize the risk of storage corruption is to use a reliable storage system that supports synchronous writes. This alone eliminates the biggest external factor that typically causes corruption.
You should also take care to shut down the Broker Server gracefully instead of using 'kill -9' directly on the process. Use the following approach to stop a running Broker Server (a sketch follows this list):
- To stop a specific server, use the "broker_stop" or "server_config stop" command, or stop it from MWS.
- To stop all servers and the monitor, use the "server_config stopall" or "shutdown.sh" command.
- Shutdown is a very fast operation, so wait for about a minute. If the server has not shut down in that time, there is probably an issue that needs further analysis. For later analysis, use pstack (2-3 times at 5-second intervals) and gcore to collect diagnostic data, and report a defect on the slow shutdown.
- If the process has not terminated within a minute, first attempt a plain "kill".
- Use "kill -9" only as a final resort, as it terminates the process abruptly and can potentially cause storage corruption.

Backing up configuration

No amount of precaution can completely prevent storage corruption, so periodically backing up the running Broker Server configuration data will go a long way toward a quick recovery.
Note: A file-system-level copy does not provide a reliable backup while the broker is running and actively modifying the files; in fact, such a backup will very likely be unusable.
Note: There is no supported way to copy runtime data, which contains the messages, while the broker is running.

Back up periodically as follows:
- Use "server_conf_backup" on a running broker. This backs up the configuration data (typically the BrokerConfig.qs* files) and can be done once a day on a running server. During the backup, the broker is briefly paused so that the files can be copied reliably; the time taken is directly proportional to the size of the configuration files. This backup can be restored directly using "server_conf_restore".
- Use "broker_save -server" on a running broker. This exports the configuration data for the full server and can be done once a day on a running server; the broker keeps running during the export. This backup cannot be restored directly, but you can recreate the server and import the configuration using "broker_load".

It is recommended that you also back up the configuration right after you modify it, for example, after adding or changing a document type or a client. A combined backup sketch follows.
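
As a hedged illustration, a once-a-day backup job could combine both tools as below. The host:port, output paths, file names, and exact argument order are assumptions; confirm them against each tool's usage output on your version.

```
# Illustrative sketch; host:port, paths, and argument order are assumptions.
DATE=$(date +%Y%m%d)

# Physical backup of configuration storage (broker is briefly paused).
server_conf_backup localhost:6849 /backup/BrokerConfig-$DATE.backup

# Logical export of the full server configuration (broker keeps running).
broker_save -server localhost:6849 /backup/broker-export-$DATE.dat
```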

Recovering the Broker Server

In case of storage corruption, the Broker Server will either fail to start or periodically crash with a storage-related error, typically including a reference to "QSID - queue storage id" in its log file.

Step 1 (check and fix)

If you suspect storage corruption, it is best to stop the server and take a full backup of the data directory first. Then run "server_qsck check". If corruption is reported, run "server_qsck fix" to repair the storage. If server_qsck fixes the problem, simply start the server.
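
A sketch of this step, assuming an illustrative data directory path (verify server_qsck's exact argument form against its usage output on your version):

```
# Illustrative sketch; the data directory path is an assumption.
DATADIR=/opt/softwareag/Broker/data/awbrokers/default

server_config stop "$DATADIR"                        # stop the server first
cp -rp "$DATADIR" /backup/default-$(date +%Y%m%d)    # full data directory backup

server_qsck check "$DATADIR"                         # report corruption, if any
server_qsck fix "$DATADIR"                           # attempt repair if errors were reported

server_config start "$DATADIR"                       # restart once the fix succeeds
```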

Step 2 (if applicable, recreate runtime storage only, losing only messages while retaining the good old configuration storage)

If server_qsck fails to fix the error, check the server_qsck output to determine whether only the runtime data (typically the BrokerData.qs* files) is corrupted. If so, you can choose to throw away only the runtime storage (thus losing only messages) and recover quickly as follows (a sketch follows this list):
- Stop the server if it is running.
- Delete the runtime storage files (the default ones are BrokerData.qs, BrokerData.qs.log, and BrokerData.qs.stor). If you don't know the actual file names, look up the .qs file name in the session-data configuration in awbroker.cfg, and then run the Linux "strings <filename.qs>" command (or its equivalent) to find the log and store file names.
- Start the server and let it create new runtime storage files with their default configuration.
- If you need to resize the storage, stop the server and run the "server_config storage" command to resize the storage files according to your needs.
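
A sketch of the runtime-only recovery, assuming default file names and an illustrative data directory:

```
# Illustrative sketch; paths and file names are assumptions (defaults shown).
DATADIR=/opt/softwareag/Broker/data/awbrokers/default
cd "$DATADIR"

# If the names were customized, locate the .qs file via the session-data
# entry in awbroker.cfg, then list the companion file names embedded in it.
grep session-data awbroker.cfg
strings BrokerData.qs | grep '\.qs\.'

rm BrokerData.qs BrokerData.qs.log BrokerData.qs.stor   # runtime storage only

server_config start "$DATADIR"   # server recreates runtime storage with defaults
```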

Step 3 (if applicable, restore backed up configuration)

If the configuration storage (typically the BrokerConfig.qs* files) is corrupt, follow these steps to recover (a sketch follows this list):
- Stop the server if it is running.
- Run server_conf_restore to restore the last known good backup. This restores the configuration storage files into the data directory.
- Run "server_qsck check" to ensure that no errors remain.
- Start the server.
Note: If the broker was part of a territory/cluster, run the health checker to ensure everything is in sync.
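
A sketch of the restore, assuming an illustrative backup file and data directory (the exact server_conf_restore argument form may differ on your version; check its usage output):

```
# Illustrative sketch; paths, file name, and argument form are assumptions.
DATADIR=/opt/softwareag/Broker/data/awbrokers/default

server_config stop "$DATADIR"
server_conf_restore /backup/BrokerConfig-20240101.backup   # restore the last good backup
server_qsck check "$DATADIR"                               # confirm no errors remain
server_config start "$DATADIR"
```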

Step 4 (if applicable, import the exported configuration)

If the configuration storage is corrupt and the config storage backup cannot be used, previously exported data can be used to rebuild the server. Rebuilding works perfectly well if the brokers were not part of a territory/cluster or gateway; if they were, just re-importing may not be sufficient. Proceed as follows (a sketch follows this list):
- Stop the server if it is running.
- Delete the server.
- Create a new server with the required configuration and runtime storage settings.
- Start the server.
- Run "broker_load" to load the configuration from the exported files.
- If a territory/cluster or gateways are configured, the newly created broker may not be able to rejoin the territory/cluster, or its peers may refuse the connection.
- Check the territory/cluster brokers and remote brokers of every participating broker and verify that they show as connected to each other.
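
An outline of the rebuild; paths, port, creation options, and file names are assumptions (server creation options vary widely, so only the shape is shown):

```
# Illustrative sketch; paths, options, host:port, and file names are assumptions.
DATADIR=/opt/softwareag/Broker/data/awbrokers/default

server_config stop "$DATADIR"
server_config delete "$DATADIR"    # remove the corrupt server
server_config create "$DATADIR"    # recreate, adding the port/storage options your setup needs
server_config start "$DATADIR"

broker_load /backup/broker-export-20240101.dat -server localhost:6849   # re-import the export
```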

Step 5a (if applicable, use Integration Server)

If no backup exists, you can still quickly restore the setup for native triggers with the following steps.

- Stop the server if it is running.
- Delete the server.
- Create a new server with required configuration and runtime storage settings.
- Start the server.
- If applicable, rejoin the territory. Refer to the last section for more details on territory-related issues.
- Push all document type and trigger definitions from Integration Server.
- Check the territory/cluster brokers and remote brokers of every participating broker and verify that they show as connected to each other.

Step 5b (if applicable, use JNDI definitions)

If no backup exists, you can still quickly restore the setup for JMS topic/queue triggers with the following steps.

- Stop the server if it is running.
- Delete the server.
- Create a new server with required configuration and runtime storage settings.
- Start the server.
- If applicable, rejoin the territory/cluster. Refer to the last section for more details on territory/cluster-related issues.
- Run the jmsadmin program to pull all JNDI topic/queue definitions from the JNDI provider and create them on the new broker.
- Check the territory/cluster brokers and remote brokers of every participating broker and verify that they show as connected to each other.

Post storage recovery checks for territory/cluster brokers

If the Broker Server was hosting a Broker that was part of a territory/cluster or gateway, restoring its configuration from a backed-up file may not be sufficient, and the territory/cluster status may be incorrect:

Run the utility to check territory/cluster health, passing in as many brokers (preferably all brokers) as arguments. Any conflicts it reports need to be corrected one by one.

If you rebuilt the server, it is possible that it cannot rejoin the territory/cluster. In that case, check the remote broker lists of all the brokers and remove every reference to the old broker. Note that when a remote broker is removed, its forward queue is removed from that broker as well. If a forward queue contains important messages, use Queue Export and Queue Import (the Forward Queue Browser based utility clients attached below) to save the messages before deleting the remote broker reference. Once all stale references are removed, you can rejoin the broker to the territory/cluster.

Server startup or server_qsck taking long time

In some cases, server startup takes a long time, which gives the impression that there is a storage issue. A subsequent run of server_qsck also takes a long time, again suggesting that storage corruption is causing the process to hang. Before concluding that the process is hung, check the disk utilization for the data directory (see the sketch below). If plenty of disk reads are shown, it may simply be a case of slow startup, not storage corruption.

Note: server_qsck can and will take a long time with large storage, as it needs to check each and every piece of stored data. However, server startup should not take more than a minute on fast storage, because the server uses on-demand loading of data. So treat a slow server startup as a potential issue.
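
For example, on Linux you can confirm sustained disk reads during startup with standard tools; the data directory path and device name below are assumptions, so derive the device from the data directory's filesystem first.

```
# Illustrative sketch; the device backing the data directory is an assumption.
df /opt/softwareag/Broker/data/awbrokers/default   # find the filesystem/device for the data directory
iostat -x sdb 5                                    # sustained heavy reads suggest slow startup, not a hang
```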

QueueImport.zip (3.2 KB)

QueueExport.zip (3.64 KB)