Hello everyone - Can I have your comments on this issue please? I’m not a broker person - do forgive the newbie questions below.
I have a source Integration Server publishing documents to a broker. The document settings are: ‘Storage Type’ = ‘Guaranteed’ and ‘Discard’ = ‘False’. A trigger on a destination Integration Server subscribes to the published documents and invokes a DB interface service to insert them into an external database. In summary, this is the setup:
Source IS -> Broker -> Destination IS -> External Database
What is the best approach to make this setup reliable in the face of system failure? For instance, if the external database goes down, the trigger still receives the document, but the DB interface service returns a failure. Does the trigger notify the broker that the document is still unprocessed, or is the document now irretrievably ‘gone’ from the broker?
If so, do we need to tweak the trigger’s ‘Max attempts’ and ‘Retry interval’ settings, or the DB interface service’s ‘Retry on IS Runtime Exception’ settings?
What happens if the Destination IS now also goes down - will the retries currently being processed be lost?
Someone mentioned the best approach for reliable delivery is to disable the subscribing trigger as soon as the service it invokes (in this case the DB interface service) starts failing. Is this correct?
We have used this approach for one of our interfaces, which requires guaranteed delivery.
It is working perfectly. But I have some suggestions (you might have already done this). They might help you in estimating the trigger property values.
What is the average size of the incoming docs?
How many docs does the IS push to the database every day?
If the database goes down for some reason, what is its maximum downtime?
The answers to the first two questions will give you an estimate of the workload (memory and storage) imposed on the IS during the downtime.
The answer to the last question will give you an estimate for the “max attempts” and “retry interval” trigger properties.
You can switch the “Deliver until” trigger property to “Successful”; however, if the DB downtime is very long, it might affect your IS. So I would definitely go with “Max attempts reached”.
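To make the estimate above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes documents arrive at an even rate over the day, and the function names and sample figures (24,000 docs per day, 10 KB per doc, 2 hours of downtime, 60 second retry interval) are made up for illustration; they are not webMethods settings or measurements.

```python
import math

def estimate_backlog(docs_per_day, avg_doc_bytes, max_downtime_hours):
    """Return (documents, bytes) expected to pile up during an outage.

    Assumes an even arrival rate over 24 hours (an illustrative simplification).
    """
    docs = math.ceil(docs_per_day * max_downtime_hours / 24)
    return docs, docs * avg_doc_bytes

def estimate_max_attempts(max_downtime_hours, retry_interval_seconds):
    """Smallest retry count whose total waiting time covers the outage."""
    return math.ceil(max_downtime_hours * 3600 / retry_interval_seconds)

# Hypothetical sample figures, not taken from a real system.
docs, size = estimate_backlog(docs_per_day=24_000,
                              avg_doc_bytes=10_000,
                              max_downtime_hours=2)
attempts = estimate_max_attempts(max_downtime_hours=2,
                                 retry_interval_seconds=60)
print(docs, size, attempts)
```

With these sample numbers the backlog is about 2,000 documents (roughly 20 MB) and you would need on the order of 120 attempts at a 60 second interval to ride out the outage.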
And you said you changed these settings:
I changed these settings in the trigger:
Retry failure behavior: ‘Suspend and retry later’
Resource monitoring service:
As Srikanth said, the properties are the trigger properties which you can edit in Developer (I am using Developer for Fabric 7). You do have to write a new service that monitors your resource. When the DB goes down, the trigger automatically suspends, and a scheduled service starts running my monitoring service once a minute or so until the DB is back up again. Until that point, documents queue up in the broker.
I tried the retry parameters earlier, but it’s confusing (we have retry in the trigger, as well as service level retry). I think this trigger ‘suspend and retry later’ process is a better and simpler way.
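For anyone trying to picture the flow, here is a toy model of it in Python. It is only a sketch of the idea (peek at a document, remove it from the queue only after the service succeeds, suspend on failure, poll a monitor until the resource is back). It does not use any Broker or IS API, and all names in it (run_trigger, deliver, db_up) are hypothetical.

```python
from collections import deque

def run_trigger(queue, deliver, resource_up, max_ticks=100):
    """Toy model of 'suspend and retry later'; one loop pass = one time tick."""
    suspended = False
    log = []
    for tick in range(max_ticks):
        if not queue:
            break
        if suspended:
            # Stand-in for the resource monitoring service polling the DB.
            if resource_up(tick):
                suspended = False
                log.append((tick, "resumed"))
            continue
        doc = queue[0]  # peek only: the document stays queued for now
        try:
            deliver(doc, tick)
        except ConnectionError:
            suspended = True
            log.append((tick, "suspended"))
        else:
            queue.popleft()  # remove the document only after success
            log.append((tick, "delivered " + doc))
    return log

def db_up(tick):
    # Hypothetical outage: the database is down during ticks 1 to 3.
    return not (1 <= tick <= 3)

delivered = []

def deliver(doc, tick):
    if not db_up(tick):
        raise ConnectionError("database down")
    delivered.append(doc)

queue = deque(["doc1", "doc2", "doc3"])
log = run_trigger(queue, deliver, db_up)
print(log)
print(delivered)
```

In this model the second document fails once, waits on the queue during the outage, and is delivered after the monitor sees the database again, so nothing is dropped.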
Sonam,
That’s a good approach. The key, I think, is to have a reliable way to detect a resource exception and raise it. Can you tell me how you achieved that? For example, how did you determine whether the failure was caused by the database going down, the database filling up, a mapping failure, etc.?
Hi Sekay - in our case, we had a ‘ping’ service available in our resource that checked its availability.
The extent of checking depends totally on you. For example, if the resource in question was a DB connection to an Oracle server, you could do a simple ‘SELECT SYSDATE FROM DUAL’ statement to verify the DB was up. Further SQL statements could be more comprehensive - for example, inserting and then deleting a dummy record. Personally, I would not do the second check because the resource check service does not need to be comprehensive - if the trigger is incorrectly reactivated by the monitoring service, and an exception is still thrown, the trigger will just suspend again.
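A minimal sketch of such a lightweight check, written in Python rather than as an IS service. It uses the standard sqlite3 module and ‘SELECT 1’ purely as stand-ins so the example runs anywhere; against Oracle the probe would be the ‘SELECT SYSDATE FROM DUAL’ statement mentioned above. The function name and parameters are hypothetical.

```python
import sqlite3

def resource_is_available(connect, probe_sql="SELECT 1"):
    """Return True if a connection opens and a trivial query succeeds.

    `connect` is any zero-argument callable returning a DB-API connection.
    Any exception is treated as "resource unavailable".
    """
    try:
        conn = connect()
    except Exception:
        return False
    try:
        cur = conn.cursor()
        cur.execute(probe_sql)
        cur.fetchone()
        return True
    except Exception:
        return False
    finally:
        conn.close()

def failing_connect():
    # Simulates the database being down.
    raise sqlite3.OperationalError("database is down")

print(resource_is_available(lambda: sqlite3.connect(":memory:")))
print(resource_is_available(failing_connect))
```

The check deliberately does nothing beyond the cheap probe: as noted above, if it reports the resource as available too early, the trigger simply suspends again on the next failure.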
Sonam,
The retry count approach also helps you handle service exceptions other than the DB or another legacy system being down.
Actually it is easy. Here is how we implemented it:
This will ensure guaranteed delivery. The only downside is that if the property Max attempts is set to “Successful” and the exception is not resolved in time, it may affect the IS.
The ‘suspend trigger and retry later’ approach kicks in for any service exception (same as the ‘retry count’ approach).
However, it is superior for one main reason: if IS itself goes down while it is still retrying deliveries, the documents being retried are lost. With the ‘suspend trigger and retry’ approach, the undelivered documents are still queued up on the broker itself - hence documents are never lost even when the delivery endpoint goes down, IS goes down, broker goes down, or any combination of outages - the documents are simply retried until successfully delivered. Also, queueing up the documents on the broker (which has persistent disk-based storage) reduces the load on IS.
Thanks Kerni - the ‘suspend trigger and retry later’ approach also handles any service exception - it is not specific to database exceptions.
It is better for one main reason: undelivered documents are still queued up on the broker. Hence documents are never lost, whether the delivery endpoint is down, IS goes down, broker goes down, or any combination of such outages - the queued documents are simply retried until successfully delivered. Queueing up the documents on the broker (which has persistent disk-based storage) also reduces the load on IS.
With the ‘retry count’ approach, if IS is retrying and then IS itself crashes, the documents being retried are lost, since they were only in IS memory and had already been removed from the broker client queue.
I have an integration scenario where IS processes a broker queue. Some documents on the queue may have data errors that cause the trigger service to fail. (call these “bad documents”.)
If the trigger hits a “bad document”, I would like it to move ahead and process other documents in the queue. i.e. “bad documents” should not hold up the processing of “good” documents. However, I would like the “bad documents” to stay in the broker queue, so that they are not lost when IS goes down. Also, I would like to browse/export/delete ‘bad documents’ at leisure through the MWS broker queue interface.
The documents do not have to be processed in order.
The trigger properties window offers two behaviors to cope with retry failures:
(1) Throw service exception - the problem with this is “bad documents” are removed from the broker client queue.
(2) Suspend and retry later - the problem is that both “good documents” and “bad documents” have their queue frozen. When the bad document is processed again, the queue is again frozen.
Kerni had mentioned this regarding the ‘Deliver until’ = ‘Successful’ trigger setting:
This will ensure guaranteed delivery. The only downside is that if the property Max attempts is set to “Successful” and the exception is not resolved in time, it may affect the IS.
Earlier, I thought that the ‘Deliver until’ = ‘Successful’ trigger setting Kerni mentioned caused documents to be lost on IS restart. However, I just tested this setting, and documents remain queued on the broker across an IS restart until the trigger service executes successfully.
However, I found that having a trigger service fail for more than an hour for a significant number of documents (100 or so) may lead to IS hanging.
This does not happen with ‘Deliver until’ = ‘Max. attempts reached’ combined with ‘Suspend and retry later’, which seems more efficient (a resource monitoring service is executed every minute instead of every trigger service being retried). So I still prefer the ‘Suspend and retry later’ approach.
I have a requirement to write a log file while the trigger service is retrying on a transient error. I have set the retry properties in the trigger property panel. The log file needs to record a message each time the service retries. After the max retry count is reached and the trigger gets disabled, I also need to log a message saying that the trigger was disabled. Finally, when the resource monitoring service is called to check whether the resource is available, I need to log a message, including the trigger name, saying that the trigger was enabled. Can someone help me with this? Is there a way to get the name of the trigger that is calling the resource monitoring service?