A common requirement for an EAI solution is to hold messages in the queue while the target system is unavailable. In webMethods terms, this means that when a document is published and the target application is not available, the subscribing trigger on the IS should be shut down automatically; otherwise the trigger service will be invoked anyway and will fail to do its job (e.g., updating the target database).
Since the expectation is that the message is held until the target is back up, this design falls short.
Has anyone faced this (I suspect most of you have) and can suggest a simple solution?
Each IS trigger has a retry setting. If your IS service calls the pub.flow:throwExceptionForRetry service to report that the target system is down, IS automatically retains the document and retries it later. You can set the number of retries and the interval between retries on the trigger. This applies to webMethods 6.1.
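The retry behaviour described above can be modelled in a few lines of Python. This is a simulation, not webMethods code: RetryableError is a hypothetical stand-in for the exception that pub.flow:throwExceptionForRetry raises, and invoke_with_retries, max_retries and retry_interval_s are illustrative names for the trigger's retry count and interval settings. The sketch assumes the first invocation is not counted as a retry.

```python
import time
from typing import Callable


class RetryableError(Exception):
    """Hypothetical stand-in for the exception pub.flow:throwExceptionForRetry raises."""


def invoke_with_retries(service: Callable[[dict], None], document: dict,
                        max_retries: int, retry_interval_s: float,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Invoke service once, then retry up to max_retries times on RetryableError.

    Any other exception propagates immediately. Returns True on success and
    False once the retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            service(document)
            return True
        except RetryableError:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_retries:
                sleep(retry_interval_s)
    return False


# Usage: a target database that is down for the first two attempts.
attempts = []


def update_target_db(doc: dict) -> None:
    attempts.append(doc["id"])
    if len(attempts) < 3:
        raise RetryableError("target database unavailable")


ok = invoke_with_retries(update_target_db, {"id": 42}, max_retries=5,
                         retry_interval_s=0.0, sleep=lambda s: None)
print(ok, len(attempts))  # True 3
```

Because the waits add up to at most max_retries × retry_interval_s, a retry interval measured in seconds can only cover a multi-hour outage if the retry count is very large.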
I believe this retry route was designed for transient errors. If the target system is down for an extended period, say 2-4 hours, it may not work, because the retry interval is normally set in seconds. There may be one more issue: document order will be lost, and depending on the traffic, a large number of documents may end up set for redelivery.
Ideally there would be a mechanism to hold the messages in the queue itself, i.e., not even deliver them to the trigger, and it should be possible to enable it based on the type of error identified in the invoked service.
That’s exactly what happens with the above solution. Documents never leave the queue until a retry succeeds. Only the first message is in retry mode, so document ordering is preserved, unless you have turned on parallel processing, in which case you are not concerned about ordering in the first place. The length of time the system is down is not an issue.
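The ordering argument above can be checked with a small Python simulation of a trigger configured for serial processing. It models the behaviour the reply describes and is not webMethods code: drain_serially, RetryableError and max_polls are hypothetical names, and the retry interval between polls is left out for brevity.

```python
from collections import deque
from typing import Callable, Deque, List


class RetryableError(Exception):
    """Hypothetical stand-in for the retry exception thrown by the trigger service."""


def drain_serially(queue: Deque[dict], service: Callable[[dict], None],
                   max_polls: int) -> List[dict]:
    """Simulate a serial trigger that retries the head document until it succeeds.

    Only the head of the queue is ever offered to the service. While it keeps
    failing with RetryableError it stays at the head, so later documents cannot
    overtake it. max_polls bounds the simulation; a real infinite retry has no bound.
    """
    delivered: List[dict] = []
    for _ in range(max_polls):
        if not queue:
            break
        try:
            service(queue[0])
        except RetryableError:
            continue  # target still down: leave the head in place and poll again
        delivered.append(queue.popleft())
    return delivered


# Usage: the target is down for the first four polls, then comes back.
state = {"calls": 0}


def update_target_db(doc: dict) -> None:
    state["calls"] += 1
    if state["calls"] <= 4:
        raise RetryableError("target database unavailable")


queue = deque([{"id": 1}, {"id": 2}, {"id": 3}])
print([d["id"] for d in drain_serially(queue, update_target_db, max_polls=10)])  # [1, 2, 3]
```

With parallel processing this model no longer applies, since several documents would be handed to the service at once.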
Having said all that, I agree that the software itself should provide this capability out of the box, without the above steps being necessary. Some of their adapters already do, e.g., the WebSphere MQ adapter. I think it should be template-driven rather than left to each programmer to figure out. In very large infrastructures, database maintenance can become a real issue if any of the services is not coded correctly.
The Retry On Error trigger setting, combining two parameters (Deliver Until Successful and the retry interval), will solve your problem. The order of documents will not change, because the first document in the queue stays in the retry loop until it is delivered.
FYI, the Enterprise-version adapters have this capability built into their design, since the adapter makes a client connection rather than a pooled connection. Once the database goes down, the adapter also turns red and all the documents are held in the queue.
I was a bit wary of the retry-until-successful option because of the caution in the webMethods documentation itself. The generally advised approach (though not one aimed at the hold-until-ready problem) is to retry until max attempts is reached. If anyone has used retry-until-successful and can share how the server behaves (whether it really becomes unresponsive during retries), that would help.
I see you have already gone ahead with the “Max attempts reached” option. I just wanted to add a couple of issues with infinite retry to this thread, in case anyone is interested.
Issue 1: The trigger property “Deliver until” doesn’t work as expected when it is set to “Successful”. In our environment on WM 6.1, it rejects the document after a single attempt.
WM support provided “Fix 15” for IS, which fixes this issue and also allows the infinite retry loop (a problem the WM documentation warns about) to be interrupted when an IS shutdown is initiated.
Issue 2: This concerns specifying a Time-To-Live (TTL) for infinite retries. In the subscribing service, we generally catch and rethrow runtime exceptions so that the trigger retries. Before throwing, we can also compare the current time with the document’s publish time and throw the runtime exception only if the difference is less than the TTL period. Once the TTL has elapsed, the error is not rethrown as a runtime exception, so the trigger stops retrying.
Since we use a generic error-handling service for all types of exceptions, it is all the easier for us to pass a TTL parameter to this service from our main service.
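A minimal sketch of the TTL check in Python, as it might look inside a generic error handler. It is a model under stated assumptions rather than IS code: RetryableError stands in for the runtime exception that makes the trigger retry, ExpiredError is a hypothetical non-retry error raised once the TTL has passed, and published_at is the document’s publish time, however your service obtains it. The now parameter exists only so the check can be exercised deterministically.

```python
from datetime import datetime, timedelta, timezone
from typing import NoReturn, Optional


class RetryableError(Exception):
    """Hypothetical stand-in for the runtime exception that makes the trigger retry."""


class ExpiredError(Exception):
    """Hypothetical non-retry error raised once the document has outlived its TTL."""


def handle_target_error(error: Exception, published_at: datetime, ttl: timedelta,
                        now: Optional[datetime] = None) -> NoReturn:
    """Rethrow error as retryable only while the document is younger than ttl."""
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - published_at
    if age < ttl:
        # Still within its time-to-live: ask the trigger to retry.
        raise RetryableError(f"target error, retrying (age {age})") from error
    # TTL exceeded: raise a non-retry error so the infinite loop ends.
    raise ExpiredError(f"giving up after {age} (TTL {ttl})") from error


# Usage: a 4-hour TTL checked 1 hour and 5 hours after publishing.
published = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
for hours in (1, 5):
    try:
        handle_target_error(ConnectionError("target database down"), published,
                            timedelta(hours=4), now=published + timedelta(hours=hours))
    except RetryableError:
        print(f"{hours}h: retry")
    except ExpiredError:
        print(f"{hours}h: expired")
```

Chaining with `from error` keeps the original failure attached, so whatever logs the final ExpiredError still shows why the target call failed.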
I must confess that we have not tried this solution with TTL yet, but I don’t see why it shouldn’t work. If anyone sees a problem with this approach, please let us know.
How about this solution?
Use a document resolver service (one that implements pub.publish:documentResolverSpec) to check the nature of the error and decide what action(s) to take. For example, based on the retry count and the last error, you can send an e-mail or page to the administrator to disable the service.
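The kind of decision such a resolver might make can be sketched as a small Python policy function. Everything here is hypothetical: the error codes in TRANSIENT_ERRORS, the alert_after threshold and the Action values are illustrative, and in IS the retry count would come from the runtime (for example via pub.flow:getRetryCount) rather than being passed in by hand.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    ALERT_ADMIN = "keep retrying and alert the administrator"
    REJECT = "reject the document"


# Hypothetical error codes treated as "target system unavailable".
TRANSIENT_ERRORS = {"connection_refused", "timeout", "target_unavailable"}


def decide_action(retry_count: int, last_error: str, alert_after: int) -> Action:
    """Hypothetical resolver policy based on the retry count and the last error."""
    if last_error not in TRANSIENT_ERRORS:
        # A data or mapping error will not go away by waiting, so retrying is pointless.
        return Action.REJECT
    if retry_count >= alert_after:
        # The outage is lasting: e-mail/page the administrator, who can disable the service.
        return Action.ALERT_ADMIN
    return Action.RETRY


print(decide_action(2, "timeout", alert_after=10).name)  # RETRY
```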