Error handling best practices

I have been trying over the past few days to come up with a good way to trap errors. The problem is not trapping the errors themselves; it is that we can’t stop error emails from being generated even when we handle the error within the catch block.

We use a sequence like the following:

[SIZE=2]SEQUENCE1 (exit on SUCCESS)
…SEQUENCE2 (TRY - exit on FAILURE)
…Service1
…Service2
…SEQUENCE3 (CATCH - exit on DONE)
…pub.flow:getLastError
…errorHandleService

So, for instance, Service1 throws a service exception. We catch it and return an error to the client, update a db table, etc. However, the server also sends an email to our admins notifying them of an issue. Currently they are getting several thousand emails a day, making it nearly impossible to work the real issues.

I tried utilizing the exit flow statement in conjunction with a branch, but when you exit, you cannot get the error information.
SEQUENCE1 (exit on SUCCESS)
…SEQUENCE2 (TRY - exit on FAILURE)
…Service1
…Branch /Service1Output
…Exit TRY - signal Failure
…Service2
…SEQUENCE3 (CATCH - exit on DONE)
…pub.flow:getLastError
…errorHandleService
I tried setting error variables within the branch, but when you exit out of the SEQUENCE2 block, those variables are removed from the pipeline, and you do not have access to the failure message via getLastError. This makes it impossible to report on what is actually wrong with Service1.

By using a custom Java service, we can generate a ServiceException that contains the proper error message, or we can wrap Service1 and the branch in a separate flow and use Exit $flow from the sub flow (generating a FlowException instead of a ServiceException). With either method you get access to the entire pipeline and the specified failure messages. However, both approaches still generate error emails even when the code properly handles the error.

Another possibility is to have a nested flow like the following:

SEQUENCE1 (exit on SUCCESS)
…SEQUENCE2 (TRY - exit on FAILURE)
…Service1
…Branch /Service1Output
…SEQUENCE (error)
…errorHandleService
…SEQUENCE (nonerror)
…Service2
…Branch /Service2Output
…SEQUENCE (error)
…errorHandleService
…SEQUENCE (nonerror)
…Service3
…etc
…SEQUENCE3 (CATCH - exit on DONE)
…pub.flow:getLastError
…errorHandleService

This approach allows us to recover better from expected errors. For instance, Service1 would return an errorMessage on the pipeline instead of using the built-in Exit steps. However, this method is much more time consuming and makes the flow very difficult to follow once you get 10-20 service calls deep (not to mention all the branch steps and error pipeline values involved).

Yet another solution that appears to work is to wrap error generating services inside of a custom java service that uses the doInvoke calls. This way you can trap any exception via standard java try/catch and only pass through unexpected exceptions. This works for most scenarios, but the idea of wrapping all of our services in a custom java service just seems hacky.
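To make the wrapper idea concrete, here is a minimal sketch of the trapping logic. It is not actual Integration Server code: `Invocable` and `ExpectedError` are hypothetical stand-ins (for the doInvoke call and for whatever exception type marks a known condition) so that the example compiles on its own, and a plain `Map` stands in for the pipeline.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: Invocable and ExpectedError are hypothetical stand-ins so the
// example compiles without the Integration Server libraries.
class TrappingWrapper {

    /** Stand-in for invoking a service against a pipeline. */
    interface Invocable {
        Map<String, Object> invoke(Map<String, Object> pipeline);
    }

    /** Stand-in for an error the caller expects and knows how to handle. */
    static class ExpectedError extends RuntimeException {
        ExpectedError(String message) {
            super(message);
        }
    }

    /**
     * Invokes the service. An expected error is turned into an errorMessage
     * value on a copy of the input pipeline; any other exception propagates
     * to the caller, so the normal alerting still fires for it.
     */
    static Map<String, Object> invokeTrapped(Invocable service, Map<String, Object> pipeline) {
        try {
            return service.invoke(pipeline);
        } catch (ExpectedError e) {
            Map<String, Object> out = new HashMap<>(pipeline);
            out.put("errorMessage", e.getMessage());
            return out;
        }
    }
}
```

The calling flow would then branch on errorMessage, much like the nested example above, but without the exception ever reaching the server's error handling.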

The only other thought I have is to disable the error email generation altogether and only send errors from the catch block when appropriate. My concern here is that there is no fail-safe. If something happens that the developer did not expect, then no email will be sent. I would prefer to send emails by default, but give the developer the ability to suppress known conditions.

How are other people handling these situations? We have 20+ IS’s in production not counting the HA and DR sites with hundreds of flows. The current flows are generating thousands of emails a day making it nearly impossible to support the real issues.[/size]

We disabled our admin service error alerting. It provided no value except to test the capacity of our Exchange system. :smiley:

Your catch block should catch the errors you are concerned about, both runtime and service exceptions. You should probably also look at how you are monitoring your integrations for success and failure (easier said than done). Error handling, including the built-in service admin notification, doesn’t always tell you the health of your integrations. For example, if things are hung or not flowing, there may not be any errors generated at all.

We have not had any issues with it disabled. Disclaimer: make sure you are catching the right things before doing this.

Our service error email alerts are turned on, but the scale of our operation is much smaller than yours and it hasn’t caused us a problem (yet). Perhaps this KB from Advantage about event handlers and subscribing to events would be of interest to you: https://advantage.webmethods.com/advantage?targChanId=kb_home&oid=1612242885

HTH,

Tim

Thanks for the quick replies. I am leery of disabling the emails because there could be valid error items coming through. Perhaps the best thing to do is to use the event manager described in the KB above. Then we can programmatically filter out the errors that we don’t want to report on before sending out emails with the smtp service. Once we have that set up, we can disable the automated error emails and rely on our error handling service to filter things appropriately.
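A rough sketch of what that filtering step might look like follows. Everything in it is an assumption for illustration: the `ErrorEmailFilter` and `Rule` names, the idea of matching on a service-name prefix plus a message regex, and the service names in the usage line. It does not show the event subscription or the smtp call. The one design point it does capture is that the default is to send, so only explicitly listed conditions are suppressed.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: the rule format (service-name prefix + message regex) is an
// assumption, not something the event manager prescribes.
class ErrorEmailFilter {

    /** One known condition that should not produce an email. */
    static class Rule {
        final String servicePrefix;
        final Pattern messagePattern;

        Rule(String servicePrefix, String messageRegex) {
            this.servicePrefix = servicePrefix;
            this.messagePattern = Pattern.compile(messageRegex);
        }
    }

    private final List<Rule> suppressRules;

    ErrorEmailFilter(List<Rule> suppressRules) {
        this.suppressRules = suppressRules;
    }

    /** Returns false only when a suppress rule matches; otherwise the email is sent. */
    boolean shouldEmail(String serviceName, String errorMessage) {
        String name = serviceName == null ? "" : serviceName;
        String message = errorMessage == null ? "" : errorMessage;
        for (Rule rule : suppressRules) {
            if (name.startsWith(rule.servicePrefix)
                    && rule.messagePattern.matcher(message).find()) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a rule such as `new ErrorEmailFilter.Rule("orders.", "not found")` would silence "not found" errors from a hypothetical `orders` folder while still letting any other exception from those services through.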

You can take some steps to reduce email notifications:

  1. Add a mechanism that sends only one error mail if the same error occurs multiple times within a span of, say, 5 minutes.
  2. Store unknown errors in a db, giving each one a unique ID and a count (say "Adapter error ADAX.X.X" occurs every time you do some operation: give this error an ID and update its count on each occurrence).
  3. After a period of time (say 3 months), review the unknown errors that have a high count. If there is a solution, move those errors to a known error list and attach the solution to be sent with the error mail.
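Steps 1 and 2 above could be sketched roughly as follows. This is an in-memory illustration under stated assumptions: the `ErrorThrottle` name and its methods are made up for the example, the error ID is whatever key you assign to an error, and a real setup would keep the counts in a db table as suggested rather than in a map.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: in-memory stand-in for a db table of error IDs and counts.
class ErrorThrottle {

    private final long windowMillis;
    private final Map<String, Long> lastSent = new HashMap<>();
    private final Map<String, Integer> counts = new HashMap<>();

    ErrorThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Counts the occurrence and returns true if an email should go out,
     * i.e. no email for this error ID was sent within the window.
     */
    synchronized boolean record(String errorId, long nowMillis) {
        counts.merge(errorId, 1, Integer::sum);
        Long last = lastSent.get(errorId);
        if (last != null && nowMillis - last < windowMillis) {
            return false; // same error, still inside the window: count only
        }
        lastSent.put(errorId, nowMillis);
        return true;
    }

    /** Number of occurrences seen so far, for the periodic review in step 3. */
    synchronized int count(String errorId) {
        return counts.getOrDefault(errorId, 0);
    }
}
```

With a window of `5 * 60 * 1000L` milliseconds, the first occurrence of an error ID sends a mail, repeats within the next five minutes are only counted, and the next occurrence after the window has passed sends again.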

Hope this helps.

Regards,
Sumit

adkinsjd,

My one-cent contribution: does your getLastError output have a “pipeline” field? It usually contains the pipeline as it was at the moment the error was thrown. You may get some insight from there.

Another dime here, regarding the exit flow: you can use a Java service to throw a ServiceException (with a custom message) at any point to take a snapshot of the pipeline’s contents. This can give you further insight into how to handle the errors.

Another cent makes three: you are in danger of losing performance by making uber-complex services. I have yet to run into an exception that was not caught by the basic try-catch flow, or for which pub.flow:getLastError could not return sufficient information. As redbrick mentioned, the pipeline variable in the lastError data, along with the nestedError structure, should provide all you need.

Do you have any specific exceptions you have found not to be handled by the basic try-catch? Can you post an example?

Chris

Is there a good place where error handling is documented?