Tasks in clustered environment

Radek_J · July 10, 2014, 6:24pm

Hi!
I have a question about Processes and Tasks in a clustered environment. Imagine a long-running task is executing on IS1 and that IS dies for some reason. The Task stays in the running state. Even when I set a timeout and the task fails, resubmission still requires the same IS1 server… (which is still down).

Is it possible to configure IS/PRT so that upon server failure the task is resubmitted on a different IS? Or am I missing something? I tried both wM 8.2 and wM 9.0 with Broker as the JMS provider.

system · July 13, 2014, 7:42am

By IS, do you mean Integration Server? The Task Engine resides on MWS.
In your case, if you are resubmitting only the process step that contains the task, it can fail because the task makes calls to the IS that is down.
In that case you can point your web service calls to the cluster URL or the load balancer, so that calls are automatically redirected to an active IS.
Let me know if this helps.
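
To illustrate the idea behind pointing calls at a load balancer rather than a single node, here is a minimal Python sketch. It is not webMethods code: the node URLs, the `pick_active_endpoint` helper, and the health check are all hypothetical and only show the principle of routing to whichever node is still up.

```python
from typing import Callable, Optional, Sequence

def pick_active_endpoint(endpoints: Sequence[str],
                         is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose health check passes, or None."""
    for url in endpoints:
        if is_alive(url):
            return url
    return None

# Hypothetical node URLs; IS1 is simulated as down.
nodes = ["http://is1.example:5555", "http://is2.example:5555"]
down = {"http://is1.example:5555"}

# The caller never hard-codes IS1, so the call lands on IS2 instead.
active = pick_active_endpoint(nodes, lambda url: url not in down)
print(active)
```

A real load balancer does this selection for you, which is why the client only needs to know the single LB address.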

Radek_J · July 14, 2014, 2:46pm

Makes sense.

You mean the Task Engine URL should point to the cluster/LB address?

system · July 14, 2014, 6:27pm

Yes, that should be done.
But in your case the problem is that resubmission fails because the IS is down; correct me if I am wrong.
If that is the problem, then you need to configure the IS calls from your CAF code/Task to point to the IS load balancer.
Check your CAF Application runtime configuration and point the wsclient-endpointAddress to the IS load balancer URL.
Let me know if this is unclear.

Radek_J · July 15, 2014, 7:17am

Yes, resubmission fails when the IS is down. The Task service is pointing to a Node that is down, so setting the LB address should solve the problem.

There is another issue; maybe you can help. In a running process, one of the tasks is a long-running IS service. While the service is running, the Node goes down. The Task stays in the running state until the Node is back up. So I have moved the long-running service to a sub-process and set a timeout after which the sub-process fails.
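
The pattern described here (wrap the long-running call and mark it failed once a timeout passes) can be sketched generically in Python. This is only an illustration of the idea, not how PRT implements sub-process timeouts; `run_with_timeout` and the status strings are made-up names.

```python
import concurrent.futures
import time

def run_with_timeout(service, timeout_seconds):
    """Run `service` in a worker thread; report failure if it exceeds the timeout."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(service)
    try:
        return ("completed", future.result(timeout=timeout_seconds))
    except concurrent.futures.TimeoutError:
        # The caller stops waiting; the worker itself is not interrupted.
        return ("failed", None)
    finally:
        executor.shutdown(wait=False)

# A stand-in for a service that hangs because its node went down.
print(run_with_timeout(lambda: time.sleep(0.3), 0.05))
```

Note that the timeout only frees the caller to react (for example, by resubmitting elsewhere); it does not stop the stuck work itself.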

Is there a better way to handle failover?

system · July 16, 2014, 7:05am

If the time to execute the service from the task cannot be reduced/optimized, then this might be a workaround.
Alternatively, can you try increasing the timeout for the web service call so that the task can wait until the node is back up?
You can do it only for the specific CAF portlet where this problem occurs. Search for your portlet in the CAF application runtime configuration and increase the “wsclient-socketTimeout” value (in milliseconds), then let's see the results.
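
For anyone unsure what a socket timeout given in milliseconds actually controls, here is a small generic Python sketch. It has nothing to do with the CAF setting's internals; `read_with_socket_timeout` is a hypothetical helper that just shows a client giving up when the server side stays silent for longer than the timeout.

```python
import socket

def read_with_socket_timeout(host, port, timeout_ms):
    """Connect and wait for a reply, giving up after timeout_ms milliseconds."""
    # Python socket timeouts are in seconds, so convert from milliseconds.
    with socket.create_connection((host, port), timeout=timeout_ms / 1000.0) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            # No data arrived within the timeout.
            return None
```

A larger value makes the caller wait longer for a slow or recovering server before the call is reported as failed.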

Radek_J · July 16, 2014, 9:55am

Hi!
I don’t use CAF projects, only Tasks. Anyway, thanks for the clarification!