Process Instance Stuck / hangs

Hi
We have a strange issue that happens sporadically. BPM works fine for about 90% of the instances. For a few instances, however, execution gets stuck while moving from one step to the next (it is not a data issue). The process status shows STARTED, but there is no activity in that particular step. We increased the debug level and captured the following in the log. Could anybody please explain what these messages mean?

Best Regards
Sivaraj Lenin

14:52:47 GMT [PRT.0101.0278V4] (T5) Incoming process transition event
14:52:47 GMT [PRT.0101.0279V4] (T5) process instance ID: bb5bd6f094c411da835d9e3f4f586d02:1
14:52:47 GMT [PRT.0101.0280V4] (T5) process model ID: P1RS2RTT028
14:52:47 GMT [PRT.0101.0281V4] (T5) source step ID: N2
14:52:47 GMT [PRT.0101.0282V4] (T5) target step ID: N3
14:52:47 GMT [PRT.0101.0283V4] (T5) pipeline: 23,002 bytes
14:52:47 GMT [PRT.0101.0310V4] (T5) starting step execution (pid=bb5bd6f094c411da835d9e3f4f586d02:1, sid=N3)
14:52:47 GMT [PRT.0101.0312V4] (T5) running step synchronously
14:52:47 GMT [PRT.0101.0335V4] (T5) beginning synchronous execution at step N3
14:52:47 GMT [PRT.0101.0341V4] (T5) executing process instance bb5bd6f094c411da835d9e3f4f586d02:1
14:52:47 GMT [PRT.0101.0342V4] (T5) model=P1RS2RTT028, step=N3
14:52:47 GMT [PRT.0101.0343V4] (T5) fragment and step validated
14:52:47 GMT [PRT.0101.0346V4] (T5) checking process status
14:52:47 GMT [PRT.0101.0348V4] (T5) status = STAT_UNKNOWN
14:52:47 GMT [PRT.0101.0349V4] (T5) step iteration = 1
14:52:47 GMT [PRT.0101.0353V4] (T5) evaluating join
14:52:47 GMT [PRT.0101.0001V4] -----------------------------
14:52:47 GMT [PRT.0101.0428V4] (T5) PRT queue processing for step N3
14:52:47 GMT [PRT.0101.0429V4] (T5) process instance ID: bb5bd6f094c411da835d9e3f4f586d02:1
14:52:47 GMT [PRT.0101.0430V4] (T5) event queue empty, nothing to do
14:52:47 GMT [PRT.0101.0001V4] -----------------------------
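
For anyone triaging similar traces, here is a minimal Python sketch that pulls the key fields out of log lines in the format shown above. It matches only on the message text visible in this trace; it does not assume anything about what the individual PRT message codes mean. The names `parse_line` and `summarize` are made up for illustration.

```python
import re

# Line layout as seen in the trace above:
#   HH:MM:SS TZ [PRT.nnnn.nnnnVn] (Tn) message
# The "(Tn)" thread marker is absent on the separator lines.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<code>PRT\.\d{4}\.\d{4}V\d)\]"
    r"(?: \((?P<thread>T\d+)\))? (?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Collect instance ID, source/target step, status and the empty-queue flag."""
    summary = {
        "instance": None,
        "source": None,
        "target": None,
        "status": None,
        "queue_empty": False,
    }
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        msg = rec["msg"]
        if msg.startswith("process instance ID: "):
            summary["instance"] = msg.split(": ", 1)[1]
        elif msg.startswith("source step ID: "):
            summary["source"] = msg.split(": ", 1)[1]
        elif msg.startswith("target step ID: "):
            summary["target"] = msg.split(": ", 1)[1]
        elif msg.startswith("status = "):
            summary["status"] = msg.split(" = ", 1)[1]
        elif "event queue empty" in msg:
            summary["queue_empty"] = True
    return summary


# A few lines copied from the trace above.
SAMPLE = [
    "14:52:47 GMT [PRT.0101.0279V4] (T5) process instance ID: bb5bd6f094c411da835d9e3f4f586d02:1",
    "14:52:47 GMT [PRT.0101.0281V4] (T5) source step ID: N2",
    "14:52:47 GMT [PRT.0101.0282V4] (T5) target step ID: N3",
    "14:52:47 GMT [PRT.0101.0348V4] (T5) status = STAT_UNKNOWN",
    "14:52:47 GMT [PRT.0101.0001V4] -----------------------------",
    "14:52:47 GMT [PRT.0101.0430V4] (T5) event queue empty, nothing to do",
]

result = summarize(SAMPLE)
print(result)
```

Run over a full server log, this makes it easier to list the instances where the status was STAT_UNKNOWN and the event queue was reported empty right after a transition.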

Hi Sivaraj,
If you are using a single instance of the IS, please enable local correlation. This will make sure that all data flows to all the steps.

Thanks,
Praveen

Hi,

We are facing the same issue (IS 6.5.2). We are trying to run two non-clustered IS instances with PRTs pointing to the same DB, and in about 10% of cases we see the same issue as described above. If we switch off one of the IS instances, the issue disappears.

Did anybody resolve this problem?

Thanks a lot for help

Peter

We had the same issue: transitions between steps were taking hours. At that time we cleaned up some space on the DB. (PRT seemed to be very slow.)

I’ve seen this issue before, but in 6.1. WM could never give me a root cause for it. Query the wmprocessstep table and see whether the “hung” step attempted to execute on a different IS node from the one that ran all the other steps. The only reliable solution we could ever come up with was to enable “optimize locally” - though this has its downsides as well…
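
A rough Python sketch of that check, assuming the rows for one process instance have already been fetched from wmprocessstep into dictionaries. The keys `step_id` and `server_id` and the node names are hypothetical placeholders, not the real column names or hosts; check the actual schema of your version before adapting it.

```python
from collections import Counter


def find_outlier_steps(rows):
    """Return (step, server) pairs that ran on a node other than the most common one."""
    if not rows:
        return []
    # Count how many steps each server executed and pick the most frequent one.
    counts = Counter(row["server_id"] for row in rows)
    majority_server, _ = counts.most_common(1)[0]
    return [
        (row["step_id"], row["server_id"])
        for row in rows
        if row["server_id"] != majority_server
    ]


# Hypothetical rows for a single process instance (illustrative values only).
rows = [
    {"step_id": "N1", "server_id": "is-node-a"},
    {"step_id": "N2", "server_id": "is-node-a"},
    {"step_id": "N3", "server_id": "is-node-b"},
]

outliers = find_outlier_steps(rows)
print(outliers)
```

If the stuck step consistently shows up as the outlier, that would support the idea that the hang is tied to the hand-off between IS nodes.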

I’ve seen the suggestions by arulchristhuraj and jlammers be quite effective at a client where PRT was used extensively. The biggest stability gain came from keeping the DB records trimmed to about 2 weeks’ worth of data. Optimize locally was also effective, and I wonder about the real value of having the steps of a single process bounce around multiple IS instances anyway.

The problem is that we cannot use the optimize locally option, because we need to be able to resubmit the process in the middle. In addition, the process might run for a long time, and we would not be able to recover it in case of server failure.

I’m currently experimenting a bit with DB performance, but I assume this will only reduce the likelihood of the issue, not prevent it. Nevertheless, I agree that if the likelihood is low enough, it might be an acceptable workaround for us.

I see. The client I mentioned designed their processes (indeed, all their integrations) such that nothing would ever be restarted in the middle.

I’m not sure I understand this concern. Publishing documents for step transitions is generally slower than the direct invokes that Optimize Locally enables. Can you elaborate?

Purging old data at the client I mentioned eliminated the problems. They have not had hung processes since. Doesn’t make sense but that’s the observed behavior.

Regarding the long-running processes, I was wrong in my concern: I forgot that in that case there are publishable inputs, which act as asynchronous elements.

P.