How to fork and join process threads (w.r.t. the data)?

Hello.

In my process model, I have a sequence of steps, then a forking gateway at which two threads start. Each thread makes some changes to the documents contained in its pipeline. After that, the two threads are And-joined.

Schematically:

...--> A --> Fork --> B1 --> B2 --> Join --> D ...
                \                  /
                 \--> C1 --> C2 --/

If the step “Fork” puts something (a document) into the pipeline, then both B1 and C1 will have this document in their input pipeline.

I have a question and a problem here.

The question

If I modify a field of the document in, say, B1: is that change immediately visible in C1 or C2? I.e. is the process pipeline more like a global process variable or more like a “thread local” variable (“thread” as in “process thread or branch”)?

The problem

In the step “Join” I’d like to gather changes made in the two threads. But both threads have the document under the same name. Should I then declare two input documents with the same name and type as the input of the service that backs (implements) the step “Join”? Or how else can I tell one document (coming via the thread B) from the other (coming via the thread C)?

A clarification will be much appreciated!

If you are able to modify a field in the document, did you create it as a run-time document? I mean, NOT as a ‘document reference’.

You can pass a document to B1 and C1 in parallel, but when it comes back to D, how would step D know which copy came from B1 and which from C1 if both have the same name? I don’t think you should modify the data structure inside a processing step when the same doc is passed to another flow as well.

-Senthil

I meant “the value of the field”, not “the document structure”.

This is exactly what I’m asking about! :) The next thing I’m going to try is to create two copies of the document and place them into the pipeline under different names, so that each thread gets its own copy. I’m not sure where the best place to do this is: before or after the fork. That is, should the step “Fork” create the copies and hand them to the appropriate threads, or should “Fork” just give the document to B1 and C1, which would then each create a copy for “internal thread use”? Either way, the step “Join” would get two documents with the same structure but with different names.
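The first variant (the step “Fork” creates the copies) could be sketched roughly as follows. This is plain Java, not the webMethods API: a `Map` stands in for the pipeline and for the document, and the names `docB`, `docC`, `status` and `amount` are made up for illustration. It also assumes that branch B only changes `status` and branch C only changes `amount`.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the process pipeline; NOT the webMethods API.
class BranchCopySketch {

    // Hypothetical "Fork" step: puts one copy of the document per branch
    // into the pipeline, under different (made-up) names.
    static Map<String, Object> fork(Map<String, Object> doc) {
        Map<String, Object> pipeline = new HashMap<String, Object>();
        // Shallow copies: enough for flat documents, nested ones need a deep copy.
        pipeline.put("docB", new HashMap<String, Object>(doc));
        pipeline.put("docC", new HashMap<String, Object>(doc));
        return pipeline;
    }

    // Helper that fetches a named document from the pipeline.
    @SuppressWarnings("unchecked")
    static Map<String, Object> doc(Map<String, Object> pipeline, String name) {
        return (Map<String, Object>) pipeline.get(name);
    }

    // Hypothetical "Join" step: takes "status" from branch B's copy
    // and "amount" from branch C's copy.
    static Map<String, Object> join(Map<String, Object> pipeline) {
        Map<String, Object> merged = new HashMap<String, Object>();
        merged.put("status", doc(pipeline, "docB").get("status"));
        merged.put("amount", doc(pipeline, "docC").get("amount"));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> original = new HashMap<String, Object>();
        original.put("status", "NEW");
        original.put("amount", 10);

        Map<String, Object> pipeline = fork(original);
        doc(pipeline, "docB").put("status", "APPROVED"); // work done in branch B
        doc(pipeline, "docC").put("amount", 25);         // work done in branch C

        System.out.println(join(pipeline));
    }
}
```

Because each branch only touches its own copy, the join can tell the two documents apart by name and decide field by field which branch to trust.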

  1. Call service A and spawn two threads.
  2. In B1 create an IData object (temp1Doc) and populate it with the required fields or docs or doc lists.
  3. In C1 create an IData object (temp2Doc) and populate it with the required fields or docs or doc lists.
  4. Use Java’s Future API to get the results from both threads, then merge temp1Doc and temp2Doc using IDataUtil.merge to form the canonical doc (or target doc).
  5. Call a validation service to validate the merged doc against a schema (to ensure the doc structure conforms to the canonical one).
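For a plain Java service (as opposed to a process model), steps 2 to 4 could look roughly like the sketch below. It is only an approximation under stated assumptions: a `Map` stands in for `IData`, `putAll` stands in for the merge, the field names are invented, and the validation step is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java sketch; Map is a stand-in for IData, putAll for the merge.
class FutureMergeSketch {

    static Map<String, Object> runAndMerge() {
        // Task for branch B: builds its own temporary document.
        Callable<Map<String, Object>> taskB = () -> {
            Map<String, Object> temp1Doc = new HashMap<String, Object>();
            temp1Doc.put("fieldFromB", "b-value"); // made-up field
            return temp1Doc;
        };
        // Task for branch C: builds a separate temporary document.
        Callable<Map<String, Object>> taskC = () -> {
            Map<String, Object> temp2Doc = new HashMap<String, Object>();
            temp2Doc.put("fieldFromC", "c-value"); // made-up field
            return temp2Doc;
        };

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Map<String, Object>> resultB = pool.submit(taskB);
            Future<Map<String, Object>> resultC = pool.submit(taskC);

            // get() blocks until the corresponding task has finished,
            // so this part plays the role of the join.
            Map<String, Object> targetDoc = new HashMap<String, Object>();
            targetDoc.putAll(resultB.get());
            targetDoc.putAll(resultC.get());
            return targetDoc;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("branch task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndMerge());
    }
}
```

Neither task ever sees the other's document, so there is no shared state to protect; the only point of contact is the merge after both `get()` calls return.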

You’ve probably misunderstood me. I do not mean that I have a service (Java or flow) from which I want to spawn Java threads. What I have is a process model where I have branches. I also called them threads, but these are not Java threads (java.lang.Thread).

I thought that would be clear from my description and from the fact that I posted the question to this forum (and not to “java and flow services”).

What is driving the use of a fork? Would things be much simplified if the steps were serial?

Each branch has an intermediate message event (of receive type), and the wait time is data driven.

But regardless of that: is the wish to fully understand the technology not enough? If BPMN makes it easy to fork and join branches in a process model, then I would like to know how to use the feature, even if only in order to be prepared for the “real case.”

That’s a fair question. Sometimes understanding the technology for its own sake is fun/useful.

But I often find that when someone wants to multithread things in an integration, it is almost always unnecessary. Speed is almost never a primary concern for unattended/automated integrations. So working through the issues that arise from introducing multithreading becomes suspect.

The root issue here is two different threads modifying the same document. That to me is a non-starter. That is why I asked what is driving the desire to use fork.

If they are 2 independent documents but with the same structure then what you described in one of your earlier posts would be the way to go. I would assume that “merging” the 2 documents is technically reasonable, with some fields coming from one doc and others coming from the second.

I know you’re here looking for advice/guidance on the “best” way to accomplish this. The “best” way will be whatever works for what you are trying to do. Given the participation on the thread thus far, either the folks who have travelled this “what if both paths modify the same document name” path before are not here, or you’re covering new ground.

A valid point! :) I’m fairly new to this technology and the product (I assume vendors differ in how they implement the BPMN spec, if this behaviour is defined in the spec at all). But I thought this would be a rather common pattern: not to be used everywhere, only when really necessary, but still something a BPMN developer should know off the top of their head. That’s why I hoped to get an answer here.

BTW: The wm process developer guide states that parallel updates to a document are not supported and can lead to unpredictable results.

Hello, fml2!

I am on a journey to try and understand what the heck is going on in this engine so that I can feel like, when I put a bunch of tasks on a page and join them together, that I can anticipate what might actually happen. The entire thread thing and lack of any associated information is kind of blowing my mind. There does not seem to be any concise description of when a new “thread” is started, when it goes away, what is happening in the pipeline when there are multiple threads, or what happens if you don’t enable parallel execution on a step and it finds itself being called from multiple threads. Does this parallel thing only apply when multiple threads are calling the step at the same time? The way that time affects these processes is unclear to me.

Does a receive task always create a new thread? What kinds of things cause that new thread to go away? Any synchronized join? I’m thinking, “not necessarily”. It sounds like receive tasks are potential mayhem factories. Are they?

Say I have a process that goes Start-> AND → X, Receive → AND → X, … and I fire 30 messages at the receive, then I assume that I would get 30 threads going, and they would each hit X because the transition from Start is always true. What technique could I possibly use at some point downstream to guarantee that I’m at exactly one thread? If I AND in something that only occurs once, do all of those 30 threads all essentially terminate at that join? If they do and I get some more messages after that, do they each sail through that join on their own threads because the join has been satisfied once? Do I need to include some mechanism to turn off the flow of data from that receive event when I no longer should be accepting it? What would that look like?

And why do I get build errors when I have a synchronized OR in a loop? I don’t doubt that the builder is correct. I just have no real idea why this is a problem, due to my ignorance. Understanding that might also help me understand what’s happening with these threads.

Now, since I don’t know what I’m talking about, any of the above assertions may be false. But I can find no documentation that either supports or refutes any of them. So there’s my straw man. Someone please burn it for me.

If I don’t get any action on this, I may re-post with a different subject that’s more titillating…

Jason

Can a Send task end a thread like a Receive task can start one? Can I use a send task to dump a thread that I don’t want? Can anything (else) terminate a thread without joining to another one? End tasks shut down the whole shooting match, right? Not just one thread?

It seems like each thread would have to have its own copy of the pipeline that would be merged in with the others at the join. Otherwise, multi-threading would be completely impossible…

This business of figuring out how to dump extra threads is reminding me of digital circuit design. Everything needs to be connected to some kind of output, even if it’s just a resistor to ground… I’m finding myself putting an AND gateway right before the End task where all my optional (multiple) threads terminate, along with a transition from the start task and a transition from my critical path. I feel like I’m off the rails as soon as I wander outside of the black box.

Hi All…

I have not read all the messages; I’m providing my $0.02 based on the first message.

The question

If I modify a field of the document in, say, B1: is that change immediately visible in C1 or C2? I.e. is the process pipeline more like a global process variable or more like a “thread local” variable (“thread” as in “process thread or branch”)?

RAJ: To answer this question: say Doc1 arrives at the Join (let’s name it G1) from B1 as the first transition. C1 will not have any knowledge of it, because the process is still waiting at G1. When the join is then satisfied by the transition from B2, what happens is that Doc1 (B1) gets overwritten by Doc1 (B2). C1 will have only the latest Doc1, which means B2’s updates but not B1’s.
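The overwrite described here amounts to a “last arrival wins” merge on the document name. Whether the engine really behaves this way is not confirmed anywhere in this thread; the sketch below only models the description in plain Java, with a `Map` per branch standing in for that branch’s pipeline and with made-up values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Models a join that merges branch pipelines in arrival order.
// Plain Java, NOT the webMethods API; engine behaviour is an assumption.
class JoinOverwriteSketch {

    // Entries with the same name are replaced by the later arrival.
    static Map<String, Object> join(List<Map<String, Object>> arrivals) {
        Map<String, Object> joined = new HashMap<String, Object>();
        for (Map<String, Object> branchPipeline : arrivals) {
            joined.putAll(branchPipeline);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new HashMap<String, Object>();
        first.put("Doc1", "updated by first branch");

        Map<String, Object> second = new HashMap<String, Object>();
        second.put("Doc1", "updated by second branch");

        // The first branch's update to Doc1 is lost after the join.
        System.out.println(join(Arrays.asList(first, second)).get("Doc1"));
    }
}
```

Under this model the result depends on which branch reaches the join last, which is one more reason to give each branch its own document name.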

The problem

In the step “Join” I’d like to gather changes made in the two threads. But both threads have the document under the same name. Should I then declare two input documents with the same name and type as the input of the service that backs (implements) the step “Join”? Or how else can I tell one document (coming via the thread B) from the other (coming via the thread C)?

RAJ: The obvious choice is to handle this by having two different docs, and then map or merge them somehow at C1.

Best
~Raj