Handling of mass data

We use wM to synchronise multiple systems which each other. That means that every change on one system will be published as a document on one IS and is processed by the receiving IS. We are processing up to 500.000 documents per day.

The first problem is how to handle parallel processing of documents. We know that you can set a trigger to parallel processing but this does not work for us because the sequence of messages is important (e.g. create message must be processed before update message for the same ID).

We work around this issue by using a numerical ID for each data type (e.g. order number) and set up 10 sequential triggers. The first trigger handles all IDs ending with ‘0’, the second one all IDs ending with ‘1’ … and the tenth trigger handles all IDs ending with ‘9’. This at least preserves the message sequence for every ID.

I do not know why it is not a standard feature of a wM Trigger to be able to set one field of a document as “sequence ID” and all documents for one ID are handled sequentially. We found that almost for all data that we publish the message sequence is important.

Do you know of any other approach for the message sequence problem while trying to do parallel processing?

The second problem that arises from our solution to the first problem. The I/O throughput of our wM-Sun-Solaris-Server seems to be a performance bottleneck (as reported by “iostat -x 5”). The triggers are not able to deliver documents to the flows fast enough. We think this is caused by “guaranteed delivery”. Every document will be buffered on disk on every step (on publishing side and on subscribing side) in order to prevent data loss. This causes too much I/O load especially when working with a lot of triggers.

What can we do about this besides buying a faster harddisk?

Thank you,
Ulf Licht

Ulf,

For each published document, there is a sub-document called “_env”. This is the broker envelope or the meta-data relating to the life of the document on the broker. Look towards the bottom for a string field called “pubid”. This is the unique publication ID that the broker assigns to the document when it is published and it is what is used for tracking purposes. To my knowledge, it is unique.

Regarding your trigger sequencing problem, you can set the trigger to handle each document sequencially rather than concurrently. This means, however, that the documents will be processed FIFO (First in, first out.)

HTH,

Ray

Ulf,

I forgot to answer your second question regarding I/O. You can set the Document Trigger Stores under the admin console. Click on the link under settings labeled resources. There is a section detailed Document Stores Settings.

The way it works is that the trigger picks up the document from the broker and stores it into the trigger store. When IS figures out which trigger receives the document, then the doc is moved to the document store. The trigger itself contains settings under the settings also called document store, but are only relating to this specific trigger. You can set these as well.

So in recap, you can optimize the settings for the overall integration server document store, and you can fine tune the settings per trigger.

Hope this makes sense.

Ray

Ray,

first thank you for your answers. Maybe I need to further clarify our problems. Lets start with the message sequence:

The standard wM approach for parallel processing of documents is to have a single trigger that is set to parallel processing. This helps to speed up processing but you loose the sequence in which the documents were published.

Lets use order data as an example. In the source system order “01” is created then order “01” is updated. After this order “02” is created and then order “02” is updated. It is important to keep the message sequence because we have to process “create” messages before “update” messages in order to prevent errors.

If you think about it a little further it would be enough to keep the message sequence per order ID. It is ok to process the insert of order “02” before the update of order “01” because they are two separate orders. But you must not process update of order “01” before creation of order “01”.

It is possible to use parallel processing but one has to make sure that orders with the same ID are always handled by the same thread in order to keep the message sequence per ID. This can not be done using a single wM trigger.

Our approach (for numerical IDs) is to set up ten (10) triggers that each handle the order documents that end with a different number (0-9). This guarantees that orders with the same ID are always handled by the same trigger. Each trigger is set to handle documents sequentially because it represents one thread that processes documents.

Of course this approach causes more I/O load because more triggers are used. But this is the second topic that I will talk about in another post.

I just wondered why this is not a standard feature of a wM Trigger. The message sequence problem appears to be all over the place (at least with all of our integration tasks). A wM trigger needs another setting where you can set one field of a subscribed document as an ID field which will be used to distribute the documents to the threads when the trigger is set to parallel processing and guarantees that documents with the same ID are always handled by the same thread.

Or maybe there is a completely different approach that we did not think about yet?

Unfortunately the handling of documents with different IDs is not a standard webMethods trigger feature. The approach that you describe, using different triggers to split up the documents into ten groups, is one way to work around this. I did something very similar on a project myself, although in my case I had to be a bit more complex since the different groups also had priorities (i.e. all type “1” docs had to be processed before any type “2” docs could go through).

As to why this isn’t a webMethods feature - my guess is that it’s hard to implement. You could certainly call up the support line and put in a feature request for it.

One thing to be careful of - when you set up your triggers to receive the different order numbers, make sure you use broker filtering to determine which document goes to which client. This means that your filter should use the lexical operators, as described in App. B of the Publish-Subscribe Guide. The lexical operators will cause your documents to be filtered on the broker, not the IS, which is much faster.

I’d be interested to know more about the performance issues you’re seeing. What architecture do you use? Sample architecture questions - IS and Broker on same box? Single IS or cluster? Do you use Trading Networks? Is your database for TN/audit logging/etc. on the same machine or somewhere else? What version are all your wM components, databases, OS, etc.? All these things can affect performance.

Ulf and Skip,

webMethods actually contains this functionality, but it’s not out of the box in the form of a trigger.

Basically, a trigger is a broker client with some properties that you can set. You can enable or disable and set how the documents are removed from the broker queue (parallel or serial).

For the functionality that you both seek, examine the Broker Client API for Java that is resident in the /broker/doc/ directory.

I coded a simple java client on the integration server to test SSL connections from the client to the server in about five minutes.

The java api provides methods to getEvent() or getEvents() - parallel or serial. You can examine the contents of the document fields and if you do not wish to process the document, you can put it back on the queue.

However, to get documents in the order in which they reach the broker by number, you have to pull serially.

HTH,

Ray

Ray,
I will look into the Java broker client API. Maybe I can build a more elegant solution myself using the API that does not require multiple parallel triggers to be setup. This might help improve performance because we do not need multiple triggers, which each cause more I/O load.
I took a quick look but maybe you can help me, how would you implement parallel processing using the Java API? Do you set up a number of threads which call getEvent()? Or is parallel processing already included in the API?

Skip,
here is the part of our setup that causes performance issues: Two integration server and two brokers on one physical Sun machine. One IS with one broker is used for one backend system. Currently nothing is clustered. IS version is “IS_6-0-1_SP2”. Operating system and hardware is “SunOS 5.9 Generic_112233-03 sun4u sparc SUNW,Sun-Fire-880”.

We have a problem with I/O performance because the documents are written/read to/from disk at every step of the way (IS1->Broker1->Broker2->IS2). At least this is what I suspect by looking at %b > 90 on “iostat -x” output and seeing that “Recent Deliveries” values for the trigger clients decreases as soon as more triggers try to process documents.

We do use broker filters in order to dispatch documents to the triggers on the broker.

I see options to improve performance:

  • disable “Inbound Client Side Queuing” in order to prevent documents to be written/read to/from disk when received by IS2
  • move “Trigger Document Store” to faster or separate disks
  • move broker storage file to faster or separate disks

But maybe someone knows other practices of handling mass document data with webMethods?

I thank both of you for your answers!

Best regards,
Ulf

Hmmm, this sounds like a problem that might be addressed by a implementation based on the “Resequencer” pattern as described in Gregor Hohpe’s “Enterprise Integration Patterns” book on pp. 283-293.

Amazon: http://www.amazon.com/exec/obidos/tg/detail/-/0321200683/qid=1086277021/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/104-7653998-7444748?v=glance&s=books&n=507846

A resequencer reads messages in arrival order and uses a sequence number to buffer the incoming messages until the buffer contains an in-order sequence. Then the resequencer republishes the messages in order.

I agree with Ray that you may need to implement the “resequencer” as a java service that uses the Broker Client API.

It may also be possible to design the resequencer such that it buffers all create messages until the corresponding update has been received and then publishes a combined message that could be consumed by a normal IS trigger that would invoke a Flow service to do both operations (create and update). (Hmmm, sounds like a join condition to me. Maybe you could use a trigger for this.).

Mark

Ulf,

Thanks for the architecture information, it’s helpful to see what you guys have set up. I’m a bit confused by the 2 Broker setup - why not just connection all the ISs to the same Broker?

Your three options for improving performance are good ones. You can also make the document volatile instead of guaranteed, but then you would need to have another method of recovering any errors. I’ve done this before with resubmission capabilities on the source system. Not sure if that’s an option for you, it depends on your particular requirements.

Ulf and Ray,

Ray is correct that the Broker API can be used to receive documents to the IS, just like a trigger. However, it is quite a bit of work to set it up, and you lose the client-side queuing capabilities of the IS. That may not be a big deal if you’re thinking of disabling it anyway for performance reasons.

If you decide to go the Broker API route, then you’ll probably want to do it on both the source and client side. On the client side, you have a single Broker Client that uses shared state (see Sharing Client State in the Broker Java API Dev Guide, pg. 67 in the 6.1 version). Set the shared event order to SHARED_EVENT_BY_PUBLISHER. You can create as many connections to this same client as you like - each connection represents a different thread to use in processing.

Here’s how the process would work:

  1. On the source side, use the Broker API to create a client with a unique client ID based somehow on your order number, so that creates and updates for the same order use the same client ID.
  2. Publish your document using that client.
  3. Destroy the client.
  4. The document goes across the Broker (or two Brokers in your case).
  5. The target side client receives the document and delivers it to one of your threads. The thread that gets the document depends on the publisher’s client ID, so you don’t have to worry about orders getting out of sync.

Now this is pretty complex to implement, so you may not want to go this deep into the API. And you could run into other issues with the Broker API itself - I’d be worried that creating and destroying all the source side clients could cause problems, so you’d want to test that extensively. Still, it’s another idea on how you might get this to work.

Gents,

One correction:

pubid corresponds to the broker clientid.

Use the trackid in the env document.

Ray

Thank you for all your suggestions!

Skip,

the environment pattern “one IS and one Broker for one backend system” was setup by a wM consultant when we first installed wM at our company. The advantage is that you can shut down any part of the setup for maintenance without stopping the complete enterprise integration. Only one backend system will be disconnected in case of failure or maintenance of one IS and/or broker.

In order to improve performance I did take a look at disabling “Inbound Client Side Queuing”. When trying to do it locally on my 6.1 environment I noticed that the feature does not exist anymore (“not available”). It seems that Trigger Document Stores are now always in memory.

Now we will try to disable “Inbound Client Side Queuing” in our 6.0.1 test- and production-environment to reduce I/O load.

Ulf