Splitting a document list

Hi All,

Our flow service receives a single document list containing many records (roughly 5-6k) via a web service call.
Before processing this document list in a single service with a loop, we want to divide it into batches of 100 records and implement a pub-sub mechanism to improve processing time.
What is an optimal way to divide the document list into batches of 100 records without looping over the whole list?

Thanks
Baharul Islam

If you are a core Java expert, write a Java service instead of using a LOOP step.

Yes, I also think a Java service will solve the problem, but I have not worked out the proper logic for it yet.

I have written some Java code for this. But to produce an output docList with the same name as the input docList, I have to pass the document name in order to write the IData variable. Is there any way to get, at run time, the name of the input docList that is mapped to my Java service's inputList?

Thanks
Baharul Islam

Baharul - You can pass the name of the argument as one of the input parameters to the Java service. Let me know if you still have questions, and correct me if I didn't understand your question correctly.

Thanks,

Hi,

I have implemented it that way, by passing the name as an input parameter. But there should be some way to get the name of the pipeline input document; with built-in services we don't need to pass the name of the output document.

Thanks
Baharul Islam

Baharul – As you specified earlier, you want the output variable to have the same name as the input. So inside the code, set outputVarName = inputVar_name (you will get inputVar_name as below):

String inputVar_name = IDataUtil.getString( pipelineCursor, "inputVar_name" );
outputVarName = inputVar_name;
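
To make this concrete, here is a minimal sketch (the parameter names listName and inputList are assumptions for illustration) of a Java service that writes its output array under whatever name the caller supplies:

IDataCursor pc = pipeline.getCursor();
// name under which the caller wants the output, e.g. "employeeList"
String listName = IDataUtil.getString( pc, "listName" );
IData[] inputList = IDataUtil.getIDataArray( pc, "inputList" );
// write the list back to the pipeline under the dynamic key
IDataUtil.put( pc, listName, inputList );
pc.destroy();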

Thanks,

Hi MR,

My service input is inputList (a document list)
and the Java code is as follows:
IData[] inputList = IDataUtil.getIDataArray( pipelineCursor, "inputList" );

Now, when I use this Java service in a flow service, I map employeeList (a document list) to inputList (the input of the Java service). After the service completes, I want the output of the Java service to be named employeeList.
I hope this clarifies your doubt.

Thanks
Baharul Islam

Baharul – If you want the inputList contents as part of the outputList, then loop over inputList and assign each item to outputList. Is this what you are asking about?

Thanks,

Are you trying to be proactive when it comes to improving processing time or are you actually seeing sub-optimal performance and that’s why you’re doing this?

If the latter, I should tell you that I have created similar implementations in the past, where several documents were combined into document lists and published as batches for the sake of ensuring optimal performance. However, looking back at it, I regret it. I was trying to solve a problem that didn’t yet exist, and by doing that, I made the application less flexible.

When you publish individual documents (instead of batches) you have a lot more flexibility on the subscribing side, as you can create subscribers whose filters are very specific. It also helps keep publishers and subscribers decoupled and it makes your application more event-driven.

I can elaborate more if needed, but if you don’t have to publish the documents in batches, don’t.

Good luck,
Percio

Hi,

if I understand the original request right, it is about splitting a large list into smaller ones for parallel execution.

As long as there is no identifier in the document that can be used for logical grouping, it will be difficult to find a criterion for deciding which message goes to which new list.

Regards,
Holger

If I understood correctly, I don’t think there’s any grouping criteria as long as he ends up with lists that contain a maximum of 100 records each. If we have to think of it in terms of grouping criteria though, we could think of the indices themselves as the grouping criteria. For example, documents[0] through [99] would be placed in one batch, documents[100] through [199] in another batch, and so on. Each batch would be published as a unit.

Now, I don’t understand the whole requirement around naming the input list the same as the output list. I also struggle with the requirement to split the list “without looping”. I don’t see how that’s possible. Whether you loop in Java or in Flow, some type of loop will be required.

Hopefully, my previous post convinces Baharul not to do this batching at all. :slight_smile: However, if he must or if others have a similar requirement, I see a few simple ways to go about it (I’m sure there are others):

  1. The most obvious but perhaps the one that performs the worst: use a LOOP step to loop over the array and build the sub-list as you loop. Once the size of the sub-list reaches 100, publish that list and start a new one.

  2. Loop in Flow using a REPEAT step and use a service similar to PSUtilities/ps.util.list:getSubArray to create a sub-list of 100 records at each iteration. This should outperform the option above because you loop fewer times (i.e. list size / 100) and because you're using System.arraycopy to create the sub-lists.

  3. Create a service similar to WmPublic/pub.document:groupDocuments which takes a list as input and returns a list of lists as output. However, instead of taking grouping criteria as input, it would simply take the desired size of the sub-lists, in this case 100. This service should also perform well because it loops as many times as option #2, it can likewise use System.arraycopy to create the sub-lists, and the loop is implemented in Java. Of course, once the service completes, you would have to loop over the output one more time to publish each sub-list.

With all this said, Baharul, whether you decide to do batching or not, there’s one thing that concerns me regardless: it sounds like you will be pulling 5k to 6k records into memory at once. I’m not sure how big these records are or what other processing your IS does, but you should find a way to stream or chunk the data, if you can.

Percio

Thank you all for sharing your feedback and suggestions.

Let me first explain my scenario, as it is causing some confusion here.
Our service receives a document list averaging around 5,000 records. We have to process each record individually, and processing an individual record takes 4-5 steps: checking the record against the backend DB, then performing an update/insert, along with some other condition checks.
So, if we process all the records in the main service within a single loop, handling one record at a time, it will take too much processing time. In this situation, my idea was to divide the whole record set into small groups and publish each individual list as a unit for parallel processing.

@Percio,
you are correct that some loop is required, either in Java or in a flow service, but I think performance can be better with a Java service if we can use the arraycopy method instead of a LOOP or REPEAT operation.
Can you please share the background code logic for ps.util.list:getSubArray / pub.document:groupDocuments if you have it, or if there's any way to extract the same?

We have written the code below to return a smaller block from the input document list; the starting index for retrieving the documents is maintained in the original service in a REPEAT step.



Service Input
=============
inputList..as docList
batchSize, startIndex, listName...as String variables

Service Output
==============
outPut..........as document

Service Code
============

IDataCursor pipelineCursor = pipeline.getCursor();
String batchSize = IDataUtil.getString( pipelineCursor, "batchSize" );
String startIndex = IDataUtil.getString( pipelineCursor, "startIndex" );
String listName = IDataUtil.getString( pipelineCursor, "listName" );

// inputList
IData[] inputList = IDataUtil.getIDataArray( pipelineCursor, "inputList" );

if ( inputList != null )
{
    IDataCursor pipelineCursor_1 = pipeline.getCursor();
    IData outPut = IDataFactory.create();
    IDataCursor outPutCursor = outPut.getCursor();
    IData[] tempList = new IData[Integer.parseInt(batchSize)];
    System.arraycopy(inputList, Integer.parseInt(startIndex), tempList, 0, Integer.parseInt(batchSize));

    IDataUtil.put( outPutCursor, listName, tempList );
    outPutCursor.destroy();
    IDataUtil.put( pipelineCursor_1, "outPut", outPut );
    pipelineCursor_1.destroy();
}

pipelineCursor.destroy();

Please share your feedback on this.

Baharul,

I agree that decoupling the code that receives the data from the code that performs the different database operations probably makes sense, so from what I can tell, pub/sub should be a good option. What I struggle with is when you say that “it will take much processing time.” Do you have requirements around how long it should take to process these records? Have you done any benchmarking to confirm that, without batching, it will take “too long”?

Being proactive and thinking about potential performance issues is good but I want to caution you not to try to solve a problem that does not exist. You may end up with an unnecessarily complicated solution that is hard to support and miss out on some key benefits of a pub/sub solution.

Also, please remember that you don’t have to batch the data to achieve parallelism. You can still publish individual records and make your trigger(s) concurrent. Also, note that publishing individual records makes your application more scalable because once you publish a list, that list has to be subscribed and processed as a unit by a single thread on a single server, whereas individual documents can be subscribed concurrently by multiple threads on multiple servers.

Regarding your code, what you have implemented is functionally equivalent to the PSUtilities service (getSubArray) I was referring to. If you want to compare the two, you can download PSUtilities from here: http://techcommunity.softwareag.com/ecosystem/communities/public/webmethods/products/suite/codesamples/18b5c231-b1d6-11e4-8d00-cd8d7ef22065/

I actually suggest you do because, from looking at your code, I see that you may be overlooking a key thing: you may not always have as many elements to copy as “batch size”. For example, if you only have 10 items left to copy but batch size is 100, it looks like you will still create an output array of 100 elements and you will try to copy 100 elements into it, which will likely result in an IndexOutOfBoundsException.
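
For illustration, a bounds-safe version of that copy (a sketch using the variable names from your service, not the actual PSUtilities code) could clamp the count like this:

int start = Integer.parseInt(startIndex);
int requested = Integer.parseInt(batchSize);
// never copy more elements than actually remain in the input
int copyCount = Math.min(requested, Math.max(inputList.length - start, 0));
IData[] tempList = new IData[copyCount];
System.arraycopy(inputList, start, tempList, 0, copyCount);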

If I have time today, I’ll throw something together around the “groupDocuments” idea. Hopefully by then, I will have convinced you not to use it. :slight_smile:

Percio

Here’s a sample of what the splitList service, based on the groupDocuments service concept, may look like:

IDataCursor cursor = pipeline.getCursor();
IData[] list = IDataUtil.getIDataArray(cursor, "list");   // Original list
int size = IDataUtil.getInt(cursor, "size", list.length);   // Desired size of sub-lists (i.e. batch size)

int iterations = list.length / size;   // Number of times to loop to create sub-lists
int remainder = list.length % size;   // In case the original list size is not a multiple of 'size'
IData[] splitList = new IData[remainder == 0 ? iterations : iterations + 1];	// Output list

int i;
for(i = 0; i < iterations; i++) {
	IData subList[] = new IData[size];
	System.arraycopy(list, size * i, subList, 0, size);
	splitList[i] = IDataFactory.create(new Object[][]{{"subList", subList}});
}
if(remainder != 0) {
	IData subList[] = new IData[remainder];
	System.arraycopy(list, size * i, subList, 0, remainder);
	splitList[i] = IDataFactory.create(new Object[][]{{"subList", subList}});
}

IDataUtil.put(cursor, "splitList", splitList);
cursor.destroy();
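
For completeness, a caller could then unwrap each batch like this (just a sketch; the actual publish would typically be a pub.publish:publish step in the calling flow service):

for (IData group : splitList) {
    IDataCursor c = group.getCursor();
    IData[] batch = IDataUtil.getIDataArray(c, "subList");   // one batch of up to 'size' records
    c.destroy();
    // hand 'batch' to the publishing step here
}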

I’m attaching a package with the service plus a demo service as well.

Percio
SplitList.zip (9.35 KB)

Percio, I really enjoy reading your posts! Good thoughts, clearly expressed. We should compile your posts; that would make a good course for webMethods developers.

LOL I don’t know about my posts having enough content for a good course, but I appreciate the compliment nonetheless. :slight_smile:

Hi Percio,

To test the scenario for performance, I have created some services:

mainService1: publishes the incoming document in batches of 100 records.
mainService2: publishes the incoming document records one by one.
StartService: publishes a simple document to initiate mainService1 and mainService2 in parallel.
subscribe_service1: subscribes to the documents published by mainService1.
subscribe_service2: subscribes to the documents published by mainService2.
To keep the services simple for testing, I have used getFile to read a file of 5k records from the file system.

What I have observed is that mainService1 always completes earlier than mainService2, and subscribe_service1 completes much earlier than subscribe_service2.

One more thing I observed: there may be a Broker performance hit with one-by-one publishing, as the dispatcher needs to connect for each document.

I have tested the publishing mechanism using Local, Broker, and UM.
Please find attached the package and the sample file that I used.
Please test on your setup and let us know the results you observe.

Thanks
Baharul Islam

Record_details.txt (293 KB)
PerformanceTest.zip (42.3 KB)

Baharul,

I’ll try to take some time later this week to download your packages, look through them, and run a few tests. For now though, here are a few observations:

  1. Running the services in parallel to see which one finishes first is not a very scientific way of testing this. You are better off running them separately, in isolation, and using some type of timer to determine how long each execution takes (see the timing sketch after this list). Take several measurements in case of anomalies.

  2. Similarly, a service executing “faster” or finishing “earlier” than another service is a relative thing. Focus on more precise measurements so you can say exactly “how much faster” one approach is versus the other.

  3. I still don’t see any SLA requirements. How fast does your integration actually need to process these 5,000 records? Where is the requirement coming from? You may find that option #1 is 100 times faster than option #2, but if option #1 completes in 5 milliseconds, option #2 completes in 500 milliseconds, and your requirement is to complete within 5 seconds, then who cares?

  4. I do not see any reference to parallelism when it comes to the subscribing services. Option #2 should do well when configured with concurrent triggers (of course, depending on how well the target system handles concurrency.) So, play around with the number of threads on the trigger to see what impact it has.

  5. Make sure you measure performance from end-to-end, i.e. from the time your main service starts to the time the last record is processed by the subscribing service. Comparing only mainService1 to mainService2 and subscribe_service1 to subscribe_service2 may not show you the whole picture.

  6. Last but not least, in case I have not made my point clearly yet, remember that performance is not everything. Readability, maintainability, re-usability, extensibility, etc., are also very important. Unless you have a specific SLA, as suggested in #3 above, that is not being met by the simpler design, optimization for the sake of optimization can cause more harm than good.
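
Regarding #1, a timing harness can be as simple as the following sketch (the folder and service names here are hypothetical placeholders, and the service under test is invoked with an empty pipeline):

long start = System.nanoTime();
try {
    // invoke the service under test with an empty pipeline
    Service.doInvoke("myFolder", "mainService1", IDataFactory.create());
} catch (Exception e) {
    throw new ServiceException(e.getMessage());
}
long elapsedMs = (System.nanoTime() - start) / 1000000;
System.out.println("mainService1 took " + elapsedMs + " ms");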

Percio