Need suggestions on how to merge document lists.

Hello,

I have a need to merge 2 or more document lists into 1 master document list.
The source lists have the same structure.

The target list needs to contain a unique set of all the data found in each source list.
There is a “key” element (“Person ID”) in the documents I need to use for comparison.

I built a prototype in “flow” code by looping through one document list and using the XML querying services to search the second list for the same records. If a record is not found, I append the current doc from list 1 to the master document list.

This approach works, but I imagine it’s probably going to be quite slow when the document lists have hundreds or thousands of records.

Any suggestions? All comments, ideas, further questions, would be appreciated.

Thanks,
Matt

I’m assuming that the document field values are the same when a person ID occurs in both (identical records).

A couple of Java services using a HashMap should do the trick.

First, create a couple of helper Java services.

hashMap:put
–Inputs: map (object), key (object), value (object)
–Outputs: map (object)
IDataCursor idc = pipeline.getCursor();
java.util.HashMap map = (java.util.HashMap)IDataUtil.get(idc, "map");
if (map == null)
    map = new java.util.HashMap();
map.put(IDataUtil.get(idc, "key"), IDataUtil.get(idc, "value"));
IDataUtil.put(idc, "map", map);
idc.destroy();

hashMap:valuesAsDocuments
–Inputs: map (object)
–Outputs: list (document list)
IDataCursor idc = pipeline.getCursor();
java.util.HashMap map = (java.util.HashMap)IDataUtil.get(idc, "map");
IDataUtil.put(idc, "list", map.values().toArray(new IData[0]));
idc.destroy();

Your merge steps can be:

LOOP over ‘/list1’
…hashMap:put – map list1/personID to key; list1 to value; map to map
LOOP over ‘/list2’
…hashMap:put – map list2/personID to key; list2 to value; map to map
hashMap:valuesAsDocuments – map to map

HashMap.put allows only unique keys. When a list2 entry with the same key as a list1 entry is put to the map, the list2 value (document) replaces the value that was there.
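If it helps to see that replacement behavior in isolation, here’s a plain-Java sketch (using String values in place of IData documents, purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class PutReplaceDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("3", "Charlie from list1");
        // put() on an existing key replaces the old value and returns it
        String previous = map.put("3", "Charlie from list2");
        System.out.println(previous);      // Charlie from list1
        System.out.println(map.get("3"));  // Charlie from list2
        System.out.println(map.size());    // 1 -- still only one entry for key "3"
    }
}
```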

The valuesAsDocuments service assumes that the values stored in the map are IData objects.

With this approach, these 2 lists:

list1 (personID, name, city)
1, Abe, Anaconda
2, Bill, Butte
3, Charlie, Cutbank
4, Dan, Dillon

list2
3, Charlie, Cutbank
5, Eamon, Ennis
6, Fred, Fort Benton

Will result in this document list:

1, Abe, Anaconda
2, Bill, Butte
3, Charlie, Cutbank
4, Dan, Dillon
5, Eamon, Ennis
6, Fred, Fort Benton
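Outside of Integration Server, the whole merge can be sketched in plain Java. This is just an illustration of the technique, not the IS services themselves: Map<String, String> stands in for the IData documents, and I’ve used a LinkedHashMap (my own choice, not required) so the output keeps the insertion order shown above:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeDemo {
    // Merge two "document lists" keyed on personID; on a key collision,
    // the list2 document replaces the list1 document, as with HashMap.put.
    static List<Map<String, String>> merge(List<Map<String, String>> list1,
                                           List<Map<String, String>> list2) {
        Map<String, Map<String, String>> byId = new LinkedHashMap<>();
        for (Map<String, String> doc : list1) byId.put(doc.get("personID"), doc);
        for (Map<String, String> doc : list2) byId.put(doc.get("personID"), doc);
        return new ArrayList<>(byId.values());
    }

    // Small helper to build a "document"
    static Map<String, String> doc(String id, String name, String city) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("personID", id);
        d.put("name", name);
        d.put("city", city);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> list1 = List.of(
            doc("1", "Abe", "Anaconda"), doc("2", "Bill", "Butte"),
            doc("3", "Charlie", "Cutbank"), doc("4", "Dan", "Dillon"));
        List<Map<String, String>> list2 = List.of(
            doc("3", "Charlie", "Cutbank"), doc("5", "Eamon", "Ennis"),
            doc("6", "Fred", "Fort Benton"));
        for (Map<String, String> d : merge(list1, list2))
            System.out.println(d.get("personID") + ", " + d.get("name")
                    + ", " + d.get("city"));
        // prints the six unique records, 1 through 6
    }
}
```

Note that LinkedHashMap keeps a re-put key in its original position, so the merged output comes out 1 through 6 even though “3” appears in both lists.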

The put service could accept another input, initialCapacity, to allocate enough space up front, but that might not be necessary.

I hope that’s what you were looking for!

Thanks for the assistance Rob…This looks like it’s exactly what I need.

Do you think it would make a difference on how I do the looping?
Meaning…in flow or in a new java service?

My assumption is that it would be quicker in java.

Never mind…

I wrote the whole thing in a Java service…most definitely faster.

Thanks for your help again Rob…

You’re welcome.

Be careful about doing things in Java just because they seem to be faster. If you did the entire thing in 1 Java service, you now have something that can be used for just one thing–merging two doc lists. Also, is speed the only consideration?

With the approach of focused Java services used by a FLOW service, you will have a couple of Java services that can be used any time a hash map would be handy. Plus, the FLOW service can be debugged in Developer whereas the Java service cannot.

Typically we only use Java when absolutely necessary. For the most part we try to do everything in flow, as you saw with my original post. For now, though, I posted this request purely for performance reasons.

I agree that creating these as separate services would be good. I may end up doing that anyway as we do maintain a common area for shared enterprise utility services.
This service might be a tad more specialized than I would normally want them to be but it’s not set in stone.

Cool. Sounds like we have similar philosophies! :slight_smile:

Please disregard this message…I figured out the issue on my own…

Rob,

I’m running into a little problem dealing with hashmaps when an element in the document I’m trying to add to the hash map is NULL (not even in the pipeline).

Here’s an example

1 Test1 Test1
2 Test1

This XML represents an example of the structure of the document list I’m trying to hash.
While debugging, I get an “NSRuntimeException” when the HashMap.put command is called on the second document.

I know it is because “PARM2” is null in the document. If I edit the pipeline and force the value to an empty string, it works.

Do you know of anyway around this?

If I can get away with it, I don’t want to have to loop through the document list in flow to default the parameters to empty string if they are null.

Any thoughts?

Thanks,
Matt

Can you share the resolution?

Sure…no problem…

First, the error that was occurring was an NSRuntimeError, and I was losing my connection to the Integration Server every time I ran the code. It would then reconnect instantly. Odd.

I spent probably 5 hours trying to figure things out and, as things would have it, it wasn’t really an issue with NULLs. I tried the code with the data “as is” as well as with all NULLs defaulted to empty string…no go. Along the way I found out that HashMaps can accept NULLs for the values, and the keys. So that wasn’t the problem.
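For the record, java.util.HashMap does permit one null key and any number of null values (unlike the older Hashtable, which rejects both), so a missing field by itself shouldn’t break the put. A quick stand-alone check, independent of the IS pipeline:

```java
import java.util.HashMap;
import java.util.Map;

public class NullDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("2", null);   // null value: allowed
        map.put(null, "x");   // null key: allowed (at most one)
        System.out.println(map.size());            // 2
        System.out.println(map.containsKey("2"));  // true
        System.out.println(map.get(null));         // x
    }
}
```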

Then, almost by accident, I decided to just “run” the code instead of stepping through.
Interestingly enough, it worked as long as I dropped the hashmap object from the pipeline prior to the service completing.

Something about this particular chunk of data (a memory issue, maybe) would not allow me to debug the service. It’s odd, really, since I used this code in a different interface service where the document list being hashed has more records than this one. It does not crash, and I can step through the code with no difficulties.

I’m not sure if you would call this a solution…it’s more of a “luck” thing.
I’m still trying to get to a root cause, but I was able to get this working by adding temporary code to analyse the hash map without looking at it on the pipeline directly.