I am using pub.io:read along with the getFile service to read a CSV file.
It is taking too much time to read the file. The first 100 lines are read quite fast, but after that the time per batch grows sharply, seemingly exponentially. For example, reading from line 200 to line 300 takes 30 s, while line 500 to line 600 takes 3 min …
General rule of thumb: don't allocate objects inside a loop. Move as much as possible outside of the loop, and reuse arrays, string buffers, etc.
Disable more and more until you can isolate what is causing the bottleneck. I suspect that calling createByteArray and bytesToString on each iteration may be contributing. You can't really get rid of the bytesToString call, but you should be able to use just one byte array, allocated before the loop and large enough to hold the largest row.
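In plain Java terms, here is a minimal sketch of the buffer-reuse idea (the file name and buffer size are illustrative, not from this thread; inside a flow service the same effect comes from calling createByteArray once before the LOOP):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReuseBufferExample {
    public static void main(String[] args) throws IOException {
        // One buffer, allocated once before the loop and sized for the
        // largest chunk we expect to read (8192 here is illustrative).
        byte[] buffer = new byte[8192];

        try (FileInputStream in = new FileInputStream("input.csv")) { // hypothetical file
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                // Convert only the bytes actually read; the buffer itself
                // is never reallocated inside the loop.
                String chunk = new String(buffer, 0, bytesRead, StandardCharsets.UTF_8);
                // ... parse rows out of chunk here ...
            }
        }
    }
}
```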
tokeniseString would be another thing to investigate.
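If tokenizing does turn out to be the hot spot, one option is a small Java service that splits each row with indexOf instead of a tokenizer. A minimal sketch, assuming plain comma-delimited rows (splitRow is a hypothetical helper and deliberately ignores quoted fields):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeExample {
    // Splits one CSV row on commas using indexOf; the only allocations
    // are the substrings themselves, with no tokenizer or pattern objects.
    static List<String> splitRow(String row) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int comma;
        while ((comma = row.indexOf(',', start)) != -1) {
            fields.add(row.substring(start, comma));
            start = comma + 1;
        }
        fields.add(row.substring(start)); // last field after the final comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitRow("a,b,c")); // prints [a, b, c]
    }
}
```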
Be aggressive in dropping pipeline variables. Drop vars as soon as you possibly can.
I’d get rid of the GC call every 100 lines. If you have a large heap and a stop-the-world collector, you’ll be adding a significant pause every 100 lines.
Following Rob’s advice re: not allocating variables inside the loop should eliminate the perceived need to call System.gc().
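To see the cost being described, here is a small sketch that times an explicit System.gc() every 100 iterations (the array sizes are arbitrary; on a large heap each call can pause the JVM for a noticeable fraction of a second):

```java
public class GcPauseExample {
    public static void main(String[] args) {
        // Keep some data live so the collector has real work to do.
        byte[][] retained = new byte[1000][];

        for (int i = 0; i < 1000; i++) {
            retained[i] = new byte[100_000];

            if (i % 100 == 0) {
                long start = System.nanoTime();
                System.gc(); // requests a full (typically stop-the-world) collection
                long pauseMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("GC at row " + i + ": " + pauseMs + " ms");
            }
        }
    }
}
```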