I am using pub.io:read along with the getFile service to read a CSV file.
It is taking too much time to read the file. The first 100 lines are read quite fast, but after that the time per batch grows sharply, seemingly exponentially. For example, reading from line 200 to line 300 takes 30 s, while line 500 to line 600 takes 3 min …
General rule of thumb: don't allocate objects inside a loop. Move as much as possible outside of the loop, and reuse arrays, string buffers, etc.
Disable more and more until you can isolate what is causing the bottleneck. I suspect that calling createByteArray and bytesToString on each iteration may be contributing. You can't really get rid of the bytesToString call, but you should be able to use just one byte array, allocated before the loop and large enough to hold the largest row.
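In plain Java terms, here is a minimal sketch of the buffer-reuse idea (the file name and buffer size are illustrative, not from this thread; inside a flow service the same effect comes from calling createByteArray once before the LOOP):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReuseBufferExample {
    public static void main(String[] args) throws IOException {
        // One buffer, allocated once before the loop and sized for the
        // largest chunk we expect to read (8192 here is illustrative).
        byte[] buffer = new byte[8192];

        try (FileInputStream in = new FileInputStream("input.csv")) { // hypothetical file
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                // Convert only the bytes actually read; the buffer itself
                // is never reallocated inside the loop.
                String chunk = new String(buffer, 0, bytesRead, StandardCharsets.UTF_8);
                // ... parse rows out of chunk here ...
            }
        }
    }
}
```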
tokeniseString would be another thing to investigate.
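If tokenizing does turn out to be the hot spot, one option is a small Java service that splits each row with indexOf instead of a tokenizer. A minimal sketch, assuming plain comma-delimited rows (splitRow is a hypothetical helper and deliberately ignores quoted fields):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeExample {
    // Splits one CSV row on commas using indexOf; the only allocations
    // are the substrings themselves, with no tokenizer or pattern objects.
    static List<String> splitRow(String row) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int comma;
        while ((comma = row.indexOf(',', start)) != -1) {
            fields.add(row.substring(start, comma));
            start = comma + 1;
        }
        fields.add(row.substring(start)); // last field after the final comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitRow("a,b,c")); // prints [a, b, c]
    }
}
```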
Be aggressive in dropping pipeline variables. Drop vars as soon as you possibly can.
I’d get rid of the GC call every 100 lines. If you have a large heap and a stop-the-world collector, you’ll be adding a significant pause every 100 lines.
Following Rob’s advice re: not allocating variables inside the loop should eliminate the perceived need to call System.gc().
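To see the cost being described, here is a small sketch that times an explicit System.gc() every 100 iterations (the array sizes are arbitrary; on a large heap each call can pause the JVM for a noticeable fraction of a second):

```java
public class GcPauseExample {
    public static void main(String[] args) {
        // Keep some data live so the collector has real work to do.
        byte[][] retained = new byte[1000][];

        for (int i = 0; i < 1000; i++) {
            retained[i] = new byte[100_000];

            if (i % 100 == 0) {
                long start = System.nanoTime();
                System.gc(); // requests a full (typically stop-the-world) collection
                long pauseMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("GC at row " + i + ": " + pauseMs + " ms");
            }
        }
    }
}
```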