Read File List Content Performance inefficiency

I have a directory that holds anywhere from 1 to 30,000 invoice files. A Java service is invoked on a timed interval; it performs a “read file list content” and creates a single “batch” file out of however many invoice files are in the directory. The problem I am having is performance. For 5,000 or fewer invoice files, the Java service works well. However, for anything over 5,000 invoices it takes a very long time to read the files and create this batch file.

Are there any other utilities or tools, as opposed to this custom Java service, that I can use to accelerate the process? For example, 30K files take 3-4 hours to complete the batch file creation…

Batch scheduled bulk loads are best done with ETL tools like Informatica, Embarcadero, Microsoft DTS, etc.

In this case, the critical question is: Is the format of the invoice files consistent? If so, this is definitely an ETL problem.

If the invoice files are all formatted differently, the solution will be more complicated.

Regards

Mark,

The formats are consistent. Can you elaborate a bit more regarding ETL?

An ETL tool performs some of the same functions as an EAI tool. For example, take a source field, and map it to a target field with an expression applied.

The major difference is that the ETL tool is optimized to move large numbers of consistent source records through a ‘transformation pipeline’ on their way to the target. (This is a different use of the word pipeline from webMethods; here it refers to a set of potentially hundreds or more records all being processed at once.) ETL tools use large caches, parallel processing, and target database bulk-load interfaces to move this data very quickly.

If you do not have an ETL tool available, then what to use depends on the target. If it’s a database, I’d look at your database vendor’s bulk load tool. If it’s a file, perhaps a Java program.

Regards.

We are moving files from the webMethods server to the mainframe system via FTP. The “moving” itself is not the bottleneck here; it is actually the file prepping (meaning bundling 30,000 individual invoice files into one big file to be ftp’ed to the mainframe). We use a custom Java service to do this bundling routine.

ETL sounds like a good long term fix but I am more interested in any quick fix that would improve on the performance of bundling the files to be ftp’ed. Any suggestion is appreciated.

Thai Tran,

I agree with Mark R. that WM IS may not be the most elegant solution for this task. If you had access to an ETL tool and time to learn how to use it, that may be a better fit. Given that you’ve said that is not the case, here are a couple of questions:

  1. Does your custom “bundling” service use Java’s streaming I/O classes (e.g. java.io.FileInputStream and java.io.FileOutputStream)? If not, you may be running out of available memory and having to wait for GC to occur in the IS JVM. You could confirm this by running IS with the appropriate verbose GC parameter for the JVM you are using.

  2. Do you have a single instance of the bundling service running? If so, it may be possible to run multiple “bundlers” on their own threads and then combine their results. Of course, this approach would assume that you are using streaming I/O classes and that you have enough processing power and memory on the server.

HTH,

Mark

I noticed, particularly on MS Windows, that the file system can slow down significantly when navigating in directories with >~500 files. So if you are on Windows you may want to look at minimizing calls like File.list() or saving off the files into a number of subdirectories. Note that Windows can cache this info, so it is hard to create a repeatable test that shows what the disk navigation cost will be to your app in the wild.

On non-Windows systems you may need to raise the default number of file handles made available to each process by the OS if you are going to process the files in parallel.

It does sound to me like having N doThreadedInvokes each processing a directory with roughly 1/Nth the total number of files could get you a good performance boost, particularly on multi-CPU boxes.
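A rough sketch of that idea, for illustration only. It uses a plain java.util.concurrent thread pool as a stand-in for doThreadedInvoke, and the class and method names (ParallelBundler, bundle, concat, listFiles) are made up for this example. It assumes the invoices have already been spread across N subdirectories and that the batch file is written somewhere outside them.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class; none of these names come from the original service.
final class ParallelBundler {

    // Streams each input file, in list order, into the target file.
    static void concat(List<Path> inputs, Path target) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            for (Path input : inputs) {
                Files.copy(input, out); // copies bytes without loading the whole file
            }
        }
    }

    // Lists the regular files of one directory exactly once, sorted by name.
    static List<Path> listFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    // One worker per subdirectory writes a part file; the parts are then joined in order.
    static void bundle(List<Path> subDirs, Path batchFile) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, subDirs.size()));
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path dir : subDirs) {
                pending.add(pool.submit(() -> {
                    Path part = Files.createTempFile("part", ".tmp");
                    concat(listFiles(dir), part);
                    return part;
                }));
            }
            List<Path> parts = new ArrayList<>();
            for (Future<Path> result : pending) {
                parts.add(result.get()); // waits for the worker and rethrows its failure
            }
            concat(parts, batchFile);
            for (Path part : parts) {
                Files.delete(part);
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this beats a single streaming pass depends on the disk and the number of CPUs, so it is worth measuring both before committing to the extra complexity.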

Cheers,
Fred

Mark,

Thanks for your post. To answer your questions 1 and 2: I am not using java.io.FileInputStream and java.io.FileOutputStream in my Java service. I will look into this and make the change to see if it helps.

Agree with Fred that writing your invoice files to multiple directories could be beneficial both from a Windows OS performance perspective and because it simplifies use of multiple threads by avoiding the need to come up with a scheme to assign files to be processed to a particular thread. Each thread would just process the files in its assigned folder. Of course, that would assume that roughly the same number of files were written to each folder yada-yada-yada.

Mark

Thai: Might it be possible for you to post your Java service for review? We might then be able to point out performance issues within the service, if any.

Rob,

Here is the custom java code. Please let me know what your findings are…

Thanks


FileUtil.java (3.7 k)

Things I notice on cursory review:

  • The service loads the contents of all 1-30,000 files into memory at once. This will undoubtedly be slow when the number of files, and size of each file, is large. In the extreme case, it may crash your IS.

  • The service is using BufferedReader. Since the service loads the entire file into memory anyway, using BufferedReader isn’t helpful.

  • The service is using readLine() which is very expensive. Since the service is simply concatenating files together, using readLine() is unnecessary.

  • Does the service compile properly? In the readFileList() method the FilenameFilter variable is defined within the else block but is referenced outside the else block–thus it is out of scope and the code should not compile.

  • Within the catch blocks, the code should check to see if the Reader needs to be closed. If an exception is thrown, a file could be left open, causing file handle/resource leaks.
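To illustrate the last point about leaked file handles, here is a minimal sketch that assumes Java 7 or later, where try-with-resources closes the stream whether or not an exception is thrown. The class and method names (SafeAppend, appendFile) are hypothetical and are not taken from the attached FileUtil.java. On an older JVM the same effect needs a finally block that closes the stream.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of the attached service.
final class SafeAppend {

    // Copies the bytes of one file to an already-open target stream.
    // The input stream is closed on both the normal and the exception path.
    static void appendFile(File source, OutputStream target) throws IOException {
        try (InputStream in = new FileInputStream(source)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                target.write(buffer, 0, n);
            }
        }
    }
}
```

Reading raw bytes this way also sidesteps readLine(), since nothing in a plain concatenation needs to know where the lines end.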

What are you doing with the concatenated string after it is constructed? Writing it to a file? Sending it via FTP or HTTP to another server? Knowing what the end use of the concatenated files is will help determine how best to restructure this code for speed and efficient use of memory.

Rob,

Thanks for your valuable comments. The end result is to ftp the file to the mainframe system.

Thanks

I would offer that if the sole intent is to concatenate multiple files to FTP to the mainframe that other tools/approaches would be better suited than Integration Server. Shell scripts, FTP server tools, etc. are better suited for this type of operation.

However, if you must use IS for some reason, here are the steps you’ll want to follow:

  • Use BufferedInputStream and BufferedOutputStream classes. No need to use Reader classes unless you need to perform some sort of character or EOL manipulation.

  • Open a single FileOutputStream wrapped by a BufferedOutputStream to hold all the contents of your files.

  • For each input file, use FileInputStream wrapped by a BufferedInputStream to read the contents of each file. Write the bytes of each input file to the output file. Be sure to close each input file to avoid leaking file handles.

  • Once all the files have been written to the target output stream, close it, then open the resulting file as an input stream to pass to the appropriate FTP service. Close that input stream once the file has been sent to the mainframe.

This approach should provide decent speed and handle any size of resulting file without exhausting memory.
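The steps above might look roughly like the following. This is only a sketch under a few assumptions: the names InvoiceBundler and bundle are invented for the example, the files are appended in file-name order, and the batch file lives outside the invoice directory. The FTP step is left out because it depends on which service you call.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical replacement for the bundling routine; names are illustrative.
final class InvoiceBundler {

    // Appends every regular file in sourceDir to batchFile and returns the bytes written.
    static long bundle(Path sourceDir, Path batchFile) throws IOException {
        // List the directory exactly once, before the output file is created.
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    inputs.add(entry);
                }
            }
        }
        Collections.sort(inputs);

        long total = 0;
        byte[] buffer = new byte[64 * 1024]; // one reusable buffer, so memory use stays flat
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(batchFile.toFile()))) {
            for (Path input : inputs) {
                // Each input is closed before the next one is opened.
                try (InputStream in = new BufferedInputStream(new FileInputStream(input.toFile()))) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                        total += n;
                    }
                }
            }
        }
        return total;
    }
}
```

Only one input file and the output file are open at any moment, and memory use does not grow with the number or size of the invoices.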

HTH.


Rob,

I’ve run into this situation so many times at multiple sites this year that it is making me question whether we (the IT community) are making any progress towards becoming technology generalists.

The situation is always some type of bulk data movement that has been running a very long time. The developers have the deer-in-the-headlights look, the project manager insists it’s simple and shouldn’t take so long, the architect says an ETL tool wasn’t considered because they already had a mapping tool in webMethods. The CIO wants to know what’s “wrong” with webMethods.

All the alternatives you suggested are indeed much better. Using the Integration Server for large bulk processing is like emptying a pond through a soda-straw.

Regards

Mark/Rob,

Well said!!! I appreciate the feedback from both of you.

http://java.sun.com/docs/books/tutorial/essential/io/catstreams.html

How about this?
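For anyone who can’t open the link: as far as I recall, that tutorial page concatenates files with java.io.SequenceInputStream. A minimal sketch of the idea follows, assuming Java 8 or later for UncheckedIOException. The class name LazyFileStreams and the concatenate method are made up for this example. Each file is opened only when the previous one has been read to the end, so 30,000 files never need 30,000 open handles.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical Enumeration that opens each file lazily, one at a time.
final class LazyFileStreams implements Enumeration<InputStream> {
    private final Iterator<File> files;

    LazyFileStreams(List<File> files) {
        this.files = files.iterator();
    }

    @Override
    public boolean hasMoreElements() {
        return files.hasNext();
    }

    @Override
    public InputStream nextElement() {
        try {
            // SequenceInputStream closes the previous stream before asking for this one.
            return new FileInputStream(files.next());
        } catch (IOException e) {
            // Enumeration cannot throw checked exceptions, so wrap it.
            throw new UncheckedIOException(e);
        }
    }

    // Presents all the files as one continuous stream of bytes.
    static InputStream concatenate(List<File> files) {
        return new SequenceInputStream(new LazyFileStreams(files));
    }
}
```

If the FTP step accepts an InputStream, the combined stream could in principle be handed to it directly, which would skip the intermediate batch file. Whether that works depends on the FTP service being used.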