Large Flat File Processing

Hi,

I am receiving a 1 GB flat file containing header, detail, and trailer records that I need to process. I need to create multiple XML files depending on the header information.

I am publishing the header, detail, and trailer records to the Broker. I created a trigger that processes my records in single-threaded mode, but I would like to process the detail records with multiple threads.

  1. How can I identify whether all my detail records have been processed by the multi-threaded trigger?
  2. The convertToValues service creates @fields. I could not publish @fields to my Broker. How can I drop them from the convertToValues output?

Thank you very much for your help.

Thanks
Sam.

Hi Sam,
A 1 GB flat file is indeed a large file, and special care should be taken when handling a file of this type, considering that the IS JVM cannot have more than ~2.5 GB allocated.

Below is how I processed a file of this size (> 1 GB, more than 500,000 records):

1> To avoid any overhead, I used only IS (no Broker, no TN, no Modeler, etc.).

2> Wrote a fileSplitter Java service. This service reads the input file as a stream, goes through each record to do some validations, and creates a bunch of temp files of 20,000 records each (the chunk size is configurable, supplied as an input to the service). Its output is the list of temp files thus created (see the Java sketch after this list).

3> The main flow then processes these files one at a time, creates the desired output file, and deletes each temp file after processing, all in a loop. The data is appended to the output file on each pass through the loop.

4> Found that a split size of 20,000 records was optimal (total processing time < 2 hrs.). Setting it to a higher or lower value increased the total processing time.

5> The solution is scalable: if the input file grows in the future, the splitter will simply create more temp files, but IS will still handle only a small chunk of data (20,000 records) at a time and so will not go out of memory.
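
For illustration, here is a minimal sketch of the splitting and driver idea in plain Java. My real service also validates each record and takes its inputs/outputs through the IS pipeline; the class name, chunk handling, and paths below are simplified for this example.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class FileSplitter {

    // Split a newline-delimited flat file into temp files of
    // chunkSize records each; return the list of temp files created.
    public static List<File> split(File input, int chunkSize) throws IOException {
        List<File> chunks = new ArrayList<File>();
        BufferedReader reader = new BufferedReader(new FileReader(input));
        BufferedWriter writer = null;
        String line;
        int count = 0;
        try {
            while ((line = reader.readLine()) != null) {
                if (count % chunkSize == 0) {        // roll over to a new temp file
                    if (writer != null) writer.close();
                    File chunk = File.createTempFile("split_", ".tmp");
                    chunks.add(chunk);
                    writer = new BufferedWriter(new FileWriter(chunk));
                }
                // (the real service does per-record validations here)
                writer.write(line);
                writer.newLine();
                count++;
            }
        } finally {
            if (writer != null) writer.close();
            reader.close();
        }
        return chunks;
    }

    // Driver corresponding to step 3 above: process the chunks one at a
    // time, append to a single output file, delete each chunk when done.
    public static void main(String[] args) throws IOException {
        List<File> chunks = split(new File(args[0]), 20000);
        BufferedWriter out = new BufferedWriter(new FileWriter(args[1], true));
        try {
            for (File chunk : chunks) {
                // ... convertToValues / mapping / XML creation per chunk ...
                out.write("processed " + chunk.getName());
                out.newLine();
                chunk.delete();
            }
        } finally {
            out.close();
        }
    }
}
```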

Your integration could be totally different from mine, but I wanted to give you some pointers to ponder while handling a large file (> 1 GB).

HTH,
Bhawesh.

Nice description, Bhawesh.

Questions for Sam:

Is there a reason for publishing the records to the Broker?

Does the order of the records need to be maintained?

Do the records need to be processed as a group?

Do the header and trailer have meaningful information or are they just control records to verify that you’ve received the entire file?

You can publish @fields (attributes) to the Broker. You just can’t read them with anything on the subscriber side other than IS.

What is the target of this 1 GB file? Another file in a different format? Inserts into a database table?

The answers will help guide an appropriate solution.

Thank you very much for your help.

  1. Yes, I need to maintain the order at the line-number level, like below.

Header
Line 1
Detail1.1
Detail1.2
Line 2
End of record.

  2. Yes, I need to process the whole file.

  3. The header has meaningful information, but the trailer does not.

  4. I am creating an electronic catalog file.

In the future I will need to support multiple versions of the XML document from a single flat file.

  1. Must the data for item 2 be processed after item 1? Of course the lines that make up an item (line1, detail1.1, detail1.2) need to stay together, but can the items be processed independently and in any order?

  2. The question wasn’t whether or not you need to process the entire file, but rather whether all the items in the file must be processed as a single unit; in other words, if one item fails for some reason, can you continue with the remaining items, or do you need to stop and roll back all work up to that point?

  3. A flat file? An XML file? An “electronic catalog file” is not descriptive enough.

  1. Each line must be processed as a unit.
    Line1
    Detail 1.1
    Line2
    Detail 2.1

  2. Yes, I need to roll back if I get any exception during my processing.

  3. It is an XML file.

In this case, the process outlined by Bhawesh should work fine for you too. You do not need to publish items to the Broker; do not use it as part of your solution. Be careful with how you write your file splitter service: do as little processing in it as possible.

Hi Sam,
I can send you the fileSplitter Java service. As Rob mentioned, I have optimized this service for performance, because it is the service that takes the hit of reading the large file.

Bhawesh.

I am having a similar issue to what is being described in this thread and would appreciate any insight anyone can offer. I need to process a 60-70 MB file. The format of the file is:
O - Order header information.
B (1 occurrence per O)
S (1 occurrence per O)
P (multiple occurrences per O)
There are about 70,000 Order records that need to be processed. I do not need to store these records anywhere, just process them and send an email.

I am able to run my service if I read a small sample file using getFile. But when I set iterate=true, I cannot do anything with the data. I am mapping ffValues to a document created from the flat file schema, and I use the correct schema name as the ffSchema value in convertToValues. If I savePipelineToFile, I can see data in my document, but if I try to write any of that data to the debug log, the values are null.

As I mentioned before, if I bring the whole file into memory using getFile, the service works fine. I just cannot seem to stream the data in.
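
In case it helps to see it concretely, here is roughly the looping pattern I am attempting, rendered as a Java sketch. My actual service is a FLOW REPEAT loop; the parameter names come from pub.flatFile:convertToValues, while the file path and schema name below are made up.

```java
import com.wm.app.b2b.server.Service;
import com.wm.data.*;
import java.io.FileInputStream;
import java.io.InputStream;

public class IterateSketch {
    public static void process() throws Exception {
        // Open the file as a stream instead of loading it all with getFile.
        InputStream in = new FileInputStream("/data/orders.dat");   // made-up path

        IData input = IDataFactory.create();
        IDataCursor c = input.getCursor();
        IDataUtil.put(c, "ffData", in);
        IDataUtil.put(c, "ffSchema", "myFolder:orderSchema");       // made-up schema
        IDataUtil.put(c, "iterate", "true");
        c.destroy();

        Object ffIterator;
        do {
            IData out = Service.doInvoke("pub.flatFile", "convertToValues", input);
            IDataCursor oc = out.getCursor();
            IData ffValues = IDataUtil.getIData(oc, "ffValues");    // one top-level record
            ffIterator = IDataUtil.get(oc, "ffIterator");
            oc.destroy();

            if (ffValues != null) {
                // Map/process the record HERE, inside the loop: ffValues
                // is replaced on every iteration, so anything read after
                // the loop has finished will come back null.
            }

            // Feed the iterator back in to get the next record.
            input = IDataFactory.create();
            IDataCursor ic = input.getCursor();
            IDataUtil.put(ic, "ffIterator", ffIterator);
            IDataUtil.put(ic, "iterate", "true");
            ic.destroy();
        } while (ffIterator != null);

        in.close();
    }
}
```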

Can anyone help?

Bhawesh,

Can you please send the fileSplitter Java service to me as well? A similar problem has come up for me. My email id is: sspone-wmusers@yahoo.com

Rgds,
Sandeep

Hi Bhawesh,

I need to split a huge file.
Can you send me your fileSplitter Java service?

My email: thierry.ahcow@bnc.ca

Thanks
Thierry

Hi Bhawesh,

I need to split fairly large files being sent between different Integration Servers via a Broker, one in the US and one in ASPAC.
The files are approximately 50 MB in size, but around 50+ documents get sent to the Broker on the US side, so the Broker is receiving quite a lot of traffic.

Could you please email me the filesplitter service to aditya.gollakota@customware.net

Thank you,

Aditya Gollakota

Bhawesh/Sandeep/All,

Can you please send the fileSplitter Java service to me also?
My email id is: datta.saru@gmail.com
Please send it ASAP.

Regards,
Datta

Bhawesh,

I saw a lot of requests for the file splitter service that you have written. To keep the thread from growing with such requests, I would suggest you attach a zip of the code to the post itself.

Regards

Hi,
I get lots of requests for this service, so I thought it would be a good idea to attach it here.
HTH,
Bhawesh Singh.
sortSplit.zip (16.1 KB)

A quick note for all: splitting a file is also doable with the flat file services provided in IS (pub.flatFile:convertToValues with iterate set to true reads one record at a time from a stream).

Oh, they are not necessarily flat text files. They could be zip files as well. I need to connect to the SFTP server and get them. I was thinking of letting the underlying Unix commands get the file instead of pulling it into pipeline memory. But how do I do that? Any ponderings?

You won’t be able to “split” a zip file, per se. The files within the zip will need to be extracted; then you can process the resulting files.

The FTP facilities can be used such that entire files are not loaded into memory.
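
If it is useful, the extraction itself can also be done as a stream in plain Java, so the archive is never held in memory all at once. A sketch (class name and paths are illustrative):

```java
import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipExtract {
    // Stream each entry of a zip archive out to its own file, copying
    // in small buffers so neither the archive nor any entry is ever
    // loaded into memory in full.
    public static void extract(File zipFile, File targetDir) throws IOException {
        ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile));
        try {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                File outFile = new File(targetDir, new File(entry.getName()).getName());
                OutputStream out = new FileOutputStream(outFile);
                try {
                    int n;
                    while ((n = zis.read(buf)) > 0) {
                        out.write(buf, 0, n);
                    }
                } finally {
                    out.close();
                }
                zis.closeEntry();
            }
        } finally {
            zis.close();
        }
    }
}
```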

Great job, B Singh. Your code works fine with 32-bit webMethods; however, it sometimes skips a portion of a line on 64-bit webMethods. Please advise. I don’t know Java. Do you have another service that does the same on 64-bit? Thanks.