Large Flat File parsing using Iterator

We are getting a new interface through which we receive a flat file (around 30 MB) over HTTPS. We plan to route the flat file content to TN and process it further from there. (We have large document handling configured in TN, and any file over 20 MB is treated as a large document.)
When the IS service is invoked from the processing rule, the flat file content part is retrieved as a stream using 'getContentPartData' and passed on for flat file parsing. If we enable the iterator to process top-level records during parsing, will only the chunk currently being parsed be loaded into memory, or will the whole flat file content be kept in memory? From the documentation it looks as though the FF adapter keeps all the FF content in memory regardless of file size.

What is achieved by enabling the iterator during FF processing? Is it only that the ffValues for each chunk can be dropped from the pipeline, saving some memory?

Also, the IS document generated by parsing the 30 MB FF has to be carried over the Broker for target-side processing. Will there be any issues carrying such a huge document over the Broker?

Using iteration, only the top-level elements are retained in memory.

ffValues is dropped when your FLOW service drops it. The iterator manages how much of the source data is kept in memory.
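For reference, the usual pattern is to call pub.flatFile:convertToValues in a loop with iterate set to true, so each call returns just one top-level record. Below is a minimal sketch as an IS Java service; it assumes the standard WmFlatFile parameter names (ffData, ffSchema, iterate, ffIterator, ffValues), so verify them against your Flat File Schema Developer's Guide, and note that it needs the IS jars on the classpath.

```java
// Minimal sketch: record-by-record flat file iteration from an IS Java service.
// Parameter names are assumed from the standard WmFlatFile documentation.
import com.wm.app.b2b.server.Service;
import com.wm.data.IData;
import com.wm.data.IDataCursor;
import com.wm.data.IDataFactory;
import com.wm.data.IDataUtil;

import java.io.InputStream;

public final class FFIterationSketch {

    /** Parses one top-level record at a time so only the current record is held in memory. */
    public static void parseRecordByRecord(InputStream ffStream, String ffSchema) throws Exception {
        Object iterator = null;     // opaque handle returned by convertToValues when iterate=true
        boolean firstCall = true;

        while (true) {
            IData in = IDataFactory.create();
            IDataCursor ic = in.getCursor();
            if (firstCall) {
                IDataUtil.put(ic, "ffData", ffStream);   // the TN content-part stream
                IDataUtil.put(ic, "ffSchema", ffSchema); // fully qualified FF schema name
                IDataUtil.put(ic, "iterate", "true");    // return one top-level record per call
                firstCall = false;
            } else {
                IDataUtil.put(ic, "ffIterator", iterator); // resume where the previous call stopped
            }
            ic.destroy();

            IData out = Service.doInvoke("pub.flatFile", "convertToValues", in);
            IDataCursor oc = out.getCursor();
            IData ffValues = IDataUtil.getIData(oc, "ffValues");
            iterator = IDataUtil.get(oc, "ffIterator");
            oc.destroy();

            if (ffValues != null) {
                // Transform/append this single record here, then let ffValues go out of
                // scope (or drop it) so memory usage stays flat.
            }
            // Documented exit condition is a null ffIterator; the ffValues check is defensive.
            if (ffValues == null || iterator == null) {
                break;
            }
        }
    }
}
```

In a FLOW service the same loop is typically a REPEAT step that maps ffIterator back into the next convertToValues call and exits when it comes back null.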

Yes, there will most likely be issues.

Creating a full IS document out of the flat file kind of defeats the purpose of doing the flat file iteration. You're loading the entire file into memory, and in a representation that is rather heavy: it will consume far more than 30 MB.

Publishing large documents through the Broker can also be problematic.

What is the nature of the data? Does it all need to be processed as a single atomic unit? Is anything being done to manipulate/transform the data in IS, or is it just being passed through? What is driving the desire to publish this data as a single document?

Depending on the answers, we may be able to identify other ways to process the data efficiently.

The FF contains employee info in record format. Each record is unique and has no dependency on any other record.

Yes, the source data needs to be transformed in IS. It's not just a pass-through of data over wM.

Not really. We will be able to transform each record independently, as there is no dependency on any other record.

  1. The target systems need the whole data as a single file.
  2. The source data has to be transformed into different formats for different end systems, i.e. multiple subscribers for the same source data.

Won't it be possible to handle a large FF within webMethods without chunking the files?

Thanks for the additional information.

There are a couple of strategies that can be followed to work with large files yet be memory/process efficient.

Option 1:
Publish individual records from the source file and collect them into files on each target side. As each record is received by a target subscription, it is transformed and appended to a file. Then, periodically, the contents of the file are given to the target system (FTP, file move, etc.) for import/processing.
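As an illustration of the append step on each target side (plain Java rather than a specific wM built-in service; the batch file path and the already-transformed record are placeholders), something like this keeps only one record in memory at a time:

```java
// Illustrative sketch of appending each transformed record to a per-subscriber batch file.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class RecordAppender {

    // One batch file per subscriber, so no two processes contend for the same file.
    private final Path batchFile;

    public RecordAppender(Path batchFile) {
        this.batchFile = batchFile;
    }

    /** Appends one transformed record; the batch file is created on the first append. */
    public void append(String transformedRecord) throws IOException {
        Files.write(batchFile,
                (transformedRecord + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Inside IS the same thing would normally be a small Java service invoked by the trigger service; the key point is one batch file per subscriber.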

Option 2:
Keep the source file in an accessible location (FTP server, network volume, etc.) and publish information about the file, not the file itself. Then each target process reads and transforms the source file into its target format (using node iteration and writing each record to a file, not holding it in memory) and provides the file to the target system.
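A sketch of that per-target pass, again in plain Java with a hypothetical transformRecord function standing in for the flat-file-schema-driven transformation, could look like this:

```java
// Sketch of Option 2: stream the shared source file, transform each record, and write it
// straight to the target file so only one record is in memory at a time.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public final class StreamingTransform {

    /** Streams the source file and writes transformed records directly to the target file. */
    public static void transformFile(File source, File target,
                                     Function<String, String> transformRecord) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(source), StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8))) {
            String record;
            while ((record = reader.readLine()) != null) {   // assumes one top-level record per line
                writer.write(transformRecord.apply(record)); // target-specific format
                writer.newLine();
            }
        }
    }
}
```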

I’m sure others on the forum may have other options for consideration.

HTH

Won't this cause multiple subscribers to try accessing a single target file to append their information, which can result in a file lock error (file already in use)?

Reamon, we will be receiving the FF over HTTPS, and it will be routed to TN by a gateway service. So could we do something like this: the service invoked by the TN processing rule publishes the bizdoc information (TN internal ID) to the Broker.
On the subscribing side,

  • each subscriber reads the FF content from TN as a stream using 'getContentPartData',
  • parses the top-level records using the iterator,
  • transforms each record to the target format,
  • appends the string formed after transformation to the target file,
  • and, after parsing all records, delivers the target file to the end system.
    Do you see any known issues with this design?
    One noticeable point is that when we have multiple subscribers for the source data, the same data will be parsed on each subscribing side.
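For reference, a sketch of the retrieval steps in this design: the subscriber looks up the bizdoc by the published internal ID and then pulls the content part as a stream before entering the iteration loop shown earlier. The service names and parameters below (wm.tn.doc:view with internalId; wm.tn.doc:getContentPartData with bizdoc, partName and getAs) follow the usual TN built-in services, so confirm them against your TN Built-In Services Reference.

```java
// Sketch of the subscriber-side retrieval: bizdoc lookup by TN internal ID, then the
// flat file content part as a stream. Service/parameter names are assumed from the
// standard TN built-in services; verify against your TN version.
import com.wm.app.b2b.server.Service;
import com.wm.data.IData;
import com.wm.data.IDataCursor;
import com.wm.data.IDataFactory;
import com.wm.data.IDataUtil;

import java.io.InputStream;

public final class SubscriberSketch {

    /** Looks up the bizdoc by TN internal ID and returns the FF content part as a stream. */
    public static InputStream getFlatFileStream(String tnInternalId, String partName) throws Exception {
        // 1. Fetch the bizdoc that the publisher referenced by its TN internal ID.
        IData viewIn = IDataFactory.create();
        IDataCursor vc = viewIn.getCursor();
        IDataUtil.put(vc, "internalId", tnInternalId);
        vc.destroy();
        IData viewOut = Service.doInvoke("wm.tn.doc", "view", viewIn);
        IDataCursor voc = viewOut.getCursor();
        IData bizdoc = IDataUtil.getIData(voc, "bizdoc");
        voc.destroy();

        // 2. Retrieve the flat file content part as a stream (not bytes) so the
        //    payload is never fully materialized in memory.
        IData partIn = IDataFactory.create();
        IDataCursor pc = partIn.getCursor();
        IDataUtil.put(pc, "bizdoc", bizdoc);
        IDataUtil.put(pc, "partName", partName); // commonly "ffdata" for the primary FF part
        IDataUtil.put(pc, "getAs", "stream");
        pc.destroy();
        IData partOut = Service.doInvoke("wm.tn.doc", "getContentPartData", partIn);
        IDataCursor poc = partOut.getCursor();
        InputStream content = (InputStream) IDataUtil.get(poc, "partContent");
        poc.destroy();
        return content;
    }
}
```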

Only if the subscribers are trying to write to the same file, which should never be the case. One subscriber, one target file.

Which is necessary in any case, correct? It is unlikely that a single parsing pass will be able to generate more than 1 target representation.

I can suggest a multithreading approach: divide the flat file into a number of chunks and publish the temp filename of each chunk to your service, let each one process and produce an individual output chunk, then at the end merge all the chunks and publish the output filename (with its path) to the target system. To control the number of chunks and to ensure that all of them get processed, use the JDBC adapter to track completion. :)
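As an illustration of that chunking idea (plain Java, not a webMethods API; the chunk naming and the one-record-per-line assumption are placeholders), splitting on record boundaries might look like this:

```java
// Illustrative sketch: split the flat file into record-aligned chunk files that can be
// published and processed in parallel, then merged afterwards. Tracking the chunks in a
// database table (the JDBC adapter suggestion) is noted in a comment since the table
// design would be site-specific.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public final class ChunkSplitter {

    /** Splits the source into chunk files of at most recordsPerChunk records (lines). */
    public static List<File> split(File source, File workDir, int recordsPerChunk) throws IOException {
        List<File> chunks = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(source), StandardCharsets.UTF_8))) {
            String record;
            BufferedWriter writer = null;
            int count = 0;
            while ((record = reader.readLine()) != null) {
                if (writer == null || count == recordsPerChunk) {
                    if (writer != null) {
                        writer.close();
                    }
                    File chunk = new File(workDir, source.getName() + ".chunk" + chunks.size());
                    chunks.add(chunk);
                    writer = new BufferedWriter(new OutputStreamWriter(
                            new FileOutputStream(chunk), StandardCharsets.UTF_8));
                    count = 0;
                    // Here you would also insert one row per chunk into a tracking table
                    // (via the JDBC adapter) and mark it complete once processed.
                }
                writer.write(record);
                writer.newLine();
                count++;
            }
            if (writer != null) {
                writer.close();
            }
        }
        return chunks;
    }
}
```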