I’ve read the large document handling documentation and am able to split the documents up into components. For example, if a Purchase Order (PO) has 1000 lines, I can get to the stage where each line can be extracted without reading the entire document into memory. But the next question is, how do you process such a document into a backend system? For example, if you need to process this single PO into an SAP system, you still need to send all the line information together.
Could you please share some insights on how the backend system processing can be done? Does the transaction have to be broken up and then reassembled?
Have any of you split large documents? Any experiences that you can share would be great.
That depends on the target system you are using.
If the target is a DB, then you can easily add the PO data to the DB in the order you split it. But if it's an SAP system, then it depends upon the translation/mapping.
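To make that concrete, here is a minimal JDBC sketch of inserting each PO line into a staging table as it comes out of the split, so only one line (plus a small batch buffer) sits in memory at a time. The table and column names (PO_LINE, PO_NUMBER, and so on), the line iterator, and the string-typed columns are invented for illustration, not taken from any real schema:

```java
// Hypothetical sketch: insert each PO line into a staging table as it is split,
// so only one line plus a small JDBC batch is held in memory at a time.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;

public class PoLineLoader {

    // 'lines' would come from your large-doc splitting logic (node iterator, etc.).
    public static void load(Iterator<String[]> lines, String jdbcUrl) throws Exception {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO PO_LINE (PO_NUMBER, LINE_NUMBER, MATERIAL, QTY) VALUES (?, ?, ?, ?)")) {

            int count = 0;
            while (lines.hasNext()) {
                String[] line = lines.next();   // e.g. {poNumber, lineNumber, material, qty}
                ps.setString(1, line[0]);
                ps.setString(2, line[1]);
                ps.setString(3, line[2]);
                ps.setString(4, line[3]);       // types simplified to strings for the sketch
                ps.addBatch();

                if (++count % 500 == 0) {       // flush periodically to keep memory flat
                    ps.executeBatch();
                }
            }
            ps.executeBatch();                  // flush the remainder
        }
    }
}
```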
One possibility: for each line in the PO you can generate the corresponding segment of the target IDoc (via your mapping logic) and save it as a flat file on disk. Finally, you can merge these flat files (segments) into the IDoc and send it to SAP, so that you never need to load both the PO and the IDoc into memory at the same time.
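A rough sketch of that write-segments-then-merge idea in plain Java follows. The mapLineToSegment method and the E1EDP01 placeholder stand in for whatever your real IDoc mapping produces; the point is that only one segment file is open at any moment during the merge:

```java
// Rough sketch of "segment files on disk, then merge". The segment formatting
// is a placeholder; a real IDoc needs the proper record layouts from your mapping.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IdocSegmentMerger {

    public static void buildIdoc(Iterator<String> poLines, Path idocFile) throws IOException {
        List<Path> segmentFiles = new ArrayList<>();

        // 1. One pass over the split PO: map each line to a segment and write it to disk.
        int i = 0;
        while (poLines.hasNext()) {
            String segment = mapLineToSegment(poLines.next());          // your mapping logic
            Path seg = Files.createTempFile("idoc-seg-" + (i++) + "-", ".txt");
            Files.writeString(seg, segment + System.lineSeparator());
            segmentFiles.add(seg);
        }

        // 2. Merge the segment files into the final IDoc file, streaming one at a time.
        try (BufferedWriter out = Files.newBufferedWriter(idocFile)) {
            for (Path seg : segmentFiles) {
                try (BufferedReader in = Files.newBufferedReader(seg)) {
                    in.transferTo(out);
                }
                Files.delete(seg);
            }
        }
    }

    private static String mapLineToSegment(String poLine) {
        return "E1EDP01 " + poLine;   // placeholder for the real segment mapping
    }
}
```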
We are looking at a similar issue. We don’t have complex POs with 10000 PO1 segments; we have hundreds to thousands of ST segments that make tidy little transactions. Writing to the file system on a non-clustered IS might work for some, and it may be your best bet if it fits.
In an enterprise environment, we are using multiple IS clusters and intend to use the TN cluster to parse the large doc into single transactions, publish them, and allow the other bank of servers to translate. This puts the DB-intensive recognition and transport logic in one bank of JVM memory and the translation in another.
The sticky part is reassembling the original into a final text file. Ideally, we will push each translated record to the consuming business system as we split it out. That may not be acceptable for all of our business units that want or require a file/batch process.
Because our audit requirements mandate a final record count, we can’t simply perform an FTP append to the staging file system. Our thought at this stage of the game is to write or copy/modify the EDI batching routine for flat file records and allow time for any and all records to make it into the batch before we trigger a delivery. This allows all records present at the time of batching to be counted and reported. We could then report the transaction count in and compare it to the count out plus failures.
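As a sketch only (the class, the method names, and the counting scheme are invented here, not the actual EDI batching routine), the count-and-compare bookkeeping could look something like this:

```java
// Hypothetical sketch of count-in vs. count-out auditing: append translated
// records to a staging batch file, keep a running tally, and only release the
// batch when written + failed equals the count captured at split time.
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AuditedBatch {
    private final Path batchFile;
    private final int expected;      // transaction count in, captured at split time
    private int written = 0;
    private int failed = 0;

    public AuditedBatch(Path batchFile, int expected) {
        this.batchFile = batchFile;
        this.expected = expected;
    }

    // Kept deliberately simple: open-append-close per record.
    public synchronized void append(String record) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(
                batchFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            out.write(record);
            out.newLine();
        }
        written++;
    }

    public synchronized void reportFailure() {
        failed++;
    }

    // Delivery is only triggered once everything is accounted for.
    public synchronized boolean readyToDeliver() {
        return written + failed == expected;
    }
}
```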
Any thoughts on this as a high level concept if not the details?
Periodic target batching, which just batches and sends what has been queued up to now, is easier to deal with. This allows the source to be a batch group or individual transactions. It also more readily allows parallel translation.
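For illustration, a bare-bones version of periodic target batching might look like the sketch below. The queue, the 15-minute schedule, and the send() stub are all assumptions rather than an actual IS facility:

```java
// Sketch of periodic target batching: on a schedule, deliver whatever has
// queued up since the last run, regardless of which source batch it came from.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicBatcher {
    private final BlockingQueue<String> translated = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Every 15 minutes, drain and deliver whatever has accumulated.
        scheduler.scheduleAtFixedRate(this::deliverPending, 15, 15, TimeUnit.MINUTES);
    }

    public void enqueue(String record) {
        translated.add(record);
    }

    private void deliverPending() {
        List<String> batch = new ArrayList<>();
        translated.drainTo(batch);
        if (!batch.isEmpty()) {
            send(batch);                 // FTP, file drop, publish, etc.
        }
    }

    private void send(List<String> batch) {
        System.out.println("Delivering batch of " + batch.size() + " records");
    }
}
```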
But if you cannot disconnect the source batch from the target batch (1000 docs in, 1000 docs out), then perhaps the approach below is workable.
Take a more explicit approach to handing out translation assignments and accepting the results. In other words, create a “batch” manager that farms out the translation work (via publish or whatever mechanism you’d like) and explicitly gathers the results (not necessarily the actual content–just pointers to where the results are, like a bizdoc ID or something). For example, batch process 1234 expects 1011 children and is explicitly tracking that somewhere.
The flow might look like this (a sketch of the tracking piece follows these steps):
With a large doc, only one IS/TN instance is going to be able to split it. So it can be the batch manager, taking note of the transaction sets to be split and translated.
Publish “do work” messages containing the TN document ID to be translated.
The workers translate their documents, creating a resulting target doc stored in TN. Each worker publishes a “done” message containing the TN document ID that was translated. On error, they’d publish a “done with error.”
Once all of the “done” messages have been received, the individual docs can be assembled (in a memory-friendly way) and sent to the target.
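Here is the tracking sketch referenced above. It shows only the bookkeeping (expected children, plus done/error results keyed by TN document ID); the callbacks stand in for whatever publish/subscribe triggers and tracking tables you actually use, and the class itself is hypothetical:

```java
// Rough sketch of the batch-manager bookkeeping. The "messages" would really be
// publish/subscribe events; only TN document IDs are tracked, never the content.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BatchManager {
    private final String batchId;
    private final int expectedChildren;
    private final Set<String> done = ConcurrentHashMap.newKeySet();       // TN doc IDs translated OK
    private final Map<String, String> errors = new ConcurrentHashMap<>(); // TN doc ID -> error reason

    public BatchManager(String batchId, int expectedChildren) {
        this.batchId = batchId;
        this.expectedChildren = expectedChildren;
    }

    // Called when a worker publishes a "done" message.
    public void onDone(String tnDocId) {
        done.add(tnDocId);
        checkComplete();
    }

    // Called when a worker publishes a "done with error" message.
    public void onError(String tnDocId, String reason) {
        errors.put(tnDocId, reason);
        checkComplete();
    }

    private void checkComplete() {
        if (done.size() + errors.size() >= expectedChildren) {
            // All children accounted for: gather the translated docs by ID
            // (streaming them from TN) and deliver to the target.
            System.out.println("Batch " + batchId + " complete: "
                    + done.size() + " ok, " + errors.size() + " failed");
        }
    }
}
```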
You’d probably want to use a set of tables and records for tracking, similar to the EDI tracking table but oriented to batch translation management.
The details to work out would be what to do with individual translation failures (fail the whole batch, continue with the batch but leave this one out and report it, etc.) and how to handle mulligans when the data center has a power outage in the middle of things.
Something I have done for handling large files (even ASCII) is to read them in as a byte array. For example, you might handle 1024 bytes at a time from the file stream so that your pipeline doesn’t fill up. You could process n bytes at a time, or even build up your backend-system file, while using very little memory in IS. In this manner, you are only handling small chunks of data at a time.
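A minimal sketch of that chunked read, in plain Java rather than an IS flow service, is below; the 1 KB buffer size matches the example above, and any per-chunk transformation would go where the comment indicates:

```java
// Sketch of the chunked read: pull n bytes at a time from the source stream and
// append them to the backend-system file, so the full document is never held
// in memory (or in the pipeline) at once.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkedCopy {

    public static void copy(InputStream source, Path target) throws IOException {
        byte[] buffer = new byte[1024];                 // 1 KB chunks, as in the post
        try (OutputStream out = Files.newOutputStream(target)) {
            int read;
            while ((read = source.read(buffer)) != -1) {
                // Each chunk could be transformed here before it is written out.
                out.write(buffer, 0, read);
            }
        }
    }
}
```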