Handling a Large CSV File

Hi.

Could you give any advice on how to handle (read) a large CSV file without running into memory issues?

  1. When using Flat File.
  2. When NOT using Flat File.

Best regards,
Shotat

Hi Shogo,

CSV handling is usually a candidate for flat file handling.

See the FlatFile Users Guide for further information; it also contains a chapter on large file handling.

Regards,
Holger

In my understanding, it’s better to use Flat File when the data structure is complex and there is no existing specific solution. CSV is a very common format, and there are hundreds of ways to handle it. Using Flat File is one of them, but maybe not the most efficient one.

If you want to keep the code simple and don’t want to import many third-party libraries, you can use Flat File as well. But if you’re chasing the best performance, you might choose another CSV solution. Both approaches can handle large files.
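
For the "not using Flat File" case, a minimal Java sketch of the streaming approach might look like the following. The file name large.csv and the process helper are just placeholders, and the naive comma split is only for illustration; a real parser such as Apache Commons CSV or OpenCSV (both of which can also stream) would be needed if fields can contain quoted commas or embedded newlines.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingCsvReader {

    public static void main(String[] args) throws IOException {
        // Stream the file line by line; only the current record is held in memory.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("large.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split: does not handle quoted commas or embedded newlines.
                String[] fields = line.split(",", -1);
                process(fields);
            }
        }
    }

    // Placeholder: map, validate, or write each record downstream here,
    // then let it go out of scope so it can be garbage-collected.
    private static void process(String[] fields) {
    }
}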

Holger-san, Xiaowai-san,

Thanks for the advice very much.
The target CSV format is not very complex, so I would like to use Flat File as much as possible.

I’m studying the following section of the “Flat File Schema Developer’s Guide Version 9.12”.
From this description, my understanding is that setting “iterate” to true in pub.flatFile:convertToValues
can reduce memory consumption.

If my understanding is wrong, please let me know.

Best regards,
Shotat


Handling Large Flat Files

By default, Integration Server processes all flat files in the same manner, regardless of their size.
Integration Server receives a flat file and keeps the file content in memory during processing.

However, if you receive large files, Integration Server can encounter problems when working
with these files because the system does not have enough memory to hold the entire parsed file.

If some or all of the flat files that you process encounter problems because of memory constraints,
you can set the iterate variable in the pub.flatFile:convertToValues service to true to process
top-level records (children of the document root) in the flat file schema one at a time.

After all child records of the top level record are parsed, the pub.flatFile:convertToValues service
returns and the iterator moves to the top level of the next record in the schema, until all records are parsed.

This parsing should be done in a flow service using a REPEAT step where each time the
pub.flatFile:convertToValues service returns, the results are mapped and dropped from the pipeline
to conserve memory.

If the results were kept in the pipeline, out-of-memory errors might occur. The pub.flatFile:convertToValues
service generates an output object (the ffIterator variable) that encapsulates and keeps track of the input
records during processing. When all input data has been parsed, this object becomes null.
When the ffIterator variable is null, you should use an EXIT step to exit from the REPEAT step and
discontinue processing.
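
To make this REPEAT pattern concrete, here is a minimal sketch written as an Integration Server Java service. The service name parseLargeFlatFile, the pipeline inputs ffData/ffSchema, and the processRecord helper are hypothetical; pub.flatFile:convertToValues, iterate, ffIterator, and ffValues are the names used in the guide. In Flow, the same logic is a REPEAT step containing convertToValues, a MAP that processes and then drops ffValues, and an EXIT once ffIterator is null.

// Java service body as it would appear in Designer; imports go on the Shared tab.
import com.wm.data.IData;
import com.wm.data.IDataCursor;
import com.wm.data.IDataFactory;
import com.wm.data.IDataUtil;
import com.wm.app.b2b.server.Service;
import com.wm.app.b2b.server.ServiceException;

public static final void parseLargeFlatFile(IData pipeline) throws ServiceException {
    IDataCursor pc = pipeline.getCursor();
    Object ffData = IDataUtil.get(pc, "ffData");           // file content: stream, bytes, or string
    String ffSchema = IDataUtil.getString(pc, "ffSchema"); // e.g. "myFolder:myCsvSchema" (assumed name)
    pc.destroy();

    try {
        Object ffIterator = null;
        do {
            IData input = IDataFactory.create();
            IDataCursor ic = input.getCursor();
            if (ffIterator == null) {
                IDataUtil.put(ic, "ffData", ffData);         // first call: hand over the data
            } else {
                IDataUtil.put(ic, "ffIterator", ffIterator); // later calls: resume where we left off
            }
            IDataUtil.put(ic, "ffSchema", ffSchema);
            IDataUtil.put(ic, "iterate", "true");            // parse one top-level record per call
            ic.destroy();

            IData output = Service.doInvoke("pub.flatFile", "convertToValues", input);
            IDataCursor oc = output.getCursor();
            IData ffValues = IDataUtil.getIData(oc, "ffValues");
            ffIterator = IDataUtil.get(oc, "ffIterator");
            oc.destroy();

            if (ffValues != null) {
                processRecord(ffValues); // hypothetical helper: map the record, then discard it
            }
        } while (ffIterator != null);    // per the guide: ffIterator is null once all input is parsed
    } catch (Exception e) {
        throw new ServiceException(e);
    }
}

The key point, as the guide says, is that each iteration maps the returned ffValues and then drops it, so memory use stays roughly constant regardless of file size.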

Tatsuzawa-san,

Is this topic still relevant for you?

Christian

Can anyone share example code for the explanation above?