We are processing very large flat files, which can be close to 700 MB - 1 GB. We validate the flat files (reading them as a stream), and once the output list is generated we simply do a batchInsert into the target table.
Do you see any issues here? I mean, does batchInsert have any limits?
How are you processing the file with the flat file services? Are you using pub.flatFile:convertToValues with iterate set to true? Based on the comment “once the output list is generated” it seems like you’re loading all records into memory at once, which may be an issue.
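For reference, here is a minimal sketch of the iterate=true pattern expressed as an IS Java service rather than a Flow service. The schema name and file path are hypothetical, and the parameter names (ffData, ffSchema, iterate, ffIterator, ffValues) follow the WmFlatFile documentation; verify them against your IS version. The point is that only one parsed record is held in memory per pass.

```java
import com.wm.app.b2b.server.Service;
import com.wm.data.*;
import java.io.FileInputStream;
import java.io.InputStream;

public final class FlatFileIterateSketch {
    public static void process() throws Exception {
        // Hypothetical file path; in practice the stream usually arrives in the pipeline.
        try (InputStream ffData = new FileInputStream("/data/inbound/orders.dat")) {
            Object iterator = null;
            while (true) {
                IData in = IDataFactory.create();
                IDataCursor c = in.getCursor();
                IDataUtil.put(c, "ffData", ffData);                  // the open stream
                IDataUtil.put(c, "ffSchema", "MyApp.ff:orderSchema"); // hypothetical schema name
                IDataUtil.put(c, "iterate", "true");
                if (iterator != null) {
                    IDataUtil.put(c, "ffIterator", iterator);        // resume where the last call stopped
                }
                c.destroy();

                IData out = Service.doInvoke("pub.flatFile", "convertToValues", in);
                IDataCursor oc = out.getCursor();
                IData record = IDataUtil.getIData(oc, "ffValues");
                iterator = IDataUtil.get(oc, "ffIterator");
                oc.destroy();

                if (record != null) {
                    // Map and insert this single record here; the full file is never in memory.
                }
                if (iterator == null) {
                    break;                                           // no more records
                }
            }
        }
    }
}
```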
By “how much jvm” I assume you mean CPU and memory allocated to the JVM.
While this is a factor to a degree, the other concern is loading a 700 MB - 1 GB file completely into memory, and having the content of that file replicated two or more times within memory (e.g. held during the read, copied during mapping, and so on). The data could be in memory 2-3 times.
The flat file and batch insert services (and XML and document services) tend to lead to solutions that load everything into memory at once. For most “event-driven” solutions (using this term loosely) this isn’t a concern. But when dealing with a large amount of data, such as in this case, it may cause a failure by exhausting memory, particularly if the JVM is doing other work at the same time.
The key here is to structure the reading and writing of the data in chunks, never all at once. If you have a single document list that holds all of the records, you’ve read everything into memory. Using a stream to read the data is only one of the steps to follow:
Use a stream to read the data.
Use iteration to read X records at a time. The number can vary, depending on record size, memory available to the JVM, etc. Testing will help you determine the “optimal” batch size. A rough sketch of the overall shape is below.
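As a sketch of that chunking pattern in plain JDBC terms (used here as a stand-in for a JDBC adapter batchInsert service): accumulate a fixed number of parsed records, flush them to the database, and let them go out of scope so the full file never sits in memory. The table, columns, delimiter, and BATCH_SIZE are illustrative only; in a Flow-based solution the per-record parsing would come from convertToValues as above rather than readLine.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public final class ChunkedBatchInsertSketch {
    private static final int BATCH_SIZE = 1000;   // tune through testing

    public static void load(String file, String jdbcUrl) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader(file));
             Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO target_table (col_a, col_b) VALUES (?, ?)")) {
            con.setAutoCommit(false);
            int pending = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // Stand-in for the per-record mapping done after convertToValues.
                String[] fields = line.split("\\|", -1);
                ps.setString(1, fields[0]);
                ps.setString(2, fields.length > 1 ? fields[1] : null);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();   // flush this chunk to the target table
                    con.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {           // flush the final partial chunk
                ps.executeBatch();
                con.commit();
            }
        }
    }
}
```

Whatever the exact services used, the memory profile stays flat because at most BATCH_SIZE records exist in memory at any point, instead of the entire output list.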