Large Flat File parsing and Database update.

Hi wMExperts,

I have a flat file with size of 26 MB and having 250000 records with 5 columns each(Pipeline delimited). I need to parse each record and insert it into a database.
Currently I am loading the file as stream(pub.file:getFile) and using iterator to convert to values. Also performing the database insertion in the same iteration.

Problem:
Its taking a huge time to complete the execution (More than 30 mins).
Can anybody suggest a better approach for it?

Any pointers are welcome.

Regards,
Niteesh

In the getFile service are you setting loadAs bytes or stream?

streams is always fast and better approach especially dealing large files.

using iterator functionality in the parsing is always good approach.

You said database insertion - are you using batch insert or record by record looping (obviously this takes time and any indexing on the tables associated for increasing performance??)

HTH,
RMG

I see you are using a stream and the iterator. That’s a good start.

What likely making it slow is read one, write one, read one, write one, etc.

If you can modify your services to read multiple, say 100 or 1000 lines (how many depends upon the size of the records), then write them using a batch insert service. You should see a significant decrease in processing time.

Another alterrnative for consideration, if there is no transformation of the data needed before insert, is to not have IS do the reading and writing. Instead, have it use a DB command-line tool. For example, if the DB is Oracle you could use SQL*Loader. That would load the data far more quickly than IS and the JDBC adapter will be able to do. Of course there are drawbacks to such an approach but if your scenario fits the constraints it may be the way to go.

Yes, I am using stream. Also I am forming the document list(By appending individual doc in each iteration). Once i get the complete list, I am doing batch insert. But still its taking considerable time.
Is there anything that could still improve execution time?
One of my colleague also suggested to chunk the file and do separate process, will that improve?

Thanks &Regards,
Niteesh

Thanks Rob,
I have tried using vector for appending purpose, but still no considerable improvement. Yes, I will try to insert 1000 records at a time, will see the effect.
Also, can you tell how to use SQL Loader? I am forming a doc list and doing batch insert. But anyways, i dont think its poor response is due to DB operation.

Awaiting your response,

Thanks,
Niteesh

“Also I am forming the document list(By appending individual doc in each iteration). Once i get the complete list, I am doing batch insert.”

That is not a good approach for this many records.

The suggestions we are making are the same as your colleague–chunk the file. That is:

Read 100 records. (or 500 or 1000 or whatever works)
Write 100 records in one batch insert call.
Repeat until done.

Do not use Vector, or appendToDocumentList or any collection that reallocates memory when the capacity needs to grow. Use LinkedList or similar to collect the 1000. Then get the array for that list just before doing the batch insert to the DB.

I wouldn’t try multiple threads/processes for processing the file. You’ll spend a lot of time and effort trying to get that to work in a coordinated way.

Learning how to use SQL*Loader in a discussion forum probably isn’t the right place. You can search for the docs on the web for that. Or get info from one of your DB people.

Fine Rob. I will use Array List for appending purpose. But I have one more question, whats the optimum input capacity of batch insert input. Will it work efficiently without any failure for high load?

SQL loader will be completely out of webMethods i guess rite? That will need shell scripting if i am not wrong.

Regards,
Niteesh

SQL loader will be completely out of webMethods i guess rite? → YES

whats the optimum input capacity of batch insert input. Will it work efficiently without any failure for high load? → YES as long db/table can handle it or use a Stored Procedure for performance if that route is doable for you:

Also there is some configuration notes for batch_insert SQL template, please review the JDBCAdapter user guide for more information:

HTH,
RMG

ArrayList isn’t a strong choice either–it is just like Vector and uses an array that is resized as needed.

The optimum capacity depends on the environment–the JVM, memory, the DB, etc.

Yes using SQL Loader will involve shell scripting and calling it via command line shell from IS.