I recently got a request to generate an XML file from data returned by several MSSQL tables. The only problem is that a single table can return more than 30 000 records. Naturally, I created several adapter services with a select template; the data returned from them is mapped to a specific structure, after which I call pub.xml:documentToXMLString and a custom Java service that writes the resulting String using the write method of an OutputStreamWriter. When I try to run the flow it basically slows down the entire IS, which at some point dies (the JVM dies). My question is: how would you implement this in an efficient way so that it won't affect the IS?
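For reference, the write step boils down to something like the sketch below (a simplified illustration, not the actual service – class, method and path names are made up). The point is that the complete XML string produced by documentToXMLString has to fit in the JVM heap before a single byte reaches disk.

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Simplified sketch of a "write the whole string in one go" service.
// The entire document is held in memory as one String before it is written.
public class WriteWholeXml {
    public static void write(String xmlString, String targetPath) throws Exception {
        try (OutputStreamWriter writer = new OutputStreamWriter(
                new FileOutputStream(targetPath), StandardCharsets.UTF_8)) {
            writer.write(xmlString);   // xmlString is the output of pub.xml:documentToXMLString
        }
    }
}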
First of all, you should make sure that the IS is configured with enough JVM memory to hold the complete data set in memory.
Second, you should try to find out which part is slowing down the IS: is it the reading from the DB, the aggregation into the document structure, or the writing of the file?
You will have to check the different log files after the JVM dies to find out what caused it to die.
The Tanuki Wrapper should take care of restarting the JVM and the IS after the previous JVM has died.
The IS now has 16 GB of RAM configured. After confirming that some of the records had been deleted from the source tables, I ran the job again (there are still a lot of records), and I can see that pub.xml:documentToXMLString is still shown as executing in the thread list (it's been more than 2 hours).
Couldn't resist: you should not be trying to process all these records in one go. It doesn't scale, and if it fails you have to start from zero. In addition, even if (and it is a big if) you manage to size the system for the current number of records, is that the maximum? Can you guarantee that in production it won't increase again?
You need to revisit your code to introduce pagination. You will have to create a custom query using the custom SQL JDBC adapter service and introduce a row number.
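For illustration, outside the adapter service the paged query amounts to something like this in plain JDBC – it assumes SQL Server 2012+ OFFSET/FETCH syntax, and src_table, id, col1, col2 are placeholder names; in IS you would express the same query in a custom SQL adapter service with offset and pageSize as inputs:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class PagedSelect {

    // One page of rows, ordered by a stable key so pages do not overlap.
    public static List<String[]> fetchPage(Connection con, int offset, int pageSize) throws Exception {
        String sql = "SELECT id, col1, col2 FROM src_table "
                   + "ORDER BY id "
                   + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        List<String[]> rows = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, offset);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] { rs.getString("id"), rs.getString("col1"), rs.getString("col2") });
                }
            }
        }
        return rows;   // an empty list signals that the last page has been passed
    }
}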
Your flow service should then use a REPEAT step to pull a page of rows into memory, map them and write them to the outbound file. For extra resilience you could persist the index via the pub.storage:* services so that if the job fails you can resume from where you left off.
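Purely to illustrate the checkpoint idea (pub.storage would be the IS-native choice), a plain-Java stand-in that keeps the last committed offset in a small file could look like this – the path is a placeholder:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OffsetCheckpoint {
    private final Path file;

    public OffsetCheckpoint(String path) {
        this.file = Paths.get(path);
    }

    // Offset to resume from: 0 on a fresh run, last committed value after a failure.
    public int load() throws Exception {
        if (!Files.exists(file)) {
            return 0;
        }
        return Integer.parseInt(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim());
    }

    // Call this only after the page has been successfully appended to the output file.
    public void save(int nextOffset) throws Exception {
        Files.write(file, String.valueOf(nextOffset).getBytes(StandardCharsets.UTF_8));
    }
}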
When you have finished, send the file to the recipient via a stream object to avoid having to read the entire file into memory.
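The plain-Java equivalent of "stream, don't slurp" is just a chunked copy from a FileInputStream to whatever destination stream the delivery uses, roughly:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamFile {
    // Copy the finished file to a destination stream in small chunks,
    // so memory use stays constant regardless of file size.
    public static void send(String path, OutputStream destination) throws Exception {
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new FileInputStream(path)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                destination.write(buffer, 0, read);
            }
            destination.flush();
        }
    }
}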
regards,
John.
No – do not read all the data in one big gulp into the JVM. Instead ‘slide’ over input and output data.
A solution (a rough code skeleton follows the list):
1. Write the XML header (the tags that begin your XML dataset) to your target XML data file
2. Use database cursors to read database records in batches
3. As you process each batch, build up an XML fragment corresponding to its records
4. When an XML fragment (representing a batch of records) is complete, write it to the target XML file in append mode
5. Once all your batches are read, write the XML footer to the target XML data file
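Putting those five steps together, a rough plain-Java skeleton could look like the following – the root element, page size and fetchBatch step are placeholders, and inside IS the loop would of course be a REPEAT over your adapter service:

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class SlidingXmlExport {

    public static void export(String targetPath) throws Exception {
        // 1. Write the XML header once, truncating any previous file.
        try (BufferedWriter w = Files.newBufferedWriter(Paths.get(targetPath), StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<records>\n");
        }

        // 2.-4. Read in batches, turn each batch into an XML fragment, append it.
        int offset = 0;
        int pageSize = 5000;
        List<String> batch;
        while (!(batch = fetchBatch(offset, pageSize)).isEmpty()) {
            StringBuilder fragment = new StringBuilder();
            for (String record : batch) {
                // Real code would escape the content or build the fragment with documentToXMLString.
                fragment.append("  <record>").append(record).append("</record>\n");
            }
            try (BufferedWriter w = Files.newBufferedWriter(Paths.get(targetPath),
                    StandardCharsets.UTF_8, StandardOpenOption.APPEND)) {
                w.write(fragment.toString());
            }
            offset += batch.size();
        }

        // 5. Close the document with the XML footer.
        try (BufferedWriter w = Files.newBufferedWriter(Paths.get(targetPath),
                StandardCharsets.UTF_8, StandardOpenOption.APPEND)) {
            w.write("</records>\n");
        }
    }

    // Placeholder for the real paged read (adapter service, cursor, or JDBC query).
    private static List<String> fetchBatch(int offset, int pageSize) {
        return java.util.Collections.emptyList();
    }
}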
Regarding point #2… databases have functionality called cursors that help you slide over datasets, so there is no need to add new columns. Here's some generic code. Note, it's based on code that's nearly 20 years old, so there may be a better way to 'do cursors' now.
Oracle Stored Procedure
CREATE OR REPLACE PACKAGE BODY some_proc AS

  PROCEDURE get_some_records (
    field1  IN     table.field1%TYPE,
    field2     OUT table.field2%TYPE,
    ...
    cursor1 IN OUT GenericCurTyp)
  IS
    ...
    v_field2 table.field2%TYPE;
  BEGIN
    ...
    SELECT ...
      INTO v_field2, field2
      FROM ...
     WHERE ...;
  END get_some_records;

END some_proc;
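If creating stored procedures is not an option (for example with a read-only DB user), you can often get the same sliding behaviour from plain JDBC by streaming a single result set with a modest fetch size. A rough sketch with placeholder names – whether the driver truly streams depends on its settings (e.g. adaptive response buffering for the SQL Server driver):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorStyleRead {

    // Stream rows from one big SELECT instead of materialising them all at once.
    public static void readAll(Connection con) throws Exception {
        try (Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                ResultSet.CONCUR_READ_ONLY)) {
            st.setFetchSize(5000);   // hint: pull rows in chunks of roughly 5000
            try (ResultSet rs = st.executeQuery("SELECT id, col1, col2 FROM src_table ORDER BY id")) {
                while (rs.next()) {
                    // build/append the XML fragment for this row here
                }
            }
        }
    }
}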
Regarding point #4 – now you slide over your output object, appending XML fragments to it. WmPublic or custom Java services should help if the writes are to local disk. I've had good experience writing multi-GB files to a remote SFTP server by calling pub.client.sftp:put with mode=APPEND.
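Outside IS, the same append-per-batch idea with a generic SFTP library such as JSch looks roughly like the sketch below (host, credentials and paths are placeholders, and in a real loop you would keep the session open across batches rather than reconnecting each time):

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class SftpAppend {

    // Append one XML fragment to a remote file, batch by batch.
    public static void appendFragment(String host, String user, String password,
                                      String remotePath, String xmlFragment) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession(user, host, 22);
        session.setPassword(password);
        session.setConfig("StrictHostKeyChecking", "no");   // demo only; verify host keys in real code
        session.connect();
        try {
            ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
            sftp.connect();
            try {
                sftp.put(new ByteArrayInputStream(xmlFragment.getBytes(StandardCharsets.UTF_8)),
                         remotePath, ChannelSftp.APPEND);
            } finally {
                sftp.disconnect();
            }
        } finally {
            session.disconnect();
        }
    }
}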
Thanks a lot for your tips! Yes, I knew it was bad practice to just pull all the data out of the DB at once. It's the kind of request that comes up all of a sudden and you are asked if you can get it to PROD in 2 days max. At some point I was thinking about splitting the returned results into batches, but yes, it does not make sense to get all records at once. The thing is that my DB user only has read rights, so for now I will try to use offset pagination in a repeat and append to the resulting XML for each batch.
Although this isn’t a typical approach, I’ve seen Terracotta being used for large-file handling.
You can explore that option, subject to a number of analysis points, such as (but not limited to) the nature and urgency of the data, the criticality of the interface, the frequency of changes, and so on.
For example, a medical research entity preparing permutation-and-combination datasets of product compositions – a large dataset that also has to be available nearly instantaneously.
Hello,
To give you an update: I implemented custom adapter services which also take as inputs the current offset and the number of rows to pull, and the flows have REPEAT-on-success steps which call the adapter services until the query returns nothing; for each repeat iteration the data is converted to XML and appended to a file. For one of the tables, which has almost 1 000 000 records, I am pulling 4500 records at a time. The result is that the file is generated in about an hour, give or take, which is fine. There could be dependencies between rows (a row from one batch could depend on rows from the next batch, as we are talking about invoices and invoice lines), but that's a problem for another day.
Thanks again for your tips!
n23
If you can extract the invoice numbers in your 4500-row dataset, couldn't a second JDBC adapter service reliably get the invoice lines for each ID? (SELECT ... WHERE invoice_id=... ?)
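Roughly, with placeholder table and column names, that second lookup could collect the invoice IDs seen in the current page and fetch all of their lines in one query, so a parent invoice and its lines always end up in the same XML fragment:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class InvoiceLineLookup {

    // Fetch every line belonging to the invoices of the current batch.
    public static List<String[]> linesForInvoices(Connection con, List<Long> invoiceIds) throws Exception {
        if (invoiceIds.isEmpty()) {
            return new ArrayList<>();
        }
        // Build "?,?,?" with one placeholder per invoice id in this batch.
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < invoiceIds.size(); i++) {
            placeholders.append(i == 0 ? "?" : ",?");
        }
        String sql = "SELECT invoice_id, line_no, amount FROM invoice_lines "
                   + "WHERE invoice_id IN (" + placeholders + ") ORDER BY invoice_id, line_no";
        List<String[]> lines = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (int i = 0; i < invoiceIds.size(); i++) {
                ps.setLong(i + 1, invoiceIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lines.add(new String[] { rs.getString("invoice_id"),
                                             rs.getString("line_no"),
                                             rs.getString("amount") });
                }
            }
        }
        return lines;
    }
}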