What product/components do you use and which version/fix level?
10.5
Are you using a free trial or a product with a customer license?
Customer License
What are you trying to achieve? Please describe in detail.
I'm getting a 4 GB CSV file (about 20 million rows) from SFTP using pub.client.sftp:get. Then, when pub.io:streamToString is executed in my flow service with the pub.client.sftp:get output, it throws an ArrayIndexOutOfBoundsException. I think it is related to the file size, because smaller files are processed successfully.
Can you provide more details about your flow?
From the documentation for pub.client.sftp:get, the output is saved to a file if the optional input localFile is provided; otherwise it is returned in the contentStream output parameter. In your case, is the optional input not specified? Does the IS have enough memory to hold this file in memory?
You can try specifying a localFile to save it and confirm that the problem is not in the data returned.
Regarding the ArrayIndexOutOfBoundsException: you can get the stack trace from the Error log, under Admin > Logs > Error.
Do not use streamToString; it will not scale for this use case. You need to iterate over the stream, record by record. You could in fact use the flat file service pub.flatFile:convertToValues, which allows for iteration, so that you do not have to hold the entire file in memory at the same time (see the sketch below).
Just make sure not to add the resulting records to an array; instead, process them immediately and then drop them.
Equally, do not try to do this on the incoming stream from the remote server; instead, save the stream to a local file and then read a stream from the local file. That way there is less risk of interruption due to network issues. Depending on the workload, a 4 GB file might take hours to process. If in-order processing is not important, you might also want to delegate the processing to a child thread via publish/subscribe.
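For illustration only, here is the same idea in plain Java rather than Flow (the local path and method names are hypothetical; in IS the equivalent is convertToValues with iterate=true reading from the local file):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamIterationSketch {

    // Save the remote stream to disk first, then read it back one record at a time.
    // At no point is the whole 4 GB file held in memory.
    static void process(InputStream remoteStream) throws IOException {
        Path localFile = Path.of("/tmp/inbound.csv"); // hypothetical local path
        Files.copy(remoteStream, localFile, StandardCopyOption.REPLACE_EXISTING);

        try (BufferedReader reader = Files.newBufferedReader(localFile)) {
            String record;
            while ((record = reader.readLine()) != null) {
                handleRecord(record); // process immediately, then let the record go out of scope
            }
        }
    }

    static void handleRecord(String record) {
        // map / insert / publish a single record here
    }
}
```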
regards,
John.
Hi John, thank you for your reply. Iterating over a stream works fine for me. Now I'd like to do the same with a child thread via publish/subscribe; could you give more information about that, please?
This is my flow:
- getFile (flow service to get the file to process as a stream)
- REPEAT
-- convertToValues (iterate = true, batchSize = 20000)
-- LOOP over results
--- map to inputList for the batch insert adapter input
-- adapterBatchInsert
-- BRANCH on ffIterator to exit the loop when it is null
Replace the adapterBatchInsert step with a call to pub.publish:publish.
You will need to precede it with a map step to map the record to a document type.
You can then configure a trigger to subscribe to the given document type and call a service, which would then call your adapterBatchInsert.
Then configure the trigger properties to make it concurrent and specify the maximum number of processing threads to allow parallel processing.
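The trigger itself is configured in IS rather than coded, but as a rough analogy in plain Java, the concurrent hand-off looks like this (pool size and names are hypothetical, not the webMethods API):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentTriggerSketch {

    // The fixed pool size plays the role of the trigger's
    // "max execution threads" property.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // "Publishing" here just hands the document to the pool; in IS the trigger
    // receives the published document and invokes the subscribing service.
    void publish(List<String> documentBatch) {
        workers.submit(() -> subscriberService(documentBatch));
    }

    void subscriberService(List<String> documentBatch) {
        // this is where the subscribing flow would call adapterBatchInsert
    }

    void shutdown() {
        workers.shutdown(); // stop accepting work once the file is fully read
    }
}
```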
Negatives:
1. You can no longer take advantage of batch processing if writing to a db.
2. Processing can get out of order, so if order is critical then don't do this.
(1) can be mitigated if you group records together when publishing, i.e. instead of publishing each record one by one, group them into batches and then publish.
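A minimal sketch of that grouping, in plain Java with hypothetical names and batch size (the publish call stands in for the MAP + pub.publish:publish steps):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedPublishSketch {
    private static final int PUBLISH_BATCH_SIZE = 1000; // hypothetical group size
    private final List<String> buffer = new ArrayList<>(PUBLISH_BATCH_SIZE);

    // Collect records and publish one document per PUBLISH_BATCH_SIZE records,
    // so the subscriber can still do a batch insert.
    void add(String record) {
        buffer.add(record);
        if (buffer.size() == PUBLISH_BATCH_SIZE) {
            flush();
        }
    }

    // Call once more at end of file to publish any remaining records.
    void flush() {
        if (!buffer.isEmpty()) {
            publishDocument(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    void publishDocument(List<String> records) {
        // map the batch to a publishable document type and publish it as one document
    }
}
```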