Large flat file to XML

Hi, I have a flat file that is about 35 MB and contains about 250k records.
I need to read that file in, perform some business logic, and produce an XML output file of about 13 MB.

A skeleton of my code is:

REPEAT (on success)
    pub.flatFile:convertToValues
    mapping to build the XML document
    exit the loop on a null ffIterator
pub.xml:documentToXMLString
write out the file (or post via HTTPS if possible)

In our development environment a 5 MB file runs in about 13 seconds and a 10 MB file in about 1 minute, but the 30 MB file takes a little over an hour.
When I move up to our test environment, webMethods basically chokes on the 10 MB file and becomes unresponsive.

I suspect the full XML document is being held in memory, but I'm unsure how to keep only part of it in memory and write the rest out to a file.

This code writes a 193 MB file in about 4 seconds when the input is a byte[], but takes about 12 seconds when the input is a String. Try it and see if it works for you.

I've never done this before, but if you want to reduce the execution time further, I believe there is a method somewhere in the IDataCoder class that can convert an IData object to a byte[] directly, without converting it to a String first. Use that in conjunction with this service and pass the resulting byte[] as the input.
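To illustrate why the byte[] path is cheaper, here is a minimal plain-Java sketch (outside webMethods; the file names and payload are made up). Going through a String forces a full decode and re-encode of the payload, while writing the bytes directly skips both copies:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ByteVsStringWrite {
    public static void main(String[] args) throws IOException {
        byte[] payload = "<root><item>example</item></root>".getBytes(StandardCharsets.UTF_8);

        // Fast path: bytes go to disk as-is, no character decoding or re-encoding.
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("direct.xml"))) {
            out.write(payload);
        }

        // Slow path: bytes -> String (decode) -> bytes again (encode) before writing.
        String asString = new String(payload, StandardCharsets.UTF_8);
        try (Writer out = new BufferedWriter(new FileWriter("viaString.xml"))) {
            out.write(asString);
        }

        // Both files end up identical; only the work done differs.
        System.out.println(new File("direct.xml").length() == new File("viaString.xml").length());
    }
}
```

On a 193 MB payload that extra decode/encode pass is a large part of the difference the timings above show.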

Inputs:
xmlData: the input XML string
fileLocation: target location to write the file
bytes (optional): byte representation of the XML string or plain text

IDataCursor pipelineCursor = pipeline.getCursor();
String xmlData = IDataUtil.getString(pipelineCursor, "xmlData");
String fileLocation = IDataUtil.getString(pipelineCursor, "fileLocation");
//Object bytes = IDataUtil.get(pipelineCursor, "bytes");

// Clock start time
long timeIn = new Date().getTime();
try {
    //InputStream is = new ByteArrayInputStream((byte[]) bytes);
    InputStream is = new ByteArrayInputStream(xmlData.getBytes());
    Reader in = new InputStreamReader(is);
    Writer out = new FileWriter(fileLocation);
    while (true) {
        // Synchronize on the shared buffer so concurrent invocations don't clobber it
        synchronized (inChars) {
            int amountRead = in.read(inChars);
            if (amountRead == -1) {
                break;
            }
            out.write(inChars, 0, amountRead);
        }
    }
    in.close();
    out.close();
    // Clock end time (note: no (int) cast on length(), which would truncate large files)
    long timeOut = new Date().getTime();
    IDataUtil.put(pipelineCursor, "debug", "Custom buffered copy time for a file of size "
            + new File(fileLocation).length() + " is " + (timeOut - timeIn) + " ms");
} catch (FileNotFoundException e) {
    e.printStackTrace();
    IDataUtil.put(pipelineCursor, "result", e.toString());
} catch (IOException e) {
    e.printStackTrace();
    IDataUtil.put(pipelineCursor, "result", e.toString());
}
// Destroy the cursor only after the last put() against it
pipelineCursor.destroy();

Shared Code:

static final int CHUNK_SIZE = 100000;
static final char[] inChars = new char[CHUNK_SIZE];

(The Java service's imports also need java.io.* and java.util.Date.)

Cheers,
Akshith

Thanks Akshith, I'm not much of a Java programmer.
I think what I am going to do is split my doctype into three parts:
1 is the header,
2 is the detail,
3 is the trailer.
I'll read my flat file with the ffIterator instead of reading in the whole file.
I'll write the header to the file.
As I loop over my detail records and perform my mapping/business logic, I'll write (append) them to the file and clear my variables.
When finished I can write out the trailer.

I think that will keep my memory use down.
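The header/detail/trailer pattern could be sketched in plain Java like this (a minimal illustration; the element names and record format are made up, and in a real flow the detail strings would come from the ffIterator loop one chunk at a time):

```java
import java.io.*;
import java.util.List;

public class StreamedXmlWriter {
    // Write the header once, stream each detail as it is produced, then the trailer.
    // Only one detail chunk is ever held in memory at a time.
    public static void writeXml(String path, List<String> details) throws IOException {
        try (Writer out = new BufferedWriter(new FileWriter(path))) {
            out.write("<orders>\n");                      // header
            for (String detail : details) {               // one mapped chunk at a time
                out.write("  <order>" + detail + "</order>\n");
            }
            out.write("</orders>\n");                     // trailer
        }
    }

    public static void main(String[] args) throws IOException {
        writeXml("out.xml", List.of("A", "B", "C"));
    }
}
```

The key point is that no single String ever holds the whole 13 MB document; memory use stays proportional to one chunk regardless of file size.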

That is exactly the memory-friendly approach.

For additional speed, you may find that processing a group of detail records at a time (instead of one at a time) is faster.

Be sure to set buffer sizes on the file read and write to be reasonable as well for good throughput.

Yeah, I guess that will work too. As Rob mentioned above, you might want to process a set of records per read/write instead of a single record at a time.
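Batching records per write could look like this in plain Java (a hypothetical sketch; the batch size and record format are placeholders to show the pattern, not values from this thread):

```java
import java.io.*;
import java.util.List;

public class BatchedAppend {
    // Append records to the file in batches of batchSize, cutting the number of
    // write calls while keeping memory bounded to one batch.
    public static void appendInBatches(String path, List<String> records, int batchSize)
            throws IOException {
        try (Writer out = new BufferedWriter(new FileWriter(path, true))) { // append mode
            StringBuilder batch = new StringBuilder();
            int count = 0;
            for (String rec : records) {
                batch.append(rec).append('\n');
                if (++count % batchSize == 0) {
                    out.write(batch.toString());
                    batch.setLength(0);   // reuse the builder, keep memory bounded
                }
            }
            out.write(batch.toString());  // flush the final partial batch
        }
    }
}
```

Larger batches mean fewer writes but more memory per batch, which is the trade-off behind tuning the chunk size.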

Cheers,
Akshith

I set the code to write out the XML file in chunks. If I run in 500-1000 line chunks it takes about 3 minutes (versus 90 before); if I split it into 2000-line chunks it takes about 5 minutes.

Sometimes, though, if the load is high it can still take 15 minutes to process.
Thanks everyone for your help.