PARSE XML with large XML files

Hello,

is it possible to parse large XML files line by line with Natural, kind of like a SAX parser in Java? The only approach I have found (in the documentation and in this forum) is to read the whole (!) XML file into a dynamic variable and call PARSE XML on it (which seems like a DOM parser to me). This works fine for small XML files, but for larger files (the magic file size in our environment seems to be around 30 MB) I get:

NAT1222 Memory required for statement execution not available.

The code I use:

DEFINE WORK FILE 1 #FILENAME TYPE 'UNFORMATTED'
READ WORK FILE 1 #XML
END-WORK
*
PARSE XML #XML INTO PATH #XML-PATH NAME #XML-NAME VALUE #XML-VALUE
  INPUT (AD=IO) #XML-PATH (AL=70) #XML-NAME (AL=70) #XML-VALUE (AL=70)
END-PARSE

I already tried reading the work file line by line and calling PARSE XML for each line, but then I get (probably because the XML fragment is not a valid XML document):

0350 NAT8311 Error parsing XML document

Could anyone tell me if it is possible to parse large XML files with Natural?

Best regards,
Stefan

I was able to parse a file of almost 45 MB by increasing the Work Area Size from the default 20 MB to 200 MB. I’m using Natural for Windows 8.3.1.

Natural Configuration Utility → Natural Parameter Files → NATPARM → Natural Execution Configuration → Buffer Sizes → Work Area Size (USIZE)

The Danish company register just converted their output format to XML as ONE big document of 1.3 GB!

Most environments don’t support this size, so as a workaround I helped a customer create a “front end” that breaks the XML into the actual “records” and then passes these to the parser.

More precisely:
Read a suitable chunk of data and feed it into a buffer until the buffer holds the relevant XML elements, then pass these to the parser and delete them from the buffer, and then refill the buffer with the next record.
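
The buffering idea described above can be sketched roughly as follows. This is only an illustration written in Python, not Natural code and not the actual customer implementation. The function name `iter_records`, its parameters, and the chunk size are made up for the example. It assumes the record element has no attributes, is never nested inside itself, and that its tags never appear inside comments or CDATA sections.

```python
def iter_records(stream, start_tag, end_tag, chunk_size=64 * 1024):
    """Yield each complete start_tag ... end_tag section found in a text stream.

    Hypothetical helper: only one chunk plus any unfinished record is held
    in memory at a time.
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        # Hand out every complete record that is currently in the buffer.
        while True:
            start = buffer.find(start_tag)
            if start == -1:
                # No record begins here; keep only a short tail in case
                # the start tag is split across two chunks.
                keep = len(start_tag) - 1
                buffer = buffer[-keep:] if keep > 0 else ""
                break
            end = buffer.find(end_tag, start + len(start_tag))
            if end == -1:
                # Record is still incomplete: drop the text before it
                # and wait for the next chunk.
                buffer = buffer[start:]
                break
            end += len(end_tag)
            yield buffer[start:end]   # a small, self-contained XML fragment
            buffer = buffer[end:]     # delete the record from the buffer
        if not chunk:                 # end of input reached
            break
```

Each yielded string has a single root element, so it can be handed to an XML parser as a small document of its own.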

Finn

Hi Finn,

is this implemented in Natural or externally? How can I read a part of an XML file with Natural and provide a valid XML document to the parser? If I stop reading e.g. after a certain number of lines, the XML document is incomplete and Natural’s parser will not process it.

Best regards,
Stefan

The structure of the document is something like this














So I read and fill up the buffer until I have both the location of the start- and end-tag of
And then copy this section of the buffer to a dynamic string that I pass to the parser.
Of course, you have to generate a parser subprogram from a schema that contains only the section you want to parse.

  • all of it simple string handling, and all done in Natural :wink:

Hello,

it seems to me that the main problem is the lack of a SAX-like XML parser in Natural :slight_smile: Both solutions (increasing the Work Area Size and writing your own “parser”) only work around the limitations of the current implementation of XML handling in Natural (DOM).

We use Natural for batch processing, which uses large work files simply due to the large number of processed records. How can it be that Natural only provides a DOM parser and not a SAX parser? I think the latter would be far more useful in a system like Adabas/Natural.

However, as we probably won’t be able to solve this problem ourselves, I think I’ll try to split the XML file up into smaller parts (externally) and then have Natural process them one by one.
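
One way such an external split might look is sketched below in Python, using the standard library's streaming `xml.etree.ElementTree.iterparse`. This is an assumption-laden example, not a tested production tool. The function name `split_xml`, the wrapper element `records`, and the `partNNNNN.xml` file names are invented for illustration. It assumes the records are direct children of the root element and that the document uses no XML namespaces.

```python
import os
import xml.etree.ElementTree as ET


def split_xml(source_path, record_tag, out_dir, records_per_file=1000):
    """Split one large XML file into smaller well-formed files.

    Returns the list of paths written (hypothetical naming scheme).
    """
    paths = []
    batch = []

    def flush():
        # Write the collected records into the next part file.
        if not batch:
            return
        path = os.path.join(out_dir, "part%05d.xml" % (len(paths) + 1))
        with open(path, "w", encoding="utf-8") as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n<records>\n')
            out.writelines(batch)
            out.write("</records>\n")
        paths.append(path)
        batch.clear()

    context = ET.iterparse(source_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # drop whitespace that followed the record
            batch.append(ET.tostring(elem, encoding="unicode") + "\n")
            root.clear()      # release processed records to keep memory low
            if len(batch) >= records_per_file:
                flush()
    flush()  # write the final, possibly smaller, batch
    return paths
```

Each part file is a complete XML document, so it could then be read into a dynamic variable and passed to PARSE XML one file at a time.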

Best regards,
Stefan

Hi Stefan
To the best of my knowledge, the Natural parser IS in fact a SAX parser!
The problem is that the only input for the parser is a dynamic string, and that there is a limit to the practical length of this.

Perhaps someone could think up a variant that takes a work file as input?!

Finn
BTW, the string handling in Natural is not that tricky, so why split the process in two?