I am trying to gauge how much memory will be used when loading XML documents into memory. Suppose I have an XML document that is X kilobytes on disk, or that contains Y kilobytes of information.
Is there a good way of determining how much memory would be used in loading it into a record structure?
I want to determine the upper limit on how large an XML document can be loaded into a wM record structure. I suppose this will depend on the heap size, but I need to understand it a bit more.
There are three major options that drastically affect RAM usage:

1. If you use the NodeIterator services in the WmPublic pub.web folder, there is no upper limit. You can see an example of using these services in the WmSamples large-mapping example.

2. If you don't need all the data, but just a few bits of information, you can use pub.web:queryDocument to pull the pieces you want out of the document into a record containing just that subset, using XQL or WQL queries. This requires an XML parse tree to be in memory, but little else.

3. Often you need all the elements of the incoming document, and you use pub.web:documentToRecord to create an IData image of the data for mapping, etc. This operation has a full XML parse tree in memory and then an IData tree that is about the same size. Assuming you then drop the Node object (i.e. the parse tree) from the pipeline as soon as documentToRecord completes, the parse tree RAM can be GCed, so there is a peak you need to account for, but the memory is not held. (See the Java sketch after this list.)
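To make option #3 concrete, here is a minimal Java-service-style sketch of the parse / convert / drop-the-node pattern. The pub.web service names are the ones mentioned above, but the pipeline field names ("xmldata", "node", "boundNode") and the Service.doInvoke overload are from memory, so treat this as a sketch to verify against your IS version rather than finished code.

// A minimal sketch of option #3: parse the XML, convert it to an IData record,
// then let go of the node so the parse tree can be garbage collected.
// Field names marked "assumed" below are assumptions, not verified signatures.
import com.wm.app.b2b.server.Service;
import com.wm.data.IData;
import com.wm.data.IDataCursor;
import com.wm.data.IDataFactory;
import com.wm.data.IDataUtil;

public class DocumentToRecordSketch {

    public static IData loadRecord(String xml) throws Exception {
        // 1. Parse the XML string into a node (this builds the parse tree
        //    on top of the in-memory XML data).
        IData parseIn = IDataFactory.create();
        IDataCursor pc = parseIn.getCursor();
        IDataUtil.put(pc, "xmldata", xml);                        // assumed input name
        pc.destroy();
        IData parseOut = Service.doInvoke("pub.web", "stringToDocument", parseIn);

        IDataCursor poc = parseOut.getCursor();
        Object node = IDataUtil.get(poc, "node");                 // assumed output name
        poc.destroy();

        // 2. Convert the node into an IData image of the whole document.
        IData convIn = IDataFactory.create();
        IDataCursor cc = convIn.getCursor();
        IDataUtil.put(cc, "node", node);
        cc.destroy();
        IData convOut = Service.doInvoke("pub.web", "documentToRecord", convIn);

        IDataCursor coc = convOut.getCursor();
        IData record = IDataUtil.getIData(coc, "boundNode");      // assumed output name
        // Equivalent of dropping "node" in a flow service: remove any remaining
        // reference to the parse tree so it can be GCed. Only the IData record
        // (roughly the same size as the parse tree) stays in memory.
        IDataUtil.remove(coc, "node");
        coc.destroy();

        return record;
    }
}

The important point is the last step: once nothing in the pipeline (or your Java code) still references the node, the parse tree RAM is reclaimable, which is why the peak usage is higher than the steady-state usage.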
Now for a ROUGH look at the XML parse tree RAM usage that happens during options #2 and #3:
An IS XML parse tree is a structure built on top of the original XML stream, meaning the original file data stays in memory, so the disk size of the file is always part of the RAM footprint. There is a Node object for each element in the XML file. A Node object contains a parent reference, a child reference list, an attribute list, a name, a namespace name, an nsdecls list, and some other information needed to encapsulate the XML information set. I don't know the RAM size of this object, so I'll take a wild guess and call it NodeSize.
We could then do the following calculations:
Assumptions:
NodeSize = 200 Bytes
DiskSize = 1 Meg
DataPercent = 70% (the other 30% is XML tags)
AvgTagNameSize = 10 characters
Two tagNames per Node (an open tag and a close tag)
Two extra characters per tagName (roughly the markup characters around the name)
Formulas:
NumberNodes = (DiskSize * (1-DataPercent))/((AvgTagNameSize+2)*2)
RAMSize = DiskSize + (NumberNodes * NodeSize)
Calculations:
NumberNodes = (1MB * .3)/24 = 13107
RAMSize = 1MB + (NumberNodes*200B)
RAMSize = 1MB + 2.6MB
RAMSize = 3.6MB for a 1MB file.
I haven’t run real files through these formulas to check the assumptions, but I think these assumptions and formulas will get you in the ballpark.
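Here is the same back-of-the-envelope arithmetic as a small standalone Java program, so you can plug in your own guesses for NodeSize, DataPercent, and so on. The constants are just the assumptions above, not measured values.

// Rough estimate of XML parse tree RAM usage, using the assumptions and
// formulas from the post. All constants are guesses - swap in your own.
public class XmlRamEstimate {

    public static void main(String[] args) {
        long diskSize = 1L * 1024 * 1024; // DiskSize: 1 MB file, in bytes
        double dataPercent = 0.70;        // DataPercent: 70% data, 30% XML tags
        int avgTagNameSize = 10;          // AvgTagNameSize, in characters
        int nodeSize = 200;               // NodeSize: guessed bytes per Node object

        // Tag bytes per Node: two tagNames, each with two extra characters.
        int tagBytesPerNode = (avgTagNameSize + 2) * 2;

        // NumberNodes = (DiskSize * (1 - DataPercent)) / ((AvgTagNameSize + 2) * 2)
        long numberNodes = (long) (diskSize * (1 - dataPercent)) / tagBytesPerNode;

        // RAMSize = DiskSize + (NumberNodes * NodeSize)
        long ramSize = diskSize + numberNodes * (long) nodeSize;

        System.out.println("NumberNodes = " + numberNodes);   // ~13107
        System.out.printf("RAMSize = %d bytes (~%.1f MB)%n",
                ramSize, ramSize / (1024.0 * 1024.0));        // ~3.5 MB here; rounds to the ~3.6 MB above
    }
}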