XML Parsing Vs Performance

Hi,

I have a very large XML file with more than 100,000 BoM records. I am able to receive the file and convert it to a document based on a schema. After that, I need to loop through the lines in the XML and map the document structure based on an attribute value. I can do this using a loop and branch. The issue is performance: the process takes more than 4 hours to complete.

Is there an optimized approach available to improve the performance?

Thanks,
Jey.

What is the machine doing while this is going on?

Is it doing virtual memory disk I/O or are you just CPU-bound?

If you are getting slowed down by virtual memory I/O, you can try adding more memory.

However… Flow services are not an ideal place to do this kind of large-scale (100,000 records is in that category) processing. I would recommend that you use IS to route the document, but hand the document off to a Java service or even better, an external app server for processing.

Jey - My suggestion: since you’re talking about a single large XML file, search the WM documentation and this website for “large document handling” practices, such as taking a stream-oriented approach to handling data. I myself haven’t used large doc handling, so someone else can help you better.

Sonam,

Wouldn’t you think that with 100,000 BOM records, that using large document handling would be very slow?

Regards

Try the queryXMLNode methods to handle large documents. This way you will be able to extract only one line item instead of looping.

Hi Mark - As I said, I’m no expert and I haven’t used those techniques myself. But it is one very large document that Jey has problems with. So it would be worth investigating WM’s published techniques on handling large documents before moving to other approaches.

Even if you can do it with Large File Handling, it cannot even come close to a Java service in terms of performance.

Ramesh: Similarly, Java services don’t come close to the simplicity and ease of maintenance of Flow services.

So it’s a trade-off worth investigating.

Sonam,

I’d agree that it’s always wise to try the simple solution first.
I assumed that with 100,000 records (especially with a BOM application, which tends to be recursive) flow-based large-document handling was just out of the question.

Thanks for the great info you’ve posted on this site!

I’ve clocked some really fast results with a Java service that implemented a SAX parser. Definitely much more complex though.
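A minimal sketch of what such a SAX-based pass could look like in plain Java. It assumes a hypothetical layout in which each BoM record is a `bomLine` element carrying a `type` attribute; neither name comes from the actual schema in this thread. The handler counts records per attribute value as the parser streams through the input, so no parse tree is held in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class BomSaxSketch {

    // Hypothetical names: "bomLine" and "type" stand in for the real
    // record element and the attribute that drives the mapping.
    static Map<String, Integer> countByType(InputStream xml) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Called once per opening tag; only record elements are counted.
                if ("bomLine".equals(qName)) {
                    String type = atts.getValue("type");
                    counts.merge(type == null ? "(none)" : type, 1, Integer::sum);
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<bom><bomLine type='A'/><bomLine type='B'/><bomLine type='A'/></bom>";
        InputStream in =
            new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(countByType(in));
    }
}
```

In a real service the counting step would be replaced by whatever mapping the attribute value selects; the point of the sketch is only that each record is handled as it is read.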

Regards

Thanks for the kind words Mark

This discussion got me thinking, so I searched on Advantage and found this document which looks very useful:

GEAR 6 XML Handling Implementation Guide

GEAR_6_XML_Handling_Implementation_Guide.pdf

On page 23 it says:

The webMethods Built-in XML Parser

webMethods has implemented its own XML parser, but one may substitute this parser with any parser that is compliant with the SAX 1.0 APIs (http://www.megginson.com/SAX/SAX1/index.html). It is easy to plug another XML parser into the webMethods Integration server, as demonstrated by sample code that is distributed with the server. The server will use the inserted parser for parsing all of the XML documents that the server encounters.

If no other parser is inserted, webMethods Integration Server uses the webMethods XML parser. The webMethods parser has many advantages over traditional parsers.

It supports monstrously sized XML documents.

The parser supports several streaming technologies, including load-on-demand, parse-on-demand, and windowing. Load-on-demand and parse-on-demand allow XML documents to be loaded and parsed only as far as needed by the application.

If the application only needs header information from the front of a large document, it need not waste memory or clock cycles loading and parsing the whole document. Windowing allows applications to scan an XML document for the information it needs without holding the entire document in memory at once. The parser returns only the nodes that the application requires, and 100MB and larger documents can be completely parsed using less than 100K.

Pg 21 seems to have very useful information on parsing large files with Flow services.

-mdc Modified on 5/26/2005 to remove long URL


Hi Sonam,

It’s been a while. Hope things Down-Unda are fine!

I’ve worked for one client who actually replaced the xml parser. Most people leave it alone and use the internal packaged version.

You pegged it right on the head when you described the streaming process.

What you want to do is get a node object, then call pub.xml:getXMLNodeIterator. If you set moving window to true, then it loads the document via stream, node by node as per your command.

Pass the iterator to getNextXMLNode, which will return a record that contains the element name and a node that contains the contents of the call that got that node. You will need to pass that node to a service like queryXMLNode (try not to use it, as it is expensive) or xmlNodeToDocument (the better choice), which will allow you to extract the pieces that you want.

The only drawback that I have experienced is that it only retrieves nodes at the same level.

 
<document>
     <header>...</header>
     <po_lines>...</po_lines>
     <trailer>...</trailer>
</document>

So in the example above, if po_lines has any children then they will be returned as a result of getting next node.

Also, the built-in services are flexible enough to allow you to get nodes by name.

I have used a repeat (on success) function to return the next node until the document is fully parsed.
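For readers who want to see the same node-by-node idea outside Flow, here is a rough analogue in plain Java. It uses the standard StAX API (`XMLStreamReader`), not the webMethods services named above, so it only illustrates the pattern rather than how the IS iterator is implemented. The element names are taken from the example document; the nested `line` elements in the sample input are an assumption. The loop reports each direct child of the root once, together with a count of its descendants, and never holds more than the current event in memory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class TopLevelNodeWalk {

    // Returns one entry per direct child of the root, e.g. "po_lines(2)",
    // where the number is how many descendant elements that child contains.
    static List<String> walk(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> seen = new ArrayList<>();
        int depth = 0;          // 1 = root element, 2 = direct child of root
        String current = null;  // name of the direct child being read
        int descendants = 0;    // elements nested below the current child
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                if (depth == 2) {
                    current = reader.getLocalName();
                    descendants = 0;
                } else if (depth > 2) {
                    descendants++;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (depth == 2) {
                    // The child is complete; hand it off and move on.
                    seen.add(current + "(" + descendants + ")");
                }
                depth--;
            }
        }
        reader.close();
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<document><header/>"
                + "<po_lines><line/><line/></po_lines>"
                + "<trailer/></document>";
        System.out.println(walk(sample));
    }
}
```

The `seen.add(...)` call marks the point where a Flow service would process the node returned by getNextXMLNode before asking for the next one.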

Ray

Thanks for sharing your experience parsing large documents, Ray. It was extremely useful, as usual.

I found it pretty insightful how you used the REPEAT step to query “into” multi-level documents easily. My first idea would be recursion, but that has a penalty on IS. I once wrote a simple Flow service to test out recursion support - IS ran out of stack space by around 70 nested Flow calls. Of course, adjusting Java stack space would help, but the REPEAT step is a great solution. (I was wondering where it would be used)

Things are pretty cool down under (getting cooler all the time, thanks to the reversal of seasons in this hemisphere). Hope you and yours are well and hope to see you again sometime.

Hi,

I am new to this field and I have a problem with Large Document Handling (XML). We have a bunch of large XML docs, which we are trying to load into a DB2 database by passing them through TN.

Thanks,
Ramakrishna

Hi,

We have very large XML documents (350+ MB), which we need to pass through TN. We are converting the document to an XML Node and then passing it to TN. Is there a procedure to handle documents this large? When I pass the document to TN, it gives an “OUTOFMEMORY” error.

Thanks,
Ramakrishna.

As stated earlier in this thread, check http://Advantage.webMethods.com for the TN doc on Large Document Handling.

I see a section in this doc:
TN Concepts Guide 6.5

There is also an example of using the WmPublic services for XML streaming in WmSamples sample.complexMapping.largeDoc. This should give you an idea on how to leverage chunks of documents in the downstream processing services.

An XML parse tree tends toward 10x the size of the raw XML. Xerces, Electric XML and the Integration Server parser are all in the same ballpark of overhead, so streaming is the only way to go. The IS parser is the only one that supports a moving window over the underlying raw XML stream, so the full document or parse tree never has to be in memory.

Cheers,
Fred

Edited by moderator to remove extra long URL for formatting purposes. -mdc

(Message edited by mcarlson on August 29, 2005)

Hi All,

I was wondering if schema validation can handle large files as well.
We have a 500 MB XML file, which we need to schema validate.
Reading the file as a stream is not an issue, but when we try to schema validate, the entire IS server hangs.
We have high-performance Unix machines, and we tried doing the validation using SAX parsers by writing standalone Java code; it takes no more than 60 seconds to parse the entire file.
Would appreciate your guidance on this.

Thanks
Ade

A 500MB XML file could easily be a 5GB DOM, so the only way to reliably handle it would be to stream it in and validate chunks at a time.

If you send the top-level Node object, which is a wrapper on the streaming XML document, to validate, that service will attempt to read in, parse and validate the entire document. It probably would take a while, then run out of memory.

Look at the WmPublic pub.xml:getXMLNodeIterator. It can operate in ‘moving window’ mode, so only part of the original file stream is in memory with its corresponding parse tree at any one time.

HTH,
Fred

Hi Fred.

Many thanks for your response.
Validating in chunks using the iterator is precisely what I opted for, and the schema validation took around 12 minutes with 2 GB of memory allocated to the IS server. Another problem that popped up after validating this huge document is IS memory. It reached 90% after the processing was over and remained at this peak for quite a long time while nothing was being processed on the server. IS became non-responsive and I had to kill the instance.
I took a Heap dump and noticed following objects in the memory

Size Class Address

3,521,448 [56] 4 com/wm/lang/xml/TextNode 0x50155908
3,375,880 [72] 7 com/wm/lang/xml/ElementNode 0x50194688
3,023,152 [72] 7 com/wm/lang/xml/ElementNode 0x500e37e0
2,785,008 [72] 7 com/wm/lang/xml/ElementNode 0x5018b2d8
2,700,544 [72] 7 com/wm/lang/xml/ElementNode 0x5014e7f8
2,490,232 [72] 7 com/wm/lang/xml/ElementNode 0x500f3920
2,217,208 [72] 8 com/wm/lang/xml/ElementNode 0x501a94e0
1,977,920 [72] 7 com/wm/lang/xml/ElementNode 0x501764f0
1,872,040 [72] 7 com/wm/lang/xml/ElementNode 0x501adfd0
1,750,152 [72] 7 com/wm/lang/xml/ElementNode 0x50300358
1,710,856 [72] 7 com/wm/lang/xml/ElementNode 0x5018c6b0
1,557,416 [72] 7 com/wm/lang/xml/ElementNode 0x50300310
1,526,728 [72] 7 com/wm/lang/xml/ElementNode 0x501ace48
1,373,288 [72] 7 com/wm/lang/xml/ElementNode 0x5030b0e8
1,357,944 [72] 7 com/wm/lang/xml/ElementNode 0x501b0d48
1,212,176 [72] 7 com/wm/lang/xml/ElementNode 0x50321cc8
1,204,504 [72] 7 com/wm/lang/xml/ElementNode 0x501b2750
1,081,752 [72] 7 com/wm/lang/xml/ElementNode 0x50321e00
1,075,936 [56] 2 com/wm/lang/xml/TextNode 0x50176538

In my process flow I am dropping all the objects after each validation iteration, but it still looks like not all the objects are being cleaned up, and this is causing memory issues. I manually executed the GC but that didn’t help either.

Secondly, 12 minutes of parsing time is still quite long compared to what my Java colleagues show: an 80-second response time for a complete 400 MB validation using a SAX parser factory in a standalone Java program.
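For reference, a standalone streaming validation of that general kind might look like the sketch below. This is an assumption about the approach, not the colleagues' actual code. It uses the standard `javax.xml.validation` API and feeds the validator a `StreamSource`, which in the JDK's default implementation should be read as a stream rather than loaded into a DOM. The tiny schema and the `bom`/`bomLine`/`type` names are hypothetical and exist only to make the example self-contained.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class StreamingValidationSketch {

    // Hypothetical schema: a <bom> root holding <bomLine> elements,
    // each with a required "type" attribute.
    static final String SAMPLE_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='bom'><xs:complexType><xs:sequence>"
        + "<xs:element name='bomLine' maxOccurs='unbounded'><xs:complexType>"
        + "<xs:attribute name='type' type='xs:string' use='required'/>"
        + "</xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Returns true if the XML conforms to the schema, false on a validation error.
    static boolean isValid(String xsd, String xml) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, the first validation error is thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid(SAMPLE_XSD, "<bom><bomLine type='A'/></bom>"));
        System.out.println(isValid(SAMPLE_XSD, "<bom><bomLine/></bom>"));
    }
}
```

For a file on disk, `new StreamSource(new java.io.File(path))` can be passed instead of the string readers; timings will of course depend on the schema and the hardware.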

See attached html snapshot of my service.
and thanks again.
Ade

Schema Validation Service
schema_validateSchema_files.zip (19.9 k)

The performance results you are getting correlate with my experience as well.

I try to use ETL tools or (surprise!), SAX parsing for files this large, or big batches (100’s) of XML files.

I’d be interested to find out what is the fastest performance anyone has seen parsing a 500MB document with a node iterator approach.

If you try hard enough, you can drive a nail with a screwdriver, but the process can be painful :sunglasses:

I would not expect to get ETL tool performance, but I don’t expect it to be this bad either. It has been 6 years since I helped write the parser, and the schema validation code didn’t exist at that time (only DTDs existed). It doesn’t look like the parser level is handling memory the way it should, and I suspect there is an additional bottleneck in the validation layer.

Would it be possible to get the TransformPerf.xml file and the SAX program code for me to forward to our performance testing team? I can’t promise anything immediate, but having this scenario in the base performance report will force the problem into everyone’s face. Which, to me, always seems like a good way to get something fixed.

Send to support at webmethods dot com with text to forward to Fred Hartman (pardon for not sharing my direct email, I just don’t want some non-wMUsers sending me spam).

Cheers,
Fred