jtidy and Xerces-j, anyone?

I’m currently fiddling around with some kind of harvester. The idea is to monitor web pages with a small Java program that parses a web page using jtidy, converts it to well-formed XML and pipes it through Xalan. The problem is that jtidy has its own DOM implementation, while Xalan insists on having a Xerces-generated DOM tree. A workaround through temp files is very ugly. Did anyone succeed with such a combination, or do you know about other HTML parsers that XMLify dirty HTML pages without much effort? I know about the javax.swing.text package, but there seems to be a lot of work to achieve what jtidy does with a single method invocation …


We work intensively with Xalan 1.2.x, and most of the time our input is a DOM generated by Xerces.

You might know that the Xalan processor can be configured to use a Xerces DOM as input.

Surprise! Guess what? Using a Xerces DOM for the XSLTInputSource() is a lot slower than serializing the Xerces DOM to XML in a buffer and then using that buffer for the XSLTInputSource().

This is because the internal DOM that Xalan rebuilds from the buffer is highly optimized: it seems most of the strings are converted to integers and efficient lookup maps are built. This speeds up the XSLT processing afterward.

→ my conclusion is: don’t hesitate to XML serialize your DOM before passing it to Xalan.
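For illustration, here is a minimal sketch of that serialize-first approach. It uses the standard JAXP API (javax.xml.transform) rather than the Xalan 1.x classes mentioned above, so treat it as an assumption about how the same idea looks today, not as the code we run. The class and method names (SerializeFirstSketch, transformViaBuffer) are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: serialize a DOM to a buffer, then let the XSLT
// processor build its own internal tree from that buffer.
class SerializeFirstSketch {

    static String transformViaBuffer(Document dom, String xslt) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();

        // Step 1: a transformer created without a stylesheet copies its input
        // unchanged, so here it serializes the DOM to XML bytes in memory.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        factory.newTransformer().transform(new DOMSource(dom), new StreamResult(buffer));

        // Step 2: run the real stylesheet against the serialized bytes, so the
        // processor parses them into its own tree representation.
        Transformer stylesheet =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        stylesheet.transform(
                new StreamSource(new ByteArrayInputStream(buffer.toByteArray())),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a DOM produced elsewhere (sample data for the sketch).
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                        "<html><body><p>hello</p></body></html>")));
        String xslt =
                "<xsl:stylesheet version='1.0' "
              + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
              + "<xsl:output method='text'/>"
              + "<xsl:template match='/'>"
              + "<xsl:value-of select='/html/body/p'/>"
              + "</xsl:template></xsl:stylesheet>";
        System.out.println(transformViaBuffer(dom, xslt));
    }
}
```

Whether this is actually faster than passing the DOM directly depends on the processor and its version, so it is worth measuring on your own documents.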

PS This should hopefully not be needed anymore with Xalan 2?

Xalan-Java-2 will accept any DOM as input, but there have been reports that transformation is much faster if you allow it to build its own tree. (The main reason is that sorting nodes into document order is very inefficient on a mutable DOM.) The best approach is probably to stream the JTidy DOM into Xalan [or Saxon, which is faster :-)] as a SAX event stream. You can convert the DOM to a SAX stream using the identity transformer available from the JAXP 1.1 API.
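A minimal sketch of that DOM-to-SAX route, using only standard JAXP classes. The class and method names (DomToSaxSketch, transformViaSax) are invented for the example, and the Document argument is assumed to be whatever DOM the HTML parser hands back (for JTidy, presumably the result of its parseDOM call). The sample DOM in main is built with the JDK parser purely so the sketch runs on its own.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: replay a DOM as SAX events into an XSLT processor.
class DomToSaxSketch {

    static String transformViaSax(Document dom, String xslt) throws Exception {
        TransformerFactory base = TransformerFactory.newInstance();
        // SAX support is optional in JAXP, so check before casting.
        if (!base.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("TransformerFactory lacks SAX support");
        }
        SAXTransformerFactory factory = (SAXTransformerFactory) base;

        // The handler is a SAX ContentHandler that applies the stylesheet to
        // the events it receives and writes the result to 'out'.
        TransformerHandler handler =
                factory.newTransformerHandler(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Identity transformer: walks the DOM and emits it as SAX events,
        // which the stylesheet handler consumes to build its own tree.
        factory.newTransformer().transform(new DOMSource(dom), new SAXResult(handler));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // Stand-in for the DOM an HTML parser would return (sample data).
        Document dom = dbf.newDocumentBuilder().parse(new InputSource(
                new StringReader("<html><body><p>hello</p></body></html>")));
        String xslt =
                "<xsl:stylesheet version='1.0' "
              + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
              + "<xsl:output method='text'/>"
              + "<xsl:template match='/'>"
              + "<xsl:value-of select='/html/body/p'/>"
              + "</xsl:template></xsl:stylesheet>";
        System.out.println(transformViaSax(dom, xslt));
    }
}
```

No temp file or intermediate buffer is involved here; which processor ends up behind TransformerFactory.newInstance() depends on what is on the classpath.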

Mike Kay

Thanks for the replies, I switched to Saxon which saved the day. Worked instantly.