Our partner is sending us an XML string with some special characters at the beginning. It does not fail in IS (IS seems able to parse it into a node even though it is not valid XML), but it fails in another system that uses a schema to do the parsing.
I would like to know whether there is a service in IS that does just basic XML validation, to prevent non-well-formed XML from entering the system.
A sample of the failed document looks like this:
You can use WmPublic/pub.schema:validate to validate the xml.
I don’t want to do full schema validation; I only want to check whether it is well-formed XML.
Invoke pub.xml:xmlStringToXMLNode with an invalid XML string and isXML set to true, and it automatically throws an error.
Ramesh, when the extra characters are outside the root tag, as in the sample I gave, pub.xml:xmlStringToXMLNode won’t fail; it still generates the node.
Thanks for your comments anyway.
But XMLNodeToDocument will fail. Did you try this?
But as long as the XML comes with <?xml> it will not fail, and it still parses the structure.
queryXMLNode will throw an error.
Ramesh, I tested with queryXMLNode; it still does not fail.
I think that once the node is generated, the stray characters have already been trimmed off.
If it always comes with the same extra characters, why not use a string replace to take those characters out?
I don’t think that it is unreasonable to expect a partner to send you well-formed XML. My first course of action would be to work with them to correct the root cause, before jumping through a lot of hoops to fix their problem for them.
Since you have stated that you can create a valid IS XML node from the string they are sending, why not just convert that node back into a string? The extra characters should be gone now, right?
That makes sense, but we want to flag the error so we can contact the partner and have them fix the problem. Although not many partners send these kinds of extra characters, we want a generic solution to handle any new cases in the future.
Manju, replacing is not acceptable; we might replace the same character in another part of the payload.
It seems we don’t have an XML check service in IS, so I may look at Java resources to do that.
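For anyone going the Java route, here is a minimal sketch of a well-formedness-only check using the standard JAXP SAX parser. No schema is involved. The class and method names (WellFormedCheck, isWellFormed) are made up for illustration, and wrapping this in an IS Java service is left out. It assumes the JDK's default parser, which rejects characters in front of the XML declaration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not an IS built-in service.
class WellFormedCheck {

    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // Turn on the parser's secure-processing limits for untrusted input.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            parser = factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create SAX parser", e);
        }
        try {
            // DefaultHandler ignores all content and rethrows fatal errors,
            // so any well-formedness violation surfaces as a SAXException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This does a full parse of the payload, so it costs more than a simple string check, but it catches junk before the declaration as well as problems inside the root element.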
In this case your special characters actually appear outside the root node of the document. I believe this is what allows IS to create a valid node rather than throw an exception. If you have malformed XML anywhere inside the root node, you get an exception every time. You can test this by removing or misspelling an end tag.
Given IS’ correct handling of malformed XML inside the root node of a document and your partner’s inclusion of extraneous characters outside of the root node, one workaround would be to take the substring of the XML string from its first character up to, but not including, the “<?xml” that opens the declaration. If this substring is not empty, then you can reject the document as invalid.
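A rough Java sketch of that substring idea, in case it helps. The names (PrologPrefixCheck, leadingJunk) are invented for the example, and the fallback to the first '<' when there is no XML declaration is my own assumption, not something IS does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an IS built-in service.
class PrologPrefixCheck {

    // Matches the start of an XML declaration ("<?xml" followed by whitespace),
    // so a "<?xml-stylesheet" processing instruction is not mistaken for it.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s");

    // Returns whatever precedes the XML declaration, or precedes the first '<'
    // when no declaration is present. An empty result means no leading junk.
    static String leadingJunk(String xml) {
        Matcher m = XML_DECL.matcher(xml);
        int start = m.find() ? m.start() : xml.indexOf('<');
        // No markup at all: treat the whole string as junk.
        return start < 0 ? xml : xml.substring(0, start);
    }
}
```

A caller would reject the document when the returned string is not empty, and could log the returned characters when contacting the partner.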
pub.xml:xmlStringToXMLNode and XMLNodeToDocument are indeed the services to enforce well-formed XML (or allow non well-formed). xmlStringToXMLNode does some basic validation (e.g. matching tags) but it allows data leading up to the prolog/first tag and ignores it. In other cases, docs that are not well-formed generate errors.
I’ve seen threads that describe various parsers that ignore whitespace in front of the prolog, something that is supposedly against XML rules but I can’t find anything that says that explicitly. The parser in Integration Server clearly allows any amount of junk prior to the prolog. It will even successfully parse with leading junk in front of the first tag when no prolog is present. Very forgiving.
IMO, Mark’s suggestion to just strip the leading junk using built-in services is the path of least resistance. If you really, really need to detect leading junk, checking that the first character is a ‘<’ should be sufficient; xmlStringToXMLNode and xmlNodeToDocument should be able to do the rest. Or you could swap out the XML parser and use one that detects leading junk, but that’s most likely more hassle than it’s worth.
Thanks, Mark & Rob.
I think I will use the approach of checking whether the first character is ‘<’. This will have the least performance impact, which is critical for us.
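For reference, the first-character check is about as small as it gets in Java. This is only a sketch; the class and method names are placeholders, and it deliberately treats leading whitespace or a byte order mark as junk too, which may be stricter than you want.

```java
// Hypothetical helper, not an IS built-in service.
class FirstCharCheck {

    // True only when the payload is non-empty and begins with '<'.
    // Constant-time: it looks at a single character and never scans the payload.
    static boolean startsWithMarkup(String xml) {
        return xml != null && !xml.isEmpty() && xml.charAt(0) == '<';
    }
}
```

Anything that passes this check still goes through xmlStringToXMLNode, which catches malformed content inside the root element.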
Thanks to everyone who tried to help.