I need to encode special characters like <, >, &, ' and " in a webMethods service. To test it, I made this simple XML document:
<?xml version="1.0" encoding="utf-8"?>
<main>
<element> here with less than < greater than > ampersant & character simplequote ' dobblequote "</element>
</main>
which I assign to a string variable.
Then I let that string variable be input to the flow service
pub.xml:xmlStringToXMLNode setting encoding = utf-8 and isXML = true
the resulting node is then input to this service
pub.xml:xmlNodeToDocument, where I map the output document to a variable of type Document.
However in this last step I get this error:
Launch started: 2022-03-02 07:33:34.242
Configuration name: encoding (1)
Configuration location: C:/Users/milun/workspace103/.metadata/.plugins/org.eclipse.debug.core/.launches/encoding (1).launch
com.wm.app.b2b.server.ServiceException: [ISC.0042.9325] Element <element> is missing end tag
at pub.xml.xmlNodeToDocument(xml.java:1037)
The funny thing is that if I omit the & character from the small XML string value I created, so that it looks like this:
<?xml version="1.0" encoding="utf-8"?>
<main>
<element> here with less than < greater than > ampersant character simplequote ' dobblequote "</element>
</main>
then it works.
Can anyone tell me why and also what to do about it?
The challenge is that I receive a whole XML message in a String variable. That XML message may contain characters like &, < and >. I currently send this string value to an external party, who then runs into trouble when converting the string into an XML document because of the special characters. Therefore, I would like to encode those characters so that they do not cause trouble when the string is converted to a real XML message.
My test xml sample is this:
<?xml version="1.0" encoding="utf-8"?>
<main>
<element> here with less than < greater than > ampersant & character simplequote ' dobblequote "</element>
</main>
You could use the service ‘pub.string:URLEncode’ to encode the string before mapping it into your document and then call documentToXmlNode. We also have ‘pub.string:base64Encode’ if you want to use that encoding instead.
However, the recipient will need to know what type of encoding you used.
regards,
John.
Set encode to true when calling this. That will encode any characters in any of the fields that need to be encoded, e.g. & to &amp; and < to &lt;. Unless you know with 100% certainty that the values in a particular document will never contain such characters, encode should always be set to true.
Proceed with caution if one uses this. The rules for URL encoding differ from escaping markup characters. Setting encode to true in the call to documentToXMLString is the way to go.
Side note: the doc for the encode input describes what is done and refers to it as “HTML encoding” but that isn’t quite accurate either as HTML encoding rules again are slightly different from XML encoding.
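To make the difference between the two kinds of encoding concrete, here is a minimal plain-Java sketch. It is not the implementation behind the encode input; the class and method names are made up for illustration, and it only assumes the five predefined XML entities and the standard java.net.URLEncoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
final class XmlEscapeSketch {

    // Escapes the five characters that have predefined XML entities.
    // Everything is done in one pass, so an ampersand produced by an
    // earlier replacement is never escaped a second time.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // URL (form) encoding of the same text, for comparison.
    static String urlEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String raw = "a < b & c";
        System.out.println(escapeXml(raw)); // a &lt; b &amp; c
        System.out.println(urlEncode(raw)); // a+%3C+b+%26+c
    }
}
```

An XML parser turns the first output back into the original text on its own. The second output would reach the recipient as literal plus signs and percent sequences unless they explicitly URL-decode it.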
Just some other questions that come up here:
From where, and how, are you retrieving the XML string variable?
How do you send the converted xml document to the external party?
Answering these questions might help us to determine further steps to be checked.
Replies about using encode in documentToXMLString aside, getting back to the original behavior when using the string from @Mikael_Lund …
As @Holger_von_Thomsen noted, this is not well-formed XML, nor is it valid XML: the & character cannot be there. It would need to be &amp;. But xmlStringToXMLNode parses it with no complaints.
However, proceeding to the next step, calling xmlNodeToDocument fails with
Malformed entity reference: & character simplequote ' dobblequote "
As expected. But this differs from the error you encountered. Is there a different string you used that generated the “missing end tag” error?
If you’re starting with an XML string for processing, it must be valid and should be well-formed before it is passed to xmlStringToXMLNode, etc.
Thanks @toni.petrov for editing the original post to expose the markup properly. That’s a much different scenario.
But the issue is the same. The XML is malformed. It cannot have a plain & in the element value. That must be &amp; for it to be valid and processed correctly. One should not try to URLEncode that string (or search and replace, etc.) to replace the & in that specific string. Doing that is not a good approach, and URLEncode will not do what you want.
To emphasize an earlier note: if you’re going to start with an XML string, that string must be valid.
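If it helps to check a string outside of Integration Server, a strict parser makes the point quickly. The following is only a sketch using the standard JAXP parser that ships with the JDK; the class name is made up, and it says nothing about how xmlStringToXMLNode or xmlNodeToDocument parse internally.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, for illustration only.
final class WellFormedCheck {

    // Returns true if the JDK's XML parser accepts the string,
    // false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<main><element>a & b</element></main>"));     // false
        System.out.println(isWellFormed("<main><element>a &amp; b</element></main>")); // true
    }
}
```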
Thanks for your reply. I receive the XML document as one big string value and I pass it on as such. However, sometimes the troublesome character & is present in that string. So one option would be to encode the whole string value, but I don’t know if this is the correct way to approach this. (Maybe first do a string replace of &amp; to & and then a string replace of & to &amp;, so as not to mess up any correctly encoded ampersands?)
Another option would be to convert the string into a known document (first pub.xml:xmlStringToXMLNode and then pub.xml:xmlNodeToDocument) and then do a replace of the & in the one tag where the problem lies. My challenge is though, that I don’t know of a smart way to do a replace on the value of this one tag without having to map each field in the whole document (any suggestions would be highly appreciated).
This means the system that is generating the XML has a bug. It is not valid to have an unescaped & in a value in XML. There is nothing reliable that can be done on the system that receives this XML to accommodate or correct it.
This will not work, because the XML is invalid and the parser will not be able to parse it accurately.
There is no way. The source must fix their error. The recipient cannot do anything to fix it.
The ampersand character (&) and the left angle bracket (<) *MUST NOT* appear in their literal form, except when used as markup delimiters, or within a [comment](https://www.w3.org/TR/xml/#dt-comment), a [processing instruction](https://www.w3.org/TR/xml/#dt-pi), or a [CDATA section](https://www.w3.org/TR/xml/#dt-cdsection). If they are needed elsewhere, they *MUST* be [escaped](https://www.w3.org/TR/xml/#dt-escape) using either [numeric character references](https://www.w3.org/TR/xml/#dt-charref) or the strings "&amp;" and "&lt;" respectively.
The reason for this constraint is that it is impossible to reliably parse XML when a literal & is in element content. Depending on the specific XML being used, you might get lucky with a search/replace, but there is a good probability that it will fail you at some point in the future. The source system must fix their error.
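To illustrate why a search/replace is only a guess, here is a small plain-Java sketch of two repair attempts: the two-step replace suggested earlier in the thread, and a regex heuristic that only touches an & that does not look like the start of an entity. Both method names and the regex are made up for this example and are not recommendations.

```java
import java.util.regex.Pattern;

// Hypothetical repair attempts, for illustration only.
final class AmpersandRepairSketch {

    // Matches an '&' that is NOT followed by something shaped like
    // an entity or character reference (name; or #123; or #x1F;).
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Heuristic: escape only the ampersands that do not look like entities.
    static String guessRepair(String brokenXml) {
        return BARE_AMP.matcher(brokenXml).replaceAll("&amp;");
    }

    // Two-step idea: un-escape &amp; first, then escape every '&'.
    static String twoStepReplace(String brokenXml) {
        return brokenXml.replace("&amp;", "&").replace("&", "&amp;");
    }

    public static void main(String[] args) {
        // Looks fine for the simple case.
        System.out.println(guessRepair("<e>Tom & Jerry</e>"));
        // -> <e>Tom &amp; Jerry</e>

        // If the sender's raw text really was "write &lt; for less-than",
        // the heuristic leaves it alone and a parser will read it as '<'.
        System.out.println(guessRepair("<e>write &lt; for less-than</e>"));
        // -> <e>write &lt; for less-than</e>

        // The two-step replace damages every other correctly escaped entity.
        System.out.println(twoStepReplace("a &lt; b &amp; c & d"));
        // -> a &amp;lt; b &amp; c &amp; d
    }
}
```

In the second case there is no way for the receiver to know whether the sender meant the entity or the literal text, and raw text such as R&D; would be mistaken for an entity reference and left unescaped. That ambiguity is why the fix belongs in the system that generates the XML.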
Side note: try to avoid CDATA sections as a “work-around”. They can be somewhat painful to deal with and are almost never necessary.