To handle languages other than English...

The data in our XML generally contains English characters. But if it contains another language, e.g. Korean, how will that be handled in webMethods?

Thanks,
Anand

Hi Anand,
webMethods is a pure Java application. Java supports a large number of character sets, so webMethods supports them as well. If you are processing the data and saving it to a database or filesystem, then you have to consider whether the operating system or database where you are saving it supports that character set or not.
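To illustrate the point about Java's character set support, you can ask the JVM directly whether it knows a given charset (a minimal plain-Java sketch, run outside Integration Server; EUC-KR is one of the Korean encodings shipped with standard JREs):

```java
import java.nio.charset.Charset;

public class CharsetSupport {
    // Returns whether this JVM can encode/decode the named charset.
    static boolean supports(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:  " + supports("UTF-8"));   // guaranteed by the Java spec
        System.out.println("EUC-KR: " + supports("EUC-KR"));  // Korean; present in standard JREs
    }
}
```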

Please explain your requirement.

Use the ISO-8859 encoding in your XML data; it supports multiple languages and symbols.

-nD

Your XML should use one of the Unicode encodings (UTF-8, UTF-16). Unicode supports East Asian languages (and most languages in the world).
ISO-8859 is mostly for European languages.

The default character encoding of XML is UTF-8, and recent JVMs also default to UTF-8 (older JVMs take their default from the OS locale). Unless specified otherwise, these are the encodings that will be used during processing. If your Integration Server, JVM, and OS combination is going to work primarily with something other than UTF-8, then it may make sense to change the default character set.
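You can inspect what a given JVM is actually defaulting to; this small sketch prints the effective default charset and the `file.encoding` system property (which can be overridden at startup, e.g. with `-Dfile.encoding=UTF-8`):

```java
import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // Comes from the OS locale unless overridden on the command line.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:       " + System.getProperty("file.encoding"));
    }
}
```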

In conjunction with setting the proper default, you’ll want to make sure your integrations use the proper encoding at all points. Any service that accepts an encoding input should be reviewed to make sure you’re passing the right thing. This includes stringToBytes, bytesToString, documentToXmlString and others.
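As an illustration of what those encoding inputs control, here is roughly what a stringToBytes/bytesToString round trip does underneath when you pass a charset (a plain-Java sketch; the helper names here are just for illustration, not the actual service implementations):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // Illustrative stand-ins for stringToBytes / bytesToString with an encoding input.
    static byte[] stringToBytes(String s, Charset cs) {
        return s.getBytes(cs);
    }

    static String bytesToString(byte[] b, Charset cs) {
        return new String(b, cs);
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAE00"; // "한글" (Hangul)
        byte[] wire = stringToBytes(korean, StandardCharsets.UTF_8);
        String back = bytesToString(wire, StandardCharsets.UTF_8);
        System.out.println(korean.equals(back)); // same charset both ways: lossless
    }
}
```

If the two sides use different charsets, the round trip silently produces wrong characters instead of failing, which is exactly the "corruption" complaint described above.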

For flat files, you’ll need to get agreement ahead of time as to what the encoding should be. Files might have byte order marks (BOM) that indicate their encoding. In this case, the agreement with the partner should be that those will be present and if they are not, what to do (assume UTF-8, ISO-8859-1, etc. or fail).
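A hypothetical helper for that agreement might look like this: inspect the first bytes of the file, report the encoding implied by a BOM if one is present, and otherwise fall back to whatever was agreed with the partner (or fail, per the agreement):

```java
public class BomSniffer {
    // Hypothetical helper: returns the encoding implied by a byte order mark
    // at the start of the data, or the agreed fallback when no BOM is present.
    static String sniff(byte[] head, String agreedFallback) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return agreedFallback; // no BOM: assume what was agreed, or throw here instead
    }

    public static void main(String[] args) {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(sniff(utf8Bom, "ISO-8859-1"));          // BOM found: UTF-8
        System.out.println(sniff(new byte[]{'a'}, "ISO-8859-1"));  // no BOM: fallback
    }
}
```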

ISO-8859-* does support multiple languages and symbols, but you need to know which one ahead of time. ISO-8859-1 is the Latin-1 set, supporting Western European characters. ISO-8859-8 supports Hebrew. There are 15 variants. All are 8-bit (single-byte) encodings and none of them supports Korean.

Search for “korean character encoding” and you’ll find the multiple encodings for Korean supported by Java.
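You can verify the single-byte limitation directly in Java: ask a charset's encoder whether it can represent a Korean character (a small sketch; '한' is the Hangul syllable U+D55C):

```java
import java.nio.charset.Charset;

public class KoreanEncodable {
    // Returns whether the named charset can represent the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char hangul = '\uD55C'; // '한'
        System.out.println("ISO-8859-1: " + canEncode("ISO-8859-1", hangul)); // false: single-byte Latin-1
        System.out.println("UTF-8:      " + canEncode("UTF-8", hangul));      // true: Unicode covers it
    }
}
```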

Failure to properly handle character encoding will result in complaints about data being “corrupted” although it isn’t corrupted–it’s encoded using a character encoding that isn’t desired/expected.

XML tags must use common English characters; aside from that, the content within the tags can be handled, as everyone explained, in UTF-8, which covers Korean, Japanese and Chinese, for example.

It’s always good to work directly in UTF-8.

“XML tags must use common English characters…”

I’m not sure that’s true. From the XML Recommendation:

“Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.”

The Extensible Markup Language (XML) 1.0 (Fifth Edition) recommendation defines the characters explicitly allowed in tag names.

Oh yes, you are totally right: http://en.wikipedia.org/wiki/XML#International_use

However, one of the most common issues arises when using an editor set to, for example, ISO-8859-1 encoding while writing a UTF-8 document that contains an umlaut character such as ä.

Just saying that the file is encoded in UTF-8 in the XML declaration doesn’t mean that it is.

For example, a small u with umlaut (ü) is U+00FC; encoded in UTF-8 it takes 2 bytes (C3 BC), and encoded in ISO-8859-1 it is the single byte FC.

If you write the XML with an ISO-8859-1 editor, it will write the character as one byte, even if you set the XML declaration to UTF-8.

The XML validator will expect a 2-byte char, and will give a validation error.
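The byte counts above are easy to confirm in Java; this sketch encodes ü both ways and prints the resulting bytes:

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    public static void main(String[] args) {
        String u = "\u00FC"; // ü
        byte[] utf8 = u.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = u.getBytes(StandardCharsets.ISO_8859_1);
        // & 0xFF widens the signed byte to its unsigned value for printing.
        System.out.printf("UTF-8:      %d byte(s): %02X %02X%n",
                utf8.length, utf8[0] & 0xFF, utf8[1] & 0xFF);   // 2 bytes: C3 BC
        System.out.printf("ISO-8859-1: %d byte(s): %02X%n",
                latin1.length, latin1[0] & 0xFF);               // 1 byte: FC
    }
}
```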

As a recommendation, I always ask people to use regular English characters for element and attribute names.

But again you are totally right! Thanks for the rectification.

Regards.

You’re right that simply placing the UTF-8 attribute in the declaration isn’t sufficient. The encoding actually used must match the declaration. If the declaration states UTF-8 but the data is actually encoded in ISO-8859-1 (or vice versa), or any other encoding, then one is just asking for trouble.

“XML validator will expect a 2-byte char, and will give a validation error.”

Strictly speaking, it won’t give an XML validation error per se. The XML processor will never see that character because the lower-level reader will throw an encoding exception: the lone byte 0xFC isn’t valid UTF-8 and is never the first byte of a multi-byte sequence. The reader won’t be able to convert the byte(s) to a character.

On the reverse, if the XML is encoded in UTF-8 but the reader assumes (or is forced to use) ISO-8859-1, then the lower-level reader will not complain. The XML parser will complain about BC being a character in a name (depending upon the parser and how well it conforms with the recommendation). The 2 bytes making up the UTF-8 ü will be treated as 2 separate characters–C3 (Ã, which is allowed in a name) and BC (¼, which is not allowed in a name).
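Both failure modes can be demonstrated in a few lines of Java (a sketch; note that plain `new String(bytes, UTF_8)` silently substitutes replacement characters, so a strict `CharsetDecoder` is needed to see the exception behavior described above):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    // Decode bytes as strict UTF-8; returns null when the bytes are malformed.
    static String strictUtf8(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] latin1U = {(byte) 0xFC};             // ü encoded as ISO-8859-1
        byte[] utf8U = {(byte) 0xC3, (byte) 0xBC};  // ü encoded as UTF-8

        // ISO-8859-1 bytes read as UTF-8: the decoder rejects 0xFC outright.
        System.out.println(strictUtf8(latin1U));    // null: malformed input

        // UTF-8 bytes read as ISO-8859-1: no error, just two wrong characters.
        System.out.println(new String(utf8U, StandardCharsets.ISO_8859_1)); // Ã followed by ¼
    }
}
```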

Fun exploration!