Natural and UTF-8

I think that, from the ADABAS point of view, it’s irrelevant whether the characters are stored in ISO-8859 or UTF-8 format. But what about Natural? How would Natural display a UTF-8 encoded umlaut or other special characters?

As a first test, I wrote the following simple program:

write
  'UTF-8 encoded umlaut    A is:' H'C384' /
  'UTF-8 encoded ligature sz is:' H'C39F' /
  'UTF-8 encoded euro symbol is:' H'E282AC'
end

This works quite well, provided I set my terminal emulation to UTF-8 first! But at the end of the line, some characters from the previous screen are still displayed.
The next problem comes with maps. For example, if I write a map containing the euro symbol in UTF-8, the symbol is written into the map source with a length of 3 bytes, but it is displayed with a width of one character. This causes problems in the map editor: it is impossible, for instance, to fill a map of width 79 with 79 characters. On top of that, it is almost impossible to edit such a text constant.
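
To make the byte-versus-column mismatch concrete, here is a small sketch in Python (not Natural, purely for illustration). It decodes the three hex constants from the test program and compares the number of bytes with the number of characters. The helper name `describe` is made up for this example.

```python
# Byte sequences taken from the hex constants in the WRITE statement above.
SAMPLES = ["C384", "C39F", "E282AC"]


def describe(raw: bytes) -> tuple:
    """Return (decoded text, number of bytes, number of characters)."""
    text = raw.decode("utf-8")
    return text, len(raw), len(text)


for hex_value in SAMPLES:
    text, n_bytes, n_chars = describe(bytes.fromhex(hex_value))
    print(f"H'{hex_value}' -> {text!r}: {n_bytes} bytes, {n_chars} character(s)")

# A map line filled with 79 euro signs occupies 79 columns on screen,
# but 3 * 79 bytes once it is stored as UTF-8.
line = "\u20ac" * 79
print(len(line), "characters,", len(line.encode("utf-8")), "bytes")
```

Any editor that counts bytes as screen columns will see such a line as far longer than 79, which would be consistent with the behaviour described above.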

Here’s my test-map:

* MAP2: PROTOTYPE --- CREATED BY UNIX 6.1.1 ---
* INPUT USING MAP 'XXXXXXXX'
FORMAT PS=003 LS=080 ZP=OFF SG=OFF KD=OFF IP=OFF
* MAP2: MAP PROFILES ***************************** 200***********
* .TTAAAMMOO D I D I N D I D I ?_)^&:+( *
* 003079 N0NNUCN X 01 SYSDBA NR *

INPUT ( IP=OFF /*
)
001T '----+----1----+----2----+----3----+----4----+----5----+---'-
'-6----+----7----+----'
/
001T 'Here comes the EUR-symbol:

Hi Matthias,

There were a couple of articles on Natural and Unicode in the last newsletters (and there is one more coming up in June). They are archived at http://developer.softwareag.com/ets/knowhow/Nat_knowhow.htm.

Maybe you can get some information from them.
Steve

Hello Steven Wild!
Thanks for the link to the Unicode-documents. Here are some excerpts:

At the moment I’m working with Natural 6.1.1 for Solaris. It seems that I have to wait a little bit …

If I understand this correctly, Natural uses UTF-16, and all related statements (like EXAMINE, MOVE SUBSTR) are adapted to handle 2 bytes per character. But UTF-16 does not mean that every character can be represented by 2 bytes. Characters above U+FFFF are represented by 4 bytes (a surrogate pair).
See: http://en.wikipedia.org/wiki/UTF-16/UCS-2#Examples
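
For anyone who wants to check the arithmetic, here is a minimal Python sketch of how a code point above U+FFFF is split into a surrogate pair. It says nothing about how Natural does this internally; the function names `to_surrogates` and `utf16_size` are made up for the example.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


def utf16_size(ch: str) -> int:
    """Number of bytes one character occupies in UTF-16."""
    return len(ch.encode("utf-16-be"))


high, low = to_surrogates(0x10437)
print(f"U+10437 -> {high:04X} {low:04X}")

print(utf16_size("\u20ac"))      # euro sign, inside the 16-bit range
print(utf16_size("\U00010437"))  # needs a surrogate pair
```

So any code that treats "2 bytes" and "1 character" as the same thing is only correct for code points up to U+FFFF.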

So my question is: Will Natural 6.2 (Open Systems) support UTF-16 with or without surrogates?

The need for surrogate pairs is, I think, one reason why UTF-8 became the quasi-standard for Unicode representation.

Next question: Does Natural 6.2 only support UTF-16?

This could be a problem, because most of the XML documents I’ve seen so far use UTF-8. Even my PuTTY terminal emulation can only handle UTF-8. OK, it would be possible to write a converter for the XML files (but I will not do this in Natural)…

Just for your information: Natural 6.2 is already available on Solaris.

Peter

Yes, I know. But I’m not the one who does the installations here. So I have to wait … ;-)

Matthias wrote:

Next question: Does Natural 6.2 only support UTF-16?

With Nat 4.2 and Nat 6.2, Natural introduces a new format ‘U’. Natural for UNIX and Windows handles this format internally as UTF-16, but only for ‘U’ variables. All other variables are treated as before.

From my point of view, it would be best to read the documentation first, because it answers many of these questions.

First of all: I don’t have any documentation for Nat 6.2. I only have the links to the documents that Stephen Wild mentioned above. There I read about the new U format in Natural (which is UTF-16) and the corresponding new W format in ADABAS (which is UTF-8). The conversion between Natural and Adabas is done automatically.

From my point of view UTF-8 is the quasi-standard. And my question still remains: Does Natural 6.2 only support UTF-16?
In other words: What about Workfiles and XML-docs with UTF-8 encoding?

Hello Matthias,
Natural internally stores Unicode characters in UTF16 format. This allows quick scanning in a Unicode string as each character starts at a fixed location.
The MOVE statement was enhanced to allow conversion from one code page to another. Please refer to the MOVE ENCODED statement.

You may specify UTF8 as source and UTF16 as target and vice versa, which performs the conversion.
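
To illustrate what such a UTF-8 to UTF-16 conversion does at the byte level, here is a short Python sketch. It is not Natural syntax and is not meant to describe how MOVE ENCODED is implemented. The function names are made up, and big-endian byte order is an assumption chosen only to keep the hex output readable.

```python
def utf8_to_utf16(raw: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 (big-endian, no BOM: an assumption)."""
    return raw.decode("utf-8").encode("utf-16-be")


def utf16_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16 (big-endian, no BOM) bytes as UTF-8."""
    return raw.decode("utf-16-be").encode("utf-8")


euro_utf8 = bytes.fromhex("E282AC")
euro_utf16 = utf8_to_utf16(euro_utf8)
print(euro_utf8.hex().upper(), "->", euro_utf16.hex().upper())

# The round trip must give back the original bytes.
print(utf16_to_utf8(euro_utf16) == euro_utf8)
```

The same idea applies to work files and XML documents stored as UTF-8: decode on the way in, encode on the way out.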
Surrogate characters in UTF16 require 4 bytes of storage and must be handled in pairs of U-characters. The EXAMINE statement has been enhanced to detect surrogates.

EXAMINE [FULL [VALUE [OF]]] {op1 | SUBSTR(op1,op2,op3)}
[POSITION-clause] [FOR] 
[CHARPOSITION op4] [CHARLENGTH op5]
[[GIVING] POSITION IN op6] [[GIVING] LENGTH IN op7] 
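
As a rough illustration of why surrogates have to be handled in pairs, the following Python sketch walks over the 16-bit units of a UTF-16 string and reports, for every character, its position and length in units (1-based). This is only an approximation of the idea behind character positions and lengths, not a description of what EXAMINE actually does. The names `code_units` and `char_spans` are made up.

```python
def code_units(text: str) -> list:
    """Return the 16-bit UTF-16 code units of a string as integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def char_spans(units: list) -> list:
    """Return (position, length) in code units for each character, 1-based."""
    spans = []
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        length = 2 if is_high and has_low else 1  # a surrogate pair spans 2 units
        spans.append((i + 1, length))
        i += length
    return spans


units = code_units("a\U0001D11Eb")
print([f"{u:04X}" for u in units])
print(char_spans(units))
```

A program that steps through the string one unit at a time would cut the middle character in half, which is exactly the kind of mistake to watch out for.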

With best regards

Thanks for answering! So Natural 6.2 supports UTF-8 …

http://techcommunity.softwareag.com/ecosystem/documentation/natural/nat621win/sm/move_0860.htm#syntax8_move
http://techcommunity.softwareag.com/ecosystem/documentation/natural/nat621win/unicode/uni-language.htm

BTW, I have to memorize the following:

Phew! A programmer can make really bad mistakes there.