Character encoding problem

Dear all,

I have a character encoding problem when using the Tamino API for Java. Below is the code I’m using:

-------------------------------------------------------------------------------
TConnectionFactory connectionFactory = TConnectionFactory.getInstance();
connectionA = connectionFactory.newConnection( databaseURI );
accessorA = connectionA.newXMLObjectAccessor(
TAccessLocation.newInstance( collection ) ,
TDOMObjectModel.getInstance() );
connectionA.setIsolationLevel(TIsolationLevel.UNPROTECTED) ;
connectionA.setLockwaitMode(TLockwaitMode.NO);

TXMLObject doc = TXMLObject.newInstance( TDOMObjectModel.getInstance() );
doc.readFrom( new InputStreamReader(new FileInputStream(filename), "big5") );
accessorA.insert( doc );
connectionA.close();
-------------------------------------------------------------------------------

Assume all the variables are defined correctly. My XML is as below:

-------------------------------------------------------------------------------
<?xml version="1.0" encoding="big5"?>

???
(many many chinese words omitted)

-------------------------------------------------------------------------------

When I query the inserted record, all the Chinese characters appear normally except the character “?”, which just becomes a “?” (square). In fact, there are other characters that are also encoded wrongly, but I can’t find them all.

But if the above XML is processed by the Tamino Interactive Interface, all the characters appear normally, with nothing wrong.

What’s wrong in the above Java program? Am I somehow using the wrong encoding?

Thanks a lot!!

Best regards,
Lun

Well, you describe the insert part of the program, which seems to be correct. But you have problems with the “query” part of the program, and that part is not in the description.
The API only returns UTF-8 documents, and part of your problem might be that you do not encode them back to big5, or that this step has errors.

Could you please provide the query and printout part of your program?

Hi, the query is just “/article[@ID=230486]”; attached please find the output of the Tamino Interactive Interface. What do you mean by “you do not encode it back to big5”? Do you mean in my query? Could you be more precise, for example about what code I have to include?
test.gif

“You do not encode it back to big5” refers to the processing in the Tamino API for Java. It means the following: when you query Tamino Server, it sends the document back in UTF-8, which the API converts to Java characters. If you then store this in a file or print it out, you need to tell Java which encoding to use. By the way, if you look at the Interactive Interface output in your posting, it also says that the returned document is in UTF-8.
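To illustrate the point about telling Java which encoding to use when writing text out, here is a minimal sketch. It is not part of the Tamino API; the class `ExplicitEncodingWriter` and the method `encode` are hypothetical names, and the sketch assumes a JDK of version 8 or later. It writes into a byte array so it can run anywhere; for a real file you would wrap a `FileOutputStream` instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class ExplicitEncodingWriter {
    // Hypothetical helper: turns Java characters into bytes using the named
    // charset, rather than whatever the platform default happens to be.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // four characters, one of them non-ASCII
        System.out.println(encode(text, "UTF-8").length);      // 5
        System.out.println(encode(text, "ISO-8859-1").length); // 4
    }
}
```

The same string produces different bytes depending on the charset, which is why the encoding has to be chosen explicitly at the point where characters become bytes.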

Now we have another suspicion here about the problem. Could it be that the characters which are displayed as little squares are not really valid big5 XML characters?

From what we can see, the character causing the error could be \u88cf, which we couldn’t find in our big5.xml file but did find in our EUC-KR test data. In a little test, this character could not be “encoded” into Big5, and we guess that is the reason why you see a little square.
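A test of this kind could look roughly like the sketch below. This is an assumption about how one might check it, not the test that was actually run; `EncodabilityCheck` and `canEncode` are hypothetical names. It relies on the standard `CharsetEncoder.canEncode` method, and the result for "Big5" or "MS950" depends on the JDK version and on whether those extended charsets are installed, so no output is claimed for them.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodabilityCheck {
    // Hypothetical helper: true if every character of the text can be
    // represented in the named charset.
    static boolean canEncode(String charsetName, String text) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String suspect = "\u88cf"; // the character suspected in this thread
        for (String name : new String[] {"Big5", "MS950", "UTF-8"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + canEncode(name, suspect));
            } else {
                // Extended charsets may be missing from a reduced runtime.
                System.out.println(name + ": charset not available in this JRE");
            }
        }
    }
}
```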

Thanks first. I think “encode it back to big5” is unrelated here, since I only use the Tamino Interactive Interface for the query, not the Java API, right?

Hm… I THINK the characters are valid big5, because if I use the Tamino Interactive Interface to upload the SAME document, the characters are inserted correctly and thus displayed correctly.

I’m afraid I don’t understand your test. So, is there any solution? Is it a Tamino Java API problem or a problem with my data? Why can the Tamino Interactive Interface insert the document correctly?

Many thanks!

We are really puzzled by this problem. I think we need to reproduce what you did with your data. Please post the whole XML document that causes the problem as a .zip file attached to your next posting.

Hi Christian,

Thank you very much for helping me investigate the problem!
big5.zip (1012 Bytes)

We can reproduce the behaviour you described. It works with Interactive Interface, but not with TaminoAPI4J.
We then reduced your sample until only reading from the file into a Java String and writing it back out remained. The problem is still there, even though TaminoAPI4J and Xerces are no longer involved.
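A reduced round-trip check along those lines might look like the following sketch. It is only an illustration under assumptions, not the actual reduced sample: `RoundTripCheck` and `survivesRoundTrip` are hypothetical names, and the bytes are built in memory here (for a real file they could come from `java.nio.file.Files.readAllBytes`).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    // Hypothetical helper: decodes the bytes into a String with the named
    // charset, encodes the String back, and compares with the original bytes.
    static boolean survivesRoundTrip(byte[] original, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        String decoded = new String(original, charset);   // bytes -> characters
        byte[] reencoded = decoded.getBytes(charset);     // characters -> bytes
        return Arrays.equals(original, reencoded);
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "\u88cf".getBytes(StandardCharsets.UTF_8);
        System.out.println(survivesRoundTrip(utf8Bytes, "UTF-8"));    // true
        System.out.println(survivesRoundTrip(utf8Bytes, "US-ASCII")); // false
    }
}
```

If the round trip fails for a file that is believed to be valid in the chosen charset, the loss happens in the Java charset conversion itself, before any XML parser or database API is involved.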

So I’m afraid you found a bug in Java itself here. Time to contact Javasoft, I guess.

We were able to reproduce this with JDK 1.3 and JDK 1.4.
Sorry that we can’t help you here.

Oh! Thanks for the reminder. Yes, InputStream is a known bug in the JDK; I should have known, as I saw it in the JSCP Exam book. Actually, I first used FileReader and discovered that all my inserted documents just contained “?”, so I switched to InputStreamReader with a “big5” encoding, and then I could see most of my characters.

The unbelievable thing is that my hard disk was totally destroyed yesterday. I got a new one and loaded W2K from our image server, which means the whole environment (and encoding) is the same. And now using FileReader works; everything works. I dare not say I did nothing wrong in my previous environment, but the code is the same (since my hard disk was totally lost, I downloaded big5.zip from my previous post and changed the code to use FileReader), and it works now. Programming is so … recondite sometimes.

doc.readFrom( new InputStreamReader(new FileInputStream(filename), "big5") );
-->
doc.readFrom( new FileReader( new File(filename) ) );

Thank you very much Christian!

Hi,
Could you tell me which bug you’re talking about? It would be helpful for me.
Also, I tried what you said about FileReader, and on my PC it does not work, mainly because my locale is not really compatible with a big5 file.
Thanks
Javi

Hi, please accept my apology first. I should not have used the words “known bug”, as I have never checked any official website about it. I just remember having seen an article, written for the JCSP Exam, about the InputStream and Reader classes. An InputStream reads bytes, while a Reader reads characters. As you may know, I am also puzzled by the character problem; otherwise I would not have needed to ask for help. But I remember the article stating that the InputStream class can’t read characters correctly, so we need to use the Reader class. I really forget where I found that article; when preparing for an exam you read a great many books and materials. Maybe it’s just a little problem rather than a bug.
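To make the bytes-versus-characters distinction concrete, here is a small sketch (again with hypothetical names, `BytesVersusChars` and `decode`, and assuming a JDK of version 8 or later). The same three bytes become one character or three depending on the charset the Reader is told to use. A plain `FileReader` constructed from just a file uses the platform default charset, which would explain why its results can differ from one machine or locale to another.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BytesVersusChars {
    // Hypothetical helper: reads all characters from the bytes, decoding
    // them with the named charset via an InputStreamReader.
    static String decode(byte[] bytes, String charsetName) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), Charset.forName(charsetName))) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "\u88cf".getBytes(StandardCharsets.UTF_8); // three bytes
        System.out.println(decode(bytes, "UTF-8").length());      // 1
        System.out.println(decode(bytes, "ISO-8859-1").length()); // 3
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```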

For your case, you may want to try

doc.readFrom( new InputStreamReader(new FileInputStream(filename), "MS950") );

instead of using "big5".

Please forgive my uncertainty; I also got this working by trial and error. :confused: