Xml Document with unresolvable entity reference

We tried several methods for loading the document below from a file into a .NET XmlDocument, and all of them failed when we do not have access to the DTD with the entity definition for “&sol;”.

<?xml version="1.0" encoding="UTF-8"?>

<onlaw.inhalt>abc &sol; done</onlaw.inhalt>

Storing the document in Tamino with a stream-based method works. On retrieval the content is:

<?xml version="1.0" encoding="UTF-8"?>

<onlaw.inhalt>abc done</onlaw.inhalt>

How does the .NET API for Tamino load the document? Can you provide us with sample code for loading this XML?


Thanks in advance,
erwin

Unfortunately that is XML for you.

If an XML parser encounters an undefined entity reference it is obligated to report a fatal error.

The streaming interface bypasses the .NET XML parser.
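The parser behaviour described above can be reproduced without Tamino. The following is only a minimal sketch; the class and method names are made up for illustration. It relies on `XmlDocument.LoadXml` throwing an `XmlException` when the text is not well-formed, which includes a reference to an entity that was never declared.

```csharp
using System.Xml;

public static class UndefinedEntityDemo
{
    // Hypothetical helper: returns true if the text can be loaded into an
    // XmlDocument, false if the parser rejects it with an XmlException.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            // Raised for, among other things, a reference to an undeclared entity.
            return false;
        }
    }
}
```

Calling `UndefinedEntityDemo.IsWellFormed` on the sample document with the undeclared `&sol;` reference should return false, while the same document with a plain “/” should load.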

How did you get the undefined entity reference into Tamino? Are you able to supply test code?

[This message was edited by Mark Kuschnir on 24 Oct 2003 at 10:16.]

The code below uses a stream-based method for XML document import.

TaminoCommand command = connection.CreateCommand("XCms");
FileStream fstream = new FileStream(fn, FileMode.Open);
TaminoDocument tdoc = new TaminoDocument(fstream, "text/xml");
tdoc.DocName = "test1";
tdoc.DocType = "onlaw.inhalt";
TaminoResponse response = command.Insert(tdoc);
fstream.Close();

Is this a valid way of importing data into Tamino? (It obviously works.) In any case, life would be much easier for us if we had some import methods that bypass XML parsing on our (client) side.

Kind regards,
erwin

Importing documents that are not well-formed into Tamino should produce the error: INOXPE8711: Document not well-formed.

You shouldn’t be able to import documents into Tamino with undefined entities for the same reason that the .NET parser doesn’t like them - because they are not XML.

Having looked at your code, it seems that your use of the API is not quite correct. The TaminoDocument constructor you are using is only for non-XML (binary) documents. This means that the Tamino server and the API are not really aware that you are passing documents that are not well-formed.

Mind you, I don’t know where the &sol; disappeared to.

The API should throw exceptions or produce error codes when used in an incorrect way. From our point of view it is a real restriction if the only correct way for importing has to use loaded XmlDocuments (large document problems, performance, …). It also complicates server development (“XmlDocument” is not serializable).

Anyway I modified the sample document to contain valid XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE onlaw.inhalt [ ]>
<onlaw.inhalt>abc &sol; done</onlaw.inhalt>

Then I loaded the document into an “XmlDocument” using the “XmlValidatingReader” with entity handling set to “ExpandCharEntities”. Storing the document back to the filesystem reproduces the correct content with the entity reference: “abc &sol; done”.
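The client-side half of this round trip can be sketched without Tamino. The sketch below makes two assumptions: that the internal DTD subset declares the entity as `<!ENTITY sol "SOL">` (the declaration is not shown in this post, so this is a guess), and that `XmlDocument.LoadXml` leaves general entity references in place as entity reference nodes instead of expanding them, which is my understanding of its default behaviour. The class and method names are hypothetical.

```csharp
using System.Xml;

public static class EntityRoundTrip
{
    // Hypothetical helper: loads the text into an XmlDocument and serializes
    // it again, so the caller can inspect whether "&sol;" survived.
    public static string LoadAndSerialize(string xml)
    {
        var doc = new XmlDocument();
        // The internal DTD subset is parsed here, so the entity must be declared.
        doc.LoadXml(xml);
        return doc.OuterXml;
    }
}
```

If the serialized text still contains `&sol;`, the reference is being dropped on the server side and not by the .NET loader.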

After Tamino insert/retrieve for the document the entity reference is lost. All of the methods I tried (Retrieve, RetrieveStream, RetrieveReader) produced “abc SOL done”.

How can I preserve the entity reference with Tamino?

Regards,
erwin

I tried this locally and it appears to work. Say you have a document of the form:

<doc>&ent;</doc>



I created a TaminoDocument and set the content type to “application/octet-stream”. I inserted and retrieved the document without a problem. However, I was storing and retrieving it as a binary document.

I think you should be able to reproduce the behaviour when working with the .NET API and the XML insert/retrieve methods.

Sorry for the confusion. I meant I tested it out locally with the .NET API against a Tamino 4.1.4.1.

I was using the .NET API with the Insert + Retrieve methods. It should work for you if you use “application/octet-stream” and NOT “text/xml”.

We need the document with all the XML-functionality in Tamino. When inserting with type “application/octet-stream” I think Tamino will not treat it as XML.

The code below is used for loading the XML with the .NET API.

XmlTextReader reader = new XmlTextReader(fn);
XmlValidatingReader vReader = new XmlValidatingReader(reader);
vReader.ValidationType = ValidationType.None;
vReader.EntityHandling = EntityHandling.ExpandCharEntities;

XmlDocument doc = new XmlDocument();
doc.Load(vReader);
reader.Close();

doc.Save("outFn");

The content of “outFn” is then “… abc &sol; done”. OK!
Then we insert the document into Tamino and retrieve it.

TaminoCommand command = connection.CreateCommand("XCms");

TaminoDocument tdoc = new TaminoDocument(doc);
tdoc.DocName = "test1";
tdoc.DocType = "onlaw.inhalt";
TaminoResponse response = command.Insert(tdoc);

TaminoDocument rDoc = command.Retrieve(new TaminoUri("onlaw.inhalt" + "/" + "test1"));
XmlDocument xmlDoc = rDoc.XmlDocument;

xmlDoc.Save("outFn1");

The content of “outFn1” is then “… abc SOL done”. PROBLEM!

Think about some type of editorial office where Tamino is used as XML storage. Editors retrieve documents from Tamino, modify them in some way (using an XML editor like Arbortext Epic) and store them back to the database. On retrieval they need the data with entity references, not with resolved entities.

Regards,
erwin

It is currently not possible to do this with Tamino just using text XML documents.

This is partly related to the way that XML is defined. It is also related to the fact that Tamino is not a file system but an XML document store. For example see this proposed enhancement to XML to permit XML entities to be defined without expansion: http://lists.xml.org/archives/xml-dev/200310/msg00566.html.

There are a couple of possible workarounds:

(i) you could store the documents in parallel - one in XML and another as binary
(ii) you could do preprocessing on the documents during storage and retrieval: storing &ent; as &-amp;ent;



Note that I had to add a ‘-’ in the ‘&-amp;’ definition as otherwise it just appeared as ‘&’ and they looked the same.
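Workaround (ii) could look roughly like the sketch below. It is plain text processing with regular expressions, not part of the Tamino API, and the class and method names are invented for illustration. It assumes the real escape is `&amp;` (without the ‘-’ that was only needed to display it here) and leaves character references and the five predefined entities alone.

```csharp
using System.Text.RegularExpressions;

public static class EntityEscaper
{
    // A general entity reference such as &ent;, excluding character references
    // (&#47;) and the predefined entities amp, lt, gt, apos and quot.
    private static readonly Regex EntityRef =
        new Regex(@"&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_:][\w.:-]*);");

    // The escaped form produced by Escape: &amp;ent;
    private static readonly Regex EscapedRef =
        new Regex(@"&amp;(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_:][\w.:-]*);");

    // Run before storing: &ent; becomes &amp;ent;
    public static string Escape(string xml)
    {
        return EntityRef.Replace(xml, "&amp;$1;");
    }

    // Run after retrieval: &amp;ent; becomes &ent; again
    public static string Unescape(string xml)
    {
        return EscapedRef.Replace(xml, "&$1;");
    }
}
```

This has limits: the patterns do not skip comments or CDATA sections, and text that originally contained a literal `&amp;ent;` would be turned into an entity reference on the way back. The stored text also holds the entity name and not its expansion, so the expanded content is not indexed.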

P.S. thanks to SI Trevor Ford for the help.

Our customers typically produce technical documentation for aircraft, ships or submarines. They have only just started working with XML and used SGML in the past. They work with DTDs and large numbers of entities and entity references. We are currently in the process of building a standard CMS solution for them.

From my point of view, the Tamino behaviour concerning this topic is a real deficiency. The proposed workarounds produce new problems (duplication of data, not indexed entity contents, …).

Tamino should preserve the entity reference information in the document (for later retrieval) but use the resolved entity data for indexing.

Currently it seems as if the entity handling is a showstopper for the use of Tamino in technical documentation environments.

Thanks,
erwin

An observation on this from a standards perspective. The XPath/XQuery data model does not permit unexpanded entity references. There are good technical reasons for this, though it does sometimes cause users problems (the problems are there in XSLT too). The circumvention that seems to work for many people is to store the documents with a processing instruction in place of the entity reference. This substitution can be done using a non-XML-aware editing process, e.g. using Perl. It does mean that you can’t search on the expanded text of the entity, but it also means that you can reconstitute the entity references easily when you want to do subsequent editing.
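As a rough illustration of that substitution, here is a sketch in C# instead of Perl, to match the rest of the thread. The processing instruction target `entity-ref` and the class and method names are made up; any target that does not clash with existing processing instructions would do.

```csharp
using System.Text.RegularExpressions;

public static class EntityPi
{
    // A general entity reference such as &ent;, excluding character references
    // and the predefined entities amp, lt, gt, apos and quot.
    private static readonly Regex EntityRef =
        new Regex(@"&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_:][\w.:-]*);");

    // The processing instruction written by ToPi: <?entity-ref ent?>
    private static readonly Regex Pi =
        new Regex(@"<\?entity-ref\s+([A-Za-z_:][\w.:-]*)\s*\?>");

    // Run before storing: &ent; becomes <?entity-ref ent?>
    public static string ToPi(string xml)
    {
        return EntityRef.Replace(xml, "<?entity-ref $1?>");
    }

    // Run after retrieval: <?entity-ref ent?> becomes &ent; again
    public static string FromPi(string xml)
    {
        return Pi.Replace(xml, "&$1;");
    }
}
```

One caveat with this sketch: it only works for references in element content. A processing instruction cannot appear inside an attribute value, so entity references in attributes would need different handling.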

I hope this helps.

Michael Kay