IS Document for Mixed Content

Hello,

I am trying to find a solution to my problem; any help is hugely appreciated:

I have an XML Schema complex type that allows our clients to specify an address like:

1202 Jane St Apt #1012 Buzzer #7762 Toronto ON Canada

My goal is to have the free text “1202 Jane St”, “Apt #1012”, “Buzzer #7762” in separate IS fields. The only way I have found so far is to add a *body field under the IS address document, but this field collects all the text fragments and holds a concatenated version: “1202 Jane St Apt #1012 Buzzer #7762”.

The structure can’t be changed, as it is a requirement from a stakeholder.

Thanks,

Ioan

I’m not sure there is an XML parser on the planet that won’t do the same thing. Of course, someone correct me if I’m wrong.

But, there may be a way to do it using XQL. Query for the source of the node to get the raw text. With that, you have a variety of options to parse out the street address.

Of course the “real” solution is to modify the XML to define Street1, Street2, Street3 instead to remove any ambiguity and not rely on unreliable delimiters but looks like your hands are tied? Mixed content is a poor practice.

It isn’t a “requirement” I’m sure, but rather a design decision that is now something of a backwards-compatibility constraint. Perhaps a way forward is to get them to add the new fields, populate them appropriately, and continue to plop in the text in mixed content mode. You’d get what you need from the structured fields, others would get it from the mixed content text and you’d ignore it in *body altogether. Eventually, once they’ve given everyone a chance to convert to the structured fields, they could eliminate the use of mixed content.

Yech. That’s ugly. Can you post the XSD that describes how this data is constrained?

That “requirement” comes from Infoway, a governmental organization in Canada in charge of implementing Electronic Healthcare Records using HL7. HL7 is a well-known standard, but Infoway has some weird requirements; I don’t know whether they exist for backward compatibility or are just poor design. It is very difficult to change these requirements; too many actors are involved in projects all over Canada.

Anyway, it is a valid XML construction even though I don’t like the design at all (yes, it is really ugly)! The XSD complex type for the address is quite ugly as well, but the constraint comes as a requirement (business rule) on top of the XSD constraints:

Begin Quote>

The supported address parts are “city”, “state” (province),
“postalCode” and “country”. Other part types are not permitted. All
other address information is sent as plain text, separated by
delimiter tags. There may be up to 4 lines of delimiter-separated
information in addition to the specified address parts. Both address
parts and delimiter-separated text are constrained to a length of 80
characters.

….
Example:
<addr use=“H PST”>Some CompanyApt A5, 123 Some Street
N.W.Edmonton, Alberta A1B
2C3Canada

<End Quote.

XSD fragment (some dependencies, such as ADXP, are not shown):
(the complete XSD is too long to be posted on this forum)

<?xml version="1.0" encoding="UTF-8"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
<xsd:complexType name="AD" mixed="true">
    <xsd:complexContent mixed="true">
        <xsd:extension base="ANY">
            <xsd:sequence>
                <xsd:choice minOccurs="0" maxOccurs="unbounded">
                    <xsd:element name="delimiter">
                        <xsd:complexType mixed="true">
                            <xsd:complexContent mixed="true">
                                <xsd:restriction base="ADXP">
                                    <xsd:attribute name="partType" type="AddressPartType" fixed="DEL"/>
                                </xsd:restriction>
                            </xsd:complexContent>
                        </xsd:complexType>
                    </xsd:element>
                    <xsd:element name="country">
                        <xsd:complexType mixed="true">
                            <xsd:complexContent mixed="true">
                                <xsd:restriction base="ADXP">
                                    <xsd:attribute name="partType" type="AddressPartType" fixed="CNT"/>
                                </xsd:restriction>
                            </xsd:complexContent>
                        </xsd:complexType>
                    </xsd:element>
                    <xsd:element name="state">
                        <xsd:complexType mixed="true">
                            <xsd:complexContent mixed="true">
                                <xsd:restriction base="ADXP">
                                    <xsd:attribute name="partType" type="AddressPartType" fixed="STA"/>
                                </xsd:restriction>
                            </xsd:complexContent>
                        </xsd:complexType>
                    </xsd:element>
                    <xsd:element name="county">
                        <xsd:complexType mixed="true">
                            <xsd:complexContent mixed="true">
                                <xsd:restriction base="ADXP">
                                    <xsd:attribute name="partType" type="AddressPartType" fixed="CPA"/>
                                </xsd:restriction>
                            </xsd:complexContent>
                        </xsd:complexType>
                    </xsd:element>
                    <xsd:element name="city">
                        <xsd:complexType mixed="true">
                            <xsd:complexContent mixed="true">
                                <xsd:restriction base="ADXP">
                                    <xsd:attribute name="partType" type="AddressPartType" fixed="CTY"/>
                                </xsd:restriction>
                            </xsd:complexContent>
                        </xsd:complexType>
                    </xsd:element>

                    ..................................
    </xsd:complexContent>
</xsd:complexType>

</xs:schema>

Java DOM can handle mixed content with multiple Text nodes. Thanks for the XQL tip; I have to see whether it is smart enough to bypass the IS Doc limitations.

package org.dom;
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class MixedContent {
    public static void main(String[] args) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();

            // Mixed content: free text interleaved with child elements.
            String address = "<address>" +
                    "  1202 Jane St <delimiter/>" +
                    "  Apt #1012 <delimiter/>" +
                    "  Buzzer #7762 " +
                    "  <city>Toronto</city> " +
                    "  <province>ON</province> " +
                    "  <country>Canada</country>" +
                    "</address>";

            ByteArrayInputStream is = new ByteArrayInputStream(address.getBytes());
            Document document = builder.parse(is);
            Element root = document.getDocumentElement();

            System.out.println("Root:" + root.getNodeName());

            // Each run of text between child elements is a separate Text node.
            NodeList list = root.getChildNodes();
            for (int i = 0; i < list.getLength(); i++) {
                Node nodeItem = list.item(i);
                if (nodeItem.getNodeType() == Node.TEXT_NODE) {
                    System.out.println(" Text Node [" + i + "]" + ((Text) nodeItem).getData());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Console after running this program:

Testing DOM …
Root:address
Text Node [0] 1202 Jane St
Text Node [2] Apt #1012
Text Node [4] Buzzer #7762
Text Node [6]
Text Node [8]

“Java DOM can handle mixed content with multiple Text nodes. Thanks for the XQL tip; I have to see whether it is smart enough to bypass the IS Doc limitations.”

Good point. Thanks for refreshing my memory on the text nodes! I mistakenly attributed this behavior to the parser, rather than to the IS conversion of the DOM to an IS doc.

The XML is definitely valid. It was not my intent to imply otherwise. It’s just that mixed content is trying to apply structure to what is essentially partly unstructured data–“here’s some fields that have names, and here’s some other stuff along with it but it doesn’t have a name of its own.” Such a structure can be error prone.

It appears in the example and the description that the text follows some rules, which is a Good Thing. But a better thing would have been to formalize those rules by creating additional elements to remove any ambiguity. Oh well.

IS uses DOM too. But as you pointed out it is taking the additional step of concatenating all text that occurs within the tag into the *body field.

The pub.xml:xmlNodeToDocument service has an option named mixedModel to change the behavior, but not in the way you’re looking for. Have you opened a service request with support yet? Perhaps they have a technique.

This also sounds like a good feature request–instead of concatenating all the text nodes together, have an option to create a string list.
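Until such an option exists, a Java service could build that string list itself from the DOM. The following is only a minimal sketch of the idea in plain Java: the class name MixedContentLines and its methods are made up for illustration, it uses the JDK's DOM parser rather than any IS API, and it assumes the free text sits directly under the address element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of webMethods: returns the free-text
// fragments of a mixed-content element as separate strings.
public class MixedContentLines {

    // Walks the direct children and keeps every non-blank Text node.
    public static List<String> textLines(Element parent) {
        List<String> lines = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    lines.add(text);
                }
            }
        }
        return lines;
    }

    // Parses an XML string and returns its root element.
    public static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<address>1202 Jane St <delimiter/>Apt #1012 <delimiter/>Buzzer #7762 "
                + "<city>Toronto</city> <province>ON</province> <country>Canada</country></address>";
        // prints [1202 Jane St, Apt #1012, Buzzer #7762]
        System.out.println(textLines(parse(xml)));
    }
}
```

Whitespace-only text nodes (the gaps between city, province and country) are dropped, so only the real address lines end up in the list.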

For the XQL query, you may be able to specify a query path that gets you to the individual text nodes, but I’m not sure. If you figure out the query syntax for that, please post the query–I’m quite interested to know the solution!

I am very familiar with Java and webMethods, and I find it easy to use either option. For IS Java Services, I wrote some Ant scripts that automatically synchronize Eclipse projects with the webMethods jars directory and the Java sources for Java flows (fragging/compiling). Usually I use Developer to create the IS Java service; Eclipse is set up to pick up new Java files automatically (using a Linked Source Directory), and from there on I write the rest of the code in Eclipse. Of course, you have to take care not to break the formatting. I can also debug using remote debugging: start the flow from Developer (IS has to be started with some extra flags) and then pass control to the Eclipse debugger, where you can go step by step.
One way to design Java Services is to write Java modules that do not depend on the Wm jars and use Java Services just to wrap those modules. This way you can test without depending on webMethods, using well-established tools such as JUnit, JMock, DynaMock, etc. You can also keep your Java code under SCM (Source Control Management).
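To illustrate the kind of module I mean, here is a rough sketch of a class with no webMethods imports at all. It checks the business rule quoted earlier (up to 4 free-text lines, each at most 80 characters). The class name AddressLineRules and its method are invented for this example; the IS Java Service wrapper that would call it is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java module: no dependency on the Wm jars, so it can
// be unit tested (e.g. with JUnit) outside Integration Server.
public class AddressLineRules {

    public static final int MAX_LINES = 4;   // from the quoted business rule
    public static final int MAX_LENGTH = 80; // from the quoted business rule

    // Returns a list of human-readable violations; empty means the lines are valid.
    public static List<String> violations(List<String> lines) {
        List<String> problems = new ArrayList<>();
        if (lines.size() > MAX_LINES) {
            problems.add("too many lines: " + lines.size());
        }
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).length() > MAX_LENGTH) {
                problems.add("line " + (i + 1) + " exceeds " + MAX_LENGTH + " characters");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1202 Jane St", "Apt #1012", "Buzzer #7762");
        // prints []
        System.out.println(violations(lines));
    }
}
```

The Java Service itself would then only read the inputs from the pipeline, call violations(), and write the result back, which keeps the part that needs IS to run as thin as possible.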

Is an IS Java Service more error prone than an IS Flow Service? Not if you do the right things (i.e. proper unit testing); it may depend on which one you are more familiar with.
One issue I have with Flow Services is that it is hard to merge two different versions, and SCM is a nightmare, especially when you have many different versions in Production and have to support all of them, branches, etc.

That said, I rarely choose Java Services for Production packages; I use them mostly at design time to automatically generate IS Doc Types, Flow Services for mappings, etc. If this gives me trouble, all I have to do is change the “generator” to create Java Services instead of Flow Services, but I have not had this problem yet.

You don’t know how much I would like to have this option. I am trying to push for a variant that allows just one line of free text instead of 4 with <delimiter/> between them; in that case *body would be enough.
mixedModel: true gives you *body; if it is false, you won’t get a value in *body. Not what I am looking for.

Regarding XQL options, there is not much you can do. Basically, //address/text() gives you exactly what *body gives you: the concatenated value of all text nodes. If you try to use source(), you need an xmlNode, not an IS Document, and then you somehow have to use the Java DOM parser or a custom parser to extract the text nodes. This solution is too complicated, as the address type is used in many places in the incoming document.
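For comparison, outside of IS the standard Java XPath API (javax.xml.xpath, which is not the IS XQL engine) does return the text nodes one by one when the result is requested as a node set. A small sketch, assuming an element named address as in my example; the class and method names are made up:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper using the JDK XPath API, not IS XQL.
public class AddressTextXPath {

    // Selects each non-blank text node directly under <address> separately.
    public static List<String> freeTextLines(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//address/text()[normalize-space()]",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            lines.add(nodes.item(i).getNodeValue().trim());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<address>1202 Jane St <delimiter/>Apt #1012 <delimiter/>Buzzer #7762 "
                + "<city>Toronto</city> <country>Canada</country></address>";
        // prints [1202 Jane St, Apt #1012, Buzzer #7762]
        System.out.println(freeTextLines(xml));
    }
}
```

The normalize-space() predicate filters out the whitespace-only text nodes between the named address parts. This still requires the raw XML string, so it does not remove the need to get at the source before IS builds the document.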

One ugly solution (for this ugly requirement) that I found is to run a string replace on the XML content before the initial parsing, replacing <delimiter with ${del}<delimiter. After parsing, *body contains something like “1202 Jane St ${del}Apt #1012 ${del}Buzzer #7762”. When I map to CanonicalDoc, I can tokenize *body using ${del} as the delimiter and get my separate lines. When I map back from CanonicalDoc, I have to do the reverse: put “1202 Jane St ${del}Apt #1012 ${del}Buzzer #7762” into *body and, after transforming the IS Document to a string, replace ${del} again so that the original markup is restored. I don’t like it, but it is the only solution that works so far.

I will open a service request with WM, but I am not sure they will have a solution any time soon!
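The string handling for this workaround can be sketched in plain Java as follows. The class and method names are made up, the parsing by IS in between is not shown, and what ${del} is replaced with on the way out is left as a parameter; passing <delimiter/> in the example is only an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for the ${del} marker workaround; not an IS API.
public class DelimiterMarker {

    static final String MARKER = "${del}";

    // Inbound, before parsing: put the marker in front of every delimiter tag.
    public static String addMarkers(String xml) {
        return xml.replace("<delimiter", MARKER + "<delimiter");
    }

    // After parsing: split the concatenated *body value into separate lines.
    public static List<String> splitBody(String body) {
        List<String> lines = new ArrayList<>();
        for (String part : body.split(Pattern.quote(MARKER))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    // Mapping back: rebuild the *body value from the separate lines.
    public static String joinBody(List<String> lines) {
        return String.join(MARKER, lines);
    }

    // Outbound, after serializing: swap the marker for the caller's replacement.
    public static String restoreMarkers(String xml, String replacement) {
        return xml.replace(MARKER, replacement);
    }

    public static void main(String[] args) {
        String body = "1202 Jane St ${del}Apt #1012 ${del}Buzzer #7762";
        List<String> lines = splitBody(body);
        // prints [1202 Jane St, Apt #1012, Buzzer #7762]
        System.out.println(lines);
        // Assumption: the marker is turned back into a delimiter tag.
        System.out.println(restoreMarkers("<address>" + joinBody(lines) + "</address>", "<delimiter/>"));
    }
}
```

Pattern.quote is needed in splitBody because ${del} contains regex metacharacters; String.replace already treats its arguments literally. The marker must of course be a string that can never occur in a real address.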