Large Document Handling

I am implementing large document handling and can’t acquire the document data. I’m sending documents to a service that calls wm.tn.doc:recognize. I can see the BizDocEnvelope data, but content is null, which I might(?) expect if the document is written to disk. I then pass the BizDocEnvelope to the bizdoc parameter of getContentPartData (as outlined in the wM documentation) with partName=xmldata and getAs=stream. However, I get an EXMLException at this point. How do I get a reference to this content so that I can extract it from disk?

A couple of checkpoints:

  1. Have you changed the TN config parameters as mentioned in the TN large document handling guide?

  2. Using the recognize and routeBizdoc services will persist the TN document.

  3. In TN Console, open the persisted document by double-clicking it; you should see the ‘Storage Type’ and ‘Storage Reference’ parameters filled in on the Content tab, indicating it is a large document.

  4. Once this is confirmed, use the getContentPartData service and provide the partName exactly as you see it listed in the TN Console -> Content window of the document. By the way, the content variable in the BizDocEnvelope will be null for a large document. You have mentioned the part name ‘xmldata’; please verify that by looking in TN Console, as it may be different. The ‘stream’ value is correct. (A minimal sketch of this call follows the list.)
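For what it’s worth, here is a minimal Java-service sketch of the call in step 4, using the Service.doInvoke API. The part name "xmldata" and the pipeline variable names are taken from the posts above and are assumptions to verify in TN Console, not a definitive implementation.

import com.wm.app.b2b.server.Service;
import com.wm.app.b2b.server.ServiceException;
import com.wm.data.*;
import com.wm.lang.ns.NSName;

// Sketch only: call wm.tn.doc:getContentPartData from a Java service.
public static final void getLargeDocPart(IData pipeline) throws ServiceException {
    IDataCursor pc = pipeline.getCursor();
    Object bizdoc = IDataUtil.get(pc, "bizdoc"); // BizDocEnvelope from recognize/route

    IData in = IDataFactory.create();
    IDataCursor ic = in.getCursor();
    IDataUtil.put(ic, "bizdoc", bizdoc);
    IDataUtil.put(ic, "partName", "xmldata"); // verify in TN Console; may differ
    IDataUtil.put(ic, "getAs", "stream");     // stream, not bytes, for large docs
    ic.destroy();

    try {
        IData out = Service.doInvoke(NSName.create("wm.tn.doc", "getContentPartData"), in);
        IDataCursor oc = out.getCursor();
        Object partContent = IDataUtil.get(oc, "partContent"); // a java.io.InputStream
        oc.destroy();
        IDataUtil.put(pc, "partContent", partContent);
    } catch (Exception e) {
        throw new ServiceException(e.toString());
    } finally {
        pc.destroy();
    }
}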

After all this, if it still doesn’t work, please post the error message so that we may get more hints.

  1. Yes.
  2. Check.
  3. Check.
  4. Check. getContentPartData still throws this error:

com.wm.app.tn.err.EXMLException: <exmlexception>
<errorcode></errorcode>
<info>wm.tn.doc:getContentPartData</info>
<originalexception>
<javaclass>com.wm.app.tn.err.EXMLException</javaclass>
<message><exmlexception>
<errorcode>TRNSERV.000026.000003</errorcode>

It looks like the ContentParts field, which holds all the pointers to the file, is missing. Look at the bizdoc in the pipeline and check whether the field bizdoc/ContentParts is empty or null. If it is populated, getContentPartData will work. (A quick check is sketched below.)
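If you want to check this programmatically, here is a small hedged sketch. It assumes the BizDocEnvelope is in the pipeline as ‘bizdoc’ and relies on the fact that it can be read as IData with a plain cursor.

import com.wm.app.b2b.server.ServiceException;
import com.wm.data.*;

// Sketch: fail fast if bizdoc/ContentParts is empty, before calling
// getContentPartData. Assumes "bizdoc" is already in the pipeline
// (e.g. after wm.tn.doc:recognize).
public static final void assertContentParts(IData pipeline) throws ServiceException {
    IDataCursor pc = pipeline.getCursor();
    IData bizdoc = IDataUtil.getIData(pc, "bizdoc");
    pc.destroy();

    Object contentParts = null;
    if (bizdoc != null) {
        IDataCursor bc = bizdoc.getCursor();
        contentParts = IDataUtil.get(bc, "ContentParts");
        bc.destroy();
    }
    if (contentParts == null) {
        // No pointers to the on-disk content: getContentPartData will throw.
        throw new ServiceException("bizdoc/ContentParts is null or missing");
    }
}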

Thanks. I went back over the documentation and there is a typo in the large document handling PDF. The directions reference both tn.tspace.location and tn.space.location (no ‘t’). I had used the latter initially. Made the change and it works now.

One additional question: does anybody know what garbage collection service is used, if any? The documentation says that document content remains on disk until it’s no longer being referenced and a garbage collection routine removes it. Just wondering what interval (if there is one) this garbage collection service runs on.

Found the answer on Advantage, if anybody’s interested.
[url=“http://advantage.webmethods.com/article?id=1610979279”]http://advantage.webmethods.com/article?id=1610979279[/url]

Of course we are interested!
Thanks for the information.

Hi Brian/Other webM Gurus
What do you do once you retrieve the large doc from the “tspace” using the
wm.tn.doc:getContentPartData service?

I mean do you do the following?
getContentPartData (which returns partContent)
stringToDocument (which returns a node object)
getNodeIterator (map the node object to the input of getNodeIterator)

We are doing the above steps and we get a java.lang.OutOfMemoryError.

Any suggestions are appreciated.

Thanks
Sathish.

Supply ‘stream’ for the getAs parameter of getContentPartData. I think I’ve seen in other wmusers posts that folks will loop through 5 MB chunks if docs are extremely large, to avoid an OutOfMemoryError, but I haven’t made it that far. (That pattern is sketched after the link below.)

[url=“http://www.wmusers.com/wmusers/messages/1825/890.shtml”]This wmusers.com thread mentions it.[/url]
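Here is a rough sketch of that chunked-read pattern, assuming partContent comes back as a plain java.io.InputStream. The 5 MB buffer is just the figure mentioned above, not a magic value; tune it to your heap.

import java.io.IOException;
import java.io.InputStream;

// Sketch: process the part content a fixed-size buffer at a time instead
// of materializing the whole document in memory.
public static void processInChunks(InputStream partContent) throws IOException {
    byte[] buf = new byte[5 * 1024 * 1024]; // 5 MB chunks
    int n;
    while ((n = partContent.read(buf)) != -1) {
        // process buf[0..n) here: parse, write, or forward this slice
    }
    partContent.close();
}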

You use the getContentPartData service to retrieve the large document as a stream object.
Once you have this, you can use any node-iterating service, such as convertToValues. That service, for example, will step through your file based on the top-level nodes you have defined in your flat file schema. This way you avoid reading the entire stream and getting an OutOfMemoryError. (The loop is sketched below.)
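To make the loop concrete, here is a hedged Java sketch against the 6.x WmFlatFile service pub.flatFile:convertToValues. The parameter names (ffData, ffSchema, iterate, ffIterator, ffValues) and the termination condition are from memory of the flat-file docs, and the schema name is hypothetical; verify all of them against your release.

import com.wm.app.b2b.server.Service;
import com.wm.app.b2b.server.ServiceException;
import com.wm.data.*;
import com.wm.lang.ns.NSName;

// Sketch of the iterate pattern: one top-level record per call,
// resuming via ffIterator, until the iterator comes back null.
public static final void iterateFlatFile(IData pipeline) throws ServiceException {
    IDataCursor pc = pipeline.getCursor();
    Object partContent = IDataUtil.get(pc, "partContent"); // stream from getContentPartData
    pc.destroy();

    Object ffIterator = null;
    boolean done = false;
    try {
        while (!done) {
            IData in = IDataFactory.create();
            IDataCursor ic = in.getCursor();
            if (ffIterator == null) {
                IDataUtil.put(ic, "ffData", partContent); // first call: hand over the stream
            } else {
                IDataUtil.put(ic, "ffIterator", ffIterator); // later calls: resume the iterator
            }
            IDataUtil.put(ic, "ffSchema", "myfolder:myFlatFileSchema"); // hypothetical schema name
            IDataUtil.put(ic, "iterate", "true"); // return one top-level record per call
            ic.destroy();

            IData out = Service.doInvoke(NSName.create("pub.flatFile", "convertToValues"), in);
            IDataCursor oc = out.getCursor();
            IData ffValues = IDataUtil.getIData(oc, "ffValues"); // one record's worth of data
            ffIterator = IDataUtil.get(oc, "ffIterator");
            oc.destroy();

            // process ffValues here (map, route, write out, ...)
            done = (ffIterator == null); // iterator is null after the last record
        }
    } catch (Exception e) {
        throw new ServiceException(e.toString());
    }
}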

Hi Bryan
What are you doing after the getContentPartData service?
It returns a ‘partContent’ stream object.
How are you processing it?

Thanks
S.

Manohar-
We have a scenario where we need to write the content to file. So I’ve written a Java service that accepts partContent (stream) and uses it to create a DataInputStream. I then feed this to a file output stream and loop over it in 1 MB increments. I think this is analogous to what you are saying; our end objective is different, so our means are unique to it. One thing I did notice was that the file balloons in size, though the content appears to be the same. I think it’s because of the carriage returns. I may be using the wrong IO object to write the data; still working on this. Thanks.
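For anyone following along, here is a sketch of that kind of copy service, assuming partContent is a java.io.InputStream. One guess about the ballooning: char-oriented writes (e.g. DataOutputStream.writeChars) emit two bytes per character, which roughly doubles a file, so writing raw bytes avoids it.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch: copy the partContent stream to disk in 1 MB increments.
// Writing raw bytes via OutputStream.write(buf, 0, n) preserves the
// original size; char-oriented writes will not.
public static void streamToFile(InputStream partContent, String path) throws IOException {
    OutputStream out = new BufferedOutputStream(new FileOutputStream(path));
    try {
        byte[] buf = new byte[1024 * 1024]; // 1 MB
        int n;
        while ((n = partContent.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    } finally {
        out.close();
        partContent.close();
    }
}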

Sathish-
If you have the WmTNSamples package, there is another example in the wm.tn.samples.largedoc:NodeIterator service. I think this package is distributed with the TN installation or upgrades.

My next big question is about the XML version tag. This requirement is a huge obstacle in implementing large document handling: page 12 or 13 of the PDF states (and I’ve also verified in testing) that any doc specified as large must have the XML declaration (<?xml version="1.0"?>) as its first tag. We are part of an online business exchange (for which we aren’t the hub) and this could be a huge deal to implement at the hub, as not every partner uses wM and it’s difficult for this change to be implemented exchange-wide. Do you know if wM has since provided a way around this requirement?

Hi Manohar
Where is the convertToValues service? I looked for it in ISBuiltinServices.pdf but couldn’t find it.
Thanks
Sathish.

There are two convertToValues services, one each in the WmFlatFile and WmEDI packages. Depending on which packages your company has bought, you will have these services available. These are the two services that typically handle large flat files, parsing the incoming file into IData.

Manohar
Thanks for the response, but the WmFlatFile package is only available in webMethods 6.01.

Currently I have the following flow:

getContentPartData (which returns partContent)
stringToDocument (convert partContent to a Node object)
getNodeIterator (map the Node object to the input of getNodeIterator)

My understanding is: stringToDocument is not resource intensive, but documentToRecord is resource intensive because it involves XML parsing.

We are using webMethods IS 4.6. So do you think using stringToDocument to convert partContent (a stream object) to *node (a Node object) is a good approach?

Thanks
Sathish.

Sathish - Your approach is right. Ideally, we would like to use documentToRecord so that in one pass we read the entire document into an IData record. However, this is not practical for a large document. Hence we use a node object over the incoming stream and an appropriate iterator keyed on some repeating XML node.
The two services to read the stream using iterators are
getNodeIterator and getNextNode. (A rough loop is sketched below.)
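A rough Java sketch of that loop on IS 4.6, invoking the pub.web services via doInvoke. The parameter names (node, criteria, iterator, next) and the repeating element are assumptions from memory; check ISBuiltinServices.pdf for your release before relying on them.

import com.wm.app.b2b.server.Service;
import com.wm.app.b2b.server.ServiceException;
import com.wm.data.*;
import com.wm.lang.ns.NSName;

// Sketch: iterate over a repeating element of a large XML node, pulling
// one small subtree at a time instead of parsing the whole document.
public static final void iterateNodes(IData pipeline) throws ServiceException {
    IDataCursor pc = pipeline.getCursor();
    Object node = IDataUtil.get(pc, "node"); // from stringToDocument
    pc.destroy();

    try {
        IData in = IDataFactory.create();
        IDataCursor ic = in.getCursor();
        IDataUtil.put(ic, "node", node);
        IDataUtil.put(ic, "criteria", new String[] { "LineItem" }); // hypothetical repeating element
        ic.destroy();

        IData out = Service.doInvoke(NSName.create("pub.web", "getNodeIterator"), in);
        IDataCursor oc = out.getCursor();
        Object iterator = IDataUtil.get(oc, "iterator");
        oc.destroy();

        while (true) {
            IData nin = IDataFactory.create();
            IDataCursor nc = nin.getCursor();
            IDataUtil.put(nc, "iterator", iterator);
            nc.destroy();

            IData nout = Service.doInvoke(NSName.create("pub.web", "getNextNode"), nin);
            IDataCursor noc = nout.getCursor();
            Object next = IDataUtil.get(noc, "next"); // null once the document is exhausted
            noc.destroy();
            if (next == null) break;

            // run documentToRecord on the small node in 'next' and process it here
        }
    } catch (Exception e) {
        throw new ServiceException(e.toString());
    }
}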