MS Word document conversion to Text

Does anyone have any experience converting an MS Word document to text format in the Integration Server? The document will be sent as an email attachment and picked up by the Integration Server via the Email port.

Let me know if you have any hints on how to do this conversion in Integration Server 4.6.

Cathy,

To the best of my knowledge, word processor document type conversion is not part of the webMethods offering.

I would recommend reassessing the requirements of your technical solution if at all possible. Even with a reliable conversion tool (which could, in theory, be invoked from IS), converting a Word document to text will inevitably give less-than-perfect results, especially for documents with complex formatting.

If your requirements allow, I would suggest the email you generate should include the original attachment. We’ve tried to do what you describe on our corporate web site and weren’t particularly happy with the results.

A good place to start looking for conversion tools would be the Microsoft website. A search for “word to text conversion tools” returned quite a few results.

I’d love to hear if anyone else has done this successfully in their own organization(s).

-jmh

I realize that webMethods does not support this out of the box. So, I am looking for a solution that can be implemented in Java. The word document I am trying to convert is very simple, but the requirements cannot be changed.

I have been looking into the Jacob project, but the Java SDK seems to be currently unsupported by Microsoft, so I am hesitant to use it. Another possibility is Jakarta POI, but its Word support looks to be in the early stages.

Let me know if anyone has any other ideas.

Hi Cathy,

If you have a tool that can be invoked via a command line interface, you always have the ability to obtain a reference to the JVM runtime and “exec” an OS-specific command. The tool does not need to be written in Java or provide a Java API if it can be invoked from the command line.

Now I’ve never tried this within a JVM running wM IS, but I don’t see why this isn’t possible. I’ve done it a million times in JVMs running app servers or client applications. A quick search on this newsgroup returned some sample java code which is doing just that:

http://www.wmusers.com/wmusers/messages/117/306.shtml

If you can install the conversion tool on your IS system, you can invoke a script which calls the converter and either writes the results to a file or back to the output stream.
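
As a rough sketch of that approach (not tested inside IS): the class below starts an external command and reads whatever it writes to its output stream. The class name ExternalConverter and the method runAndCapture are made up for illustration, and the converter command in the usage comment is only an assumption about what might be installed on the IS host. It uses ProcessBuilder, which needs a newer JVM; on an older JVM, Runtime.getRuntime().exec(String[]) is the equivalent entry point.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ExternalConverter {

    // Runs an external command and returns everything it wrote to stdout/stderr.
    // Throws IOException if the command exits with a non-zero code.
    public static String runAndCapture(String... command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single reader drains both pipes.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            // Read until the child closes its output, so it never blocks on a full pipe.
            while ((n = in.read(buffer)) != -1) {
                captured.write(buffer, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode + ": " + captured);
        }
        // Decodes with the platform default charset.
        return captured.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java ExternalConverter <command> [args...]");
            return;
        }
        // Hypothetical usage: java ExternalConverter <converter> /tmp/attachment.doc
        System.out.print(runAndCapture(args));
    }
}
```

Inside IS, the same method body could sit in a Java service, with the attachment first written to a temporary file and the returned string placed in the pipeline. Passing the command as separate arguments, rather than one concatenated string, avoids quoting problems with file names that contain spaces.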

If you can’t install the conversion tool on your IS system, you’ll probably have to do some additional coding or utilize a mechanism for making a remote procedure call.

Cathy,

Following on the previous response to your question, I know that VB calls can be made from the command line. If you program a macro within the document to parse it into text you should then be able to call the macro from the command line to convert the document. This might be useful if you can’t find a conversion utility.

Cathy,
http://jakarta.apache.org/poi/index.html

I'm currently using the HSSF library to produce Excel spreadsheets in I.S. You can find an API here to produce .doc files. You'll need to know Java.

Yes, it sounds a little odd to use I.S. to produce something as simple as an .xls or a .doc; yet some of our customers want this, at least to help them begin with their integration aims ...

I'll let people know how I find the POI library when this section of the code is pushed into production.

Nick

Cathy!

There are a lot of open source tools available for
extracting and processing word and other proprietary
formats.

One example is “catdoc” found on the following URL:

http://www.45.free.net/~vitus/ice/catdoc/

Most of them are written in C or C++, but they may
be used in WM using JNI (Java Native Interface).
All you have to do is write a little stub.

I have tested this in WM, and it works, but you may need
to use another Java VM, if the one shipped with WM
crashes when performing JNI.

The advantage of this technique is that it is quite
platform independent, as it will work on any platform
supporting a decent GCC (GNU Compiler Collection).

This may sound a little tricky, but give it a try, it will
open lots of new opportunities for you!

BR
/Erik

Hi Cathy,

The problem you have just described can be easily solved using Itemfield’s ContentMaster. ContentMaster is fully equipped to parse/serialize any binary/textual format (one of which is MS Word) into any other format.

Itemfield is a new webMethods partner. Its unique “Example-Based Parsing” technology offers an intuitive graphical development environment for easy creation of parser scripts.

ContentMaster is easily integrated into WM. See http://www.itemfield.com/solutions/sol_webmeth.shtml for more details.

Please feel free to contact me with any questions.

Meitav Harpaz