Extract text from MS Word Document

Hi,

I uploaded an MS Word document using Tamino 4.1.4 and the Tamino API for Java.

I already have the document in a collection, but I need to search on the text inside the document.

How can I do this? Is there a utility for it, or do I have to extract the text from the document and add it as another element in a schema?

Thanks in advance,

:wink:

Hi Maria,

You don’t say how you loaded the Word doc: as non-XML, or as XML (using the “Save as XML” option in Word 2003). I assume you loaded it as non-XML.

This is addressed in Tamino 4.2.1. It includes a non-XML indexer server extension that allows the loading of non-XML data, including Word docs. The server extension creates a shadow document into which the content (metadata) of the non-XML Word doc is extracted.

When you search the non-XML document, you are really searching the shadow document (an XML doc). The two documents act as one: all searches go against the shadow doc, and when you retrieve the Word doc (using ino:id or ino:docname), the non-XML (Word) document is returned.
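
To make the shadow-document idea concrete, here is a toy in-memory model in plain Java. It is only a sketch of the concept and does not use the Tamino API or reflect how the server extension is implemented. The class name ShadowStoreSketch, its methods, and the document names are all made up for illustration, and the text extraction step is assumed to have happened elsewhere.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the "shadow document" idea. Hypothetical names; NOT Tamino code.
final class ShadowStoreSketch {
    // Original non-XML bytes, keyed by a hypothetical document name.
    private final Map<String, byte[]> binaries = new LinkedHashMap<>();
    // Searchable text extracted from each binary (the "shadow").
    private final Map<String, String> shadows = new LinkedHashMap<>();

    // Store the binary together with text that some extractor already produced.
    void store(String docName, byte[] content, String extractedText) {
        binaries.put(docName, content.clone());
        shadows.put(docName, extractedText.toLowerCase(Locale.ROOT));
    }

    // Search looks only at the shadow text, never at the binary itself.
    List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : shadows.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    // Retrieval by name returns the original binary, not the shadow.
    byte[] fetch(String docName) {
        byte[] content = binaries.get(docName);
        return content == null ? null : content.clone();
    }

    public static void main(String[] args) {
        ShadowStoreSketch store = new ShadowStoreSketch();
        store.store("report.doc", new byte[] {1, 2, 3}, "Quarterly sales report");
        store.store("memo.doc", new byte[] {4, 5}, "Holiday schedule");
        for (String name : store.search("sales")) {
            System.out.println(name + " -> " + store.fetch(name).length + " bytes");
        }
    }
}
```

The point of the sketch is the split: search only touches the extracted text, while a lookup by name hands back the untouched Word file.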

I know I have not answered your question with regard to 4.1.4, but I wanted to let you know what is available, even if it is only in v4.2.1, in case you can upgrade.