Extract text from MS Word Document

Maria_Elena · March 22, 2005, 7:57pm

Hi,

I uploaded an MS Word document using Tamino 4.1.4 and the Tamino API for Java.

I already have the document in a collection, but I need to search on the text inside the document.

How can I do this? Is there a utility for it? Or do I have to extract the text of the document and add it as another element in a schema?

Thanks in advance,

Rob_Gibson · June 10, 2005, 6:31am

Hi Maria,

You don’t say whether you loaded the Word doc as non-XML or as XML (the “Save as XML” option in Word 2003). I assume you loaded it as non-XML.

This is resolved in Tamino 4.2.1. A non-XML indexer server extension is included with v4.2.1 that allows non-XML data, including Word docs, to be loaded. This server extension creates a shadow document into which the content (metadata) of the non-XML Word doc is extracted.

When you search the non-XML document you are really searching the shadow document (an XML doc). The two documents act as one. All searches go against the shadow doc, and when you then retrieve the Word doc (using ino:id or ino:docname), the non-XML (Word) document is returned.
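To make the idea concrete, here is a toy sketch in Java of how a shadow document pairs with the original. It is not the Tamino API and not how the server extension is implemented; the class `ShadowStore` and its methods `put`, `search` and `fetch` are hypothetical names invented for illustration, and the extracted text is simply passed in rather than pulled out of a real Word file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy in-memory model of the shadow-document idea. Hypothetical; not the Tamino API. */
public class ShadowStore {
    // Original non-XML bytes, keyed by a document name (stand-in for ino:docname).
    private final Map<String, byte[]> originals = new LinkedHashMap<>();
    // Extracted text, standing in for the XML shadow document.
    private final Map<String, String> shadows = new LinkedHashMap<>();

    /** Stores the original bytes together with the text extracted from them. */
    public void put(String docName, byte[] content, String extractedText) {
        originals.put(docName, content.clone());
        shadows.put(docName, extractedText);
    }

    /** Searches only the shadow text and returns the names of matching documents. */
    public List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : shadows.entrySet()) {
            if (entry.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    /** Returns a copy of the original non-XML bytes, or null if the name is unknown. */
    public byte[] fetch(String docName) {
        byte[] content = originals.get(docName);
        return content == null ? null : content.clone();
    }

    public static void main(String[] args) {
        ShadowStore store = new ShadowStore();
        store.put("report.doc", new byte[] {1, 2, 3}, "Quarterly sales report");
        store.put("memo.doc", new byte[] {4, 5}, "Holiday schedule");

        // The query runs against the shadow text; the hit is then fetched as bytes.
        for (String name : store.search("sales")) {
            System.out.println(name + " -> " + store.fetch(name).length + " bytes");
        }
    }
}
```

The point of the sketch is only the split: queries never touch the binary content, and retrieval by name never touches the extracted text.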

I know I have not answered your question with regard to 4.1.4, but I wanted to let you know what is available, even if it is only in v4.2.1, in case you can upgrade.
