This is my question:
Could I extract the documents' content directly using the Jakarta POI API? Have you done it?
The non-XML indexer internally uses Jakarta POI to extract meta-data (and not content) from MS documents. The meta-data is stored as an XML document with the same internal id (ino:id) as the non-XML document.
Hope this helps.
Software AG (UK) Ltd.
I’m trying the Tamino non-XML indexer, but I’m really only interested in what is “generated” by the indexer.
The nonXMLIndexer generates . For Excel and MS Word content, POI is used internally. Of course all formatting disappears, but you can run text queries, for example: “Find all Word documents containing ‘Tamino’”.
I only want the content, if possible, without using the indexer.
Then you have to write your own indexer (or content extractor). What do you want to do with that content?
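To make "your own indexer (or content extractor)" a bit more concrete, here is a minimal sketch in plain Java. It is not how the nonXMLIndexer works internally. The class SimpleTextIndex and the ContentExtractor interface are hypothetical names made up for this post. The extractor in main just treats the bytes as UTF-8 text; a real one for Word or Excel files would presumably call POI at that point.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: a tiny in-memory word index fed by a pluggable content extractor. */
public class SimpleTextIndex {

    /** Hypothetical hook: turns the raw bytes of a document into plain text. */
    public interface ContentExtractor {
        String extract(byte[] rawDocument);
    }

    // lower-cased word -> ids of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();
    private final ContentExtractor extractor;

    public SimpleTextIndex(ContentExtractor extractor) {
        this.extractor = extractor;
    }

    /** Extracts the text of one document and records every word under the given id. */
    public void add(String docId, byte[] rawDocument) {
        String text = extractor.extract(rawDocument).toLowerCase(Locale.ROOT);
        // Split on anything that is not a letter or a digit.
        for (String word : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of all documents containing the word (case-insensitive). */
    public Set<String> find(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        // Stand-in extractor (assumption): the "document" is already plain UTF-8 text.
        SimpleTextIndex index =
                new SimpleTextIndex(raw -> new String(raw, StandardCharsets.UTF_8));
        index.add("doc1", "Tamino stores XML".getBytes(StandardCharsets.UTF_8));
        index.add("doc2", "A plain Word file".getBytes(StandardCharsets.UTF_8));
        System.out.println(index.find("tamino")); // prints [doc1]
    }
}
```

This only covers single-word lookups held in memory. Storing and querying the extracted text inside Tamino is a separate step that the sketch leaves out.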
Full-text search.
That's what the nonXMLIndexer is designed for.
My problem is the metadata (XML) that gets added for the content; I only want to use one collection.
The nonXML indexer writes the metadata (XML for properties AND content) into the same collection (even in the same schema) as the document itself.