Storing MS Office and Adobe Acrobat Files

I’ve heard that it’s possible to store Office and PDF Files as XML in Tamino.
Is this true?
Are they converted automatically by Tamino, or is an external utility needed?
If it’s possible, how?



I am also racking my brains over this puzzle. Unfortunately, Tamino does not come with converters for file formats other than XML. Some Office components (Excel XP and Access XP) can export their data to XML, but this is not true of Word documents, which are what I am mostly interested in.

I am working on a prototype of a document management system. The idea is simple: store a search image in XML format together with the original documents, so that I can use all of Tamino’s query capabilities for semantic search.

There are two main challenges that keep me awake at night. They call for different types of converters.

1. Importing existing Word documents that are not well structured. I don’t think this can be done automatically. It would be great if a tool gave the end user (who may have no knowledge of XML) wizards and interfaces for mapping document items to schema elements and attributes.

2. Developing a new document format so that the transformation to XML can be performed automatically. I would expect such a tool to let me define templates that describe how documents should be converted.
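
To make the template idea in point 2 concrete, here is a minimal Python sketch. It assumes the document has already been reduced to (style name, text) pairs; the style names, the element names, and the `convert` helper are all hypothetical and not part of Tamino or Word.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: paragraph style name -> XML element name.
TEMPLATE = {
    "Title": "title",
    "Author": "author",
    "Body Text": "para",
}

def convert(paragraphs, template, root_name="document"):
    """Build an XML tree from (style, text) pairs using the template."""
    root = ET.Element(root_name)
    for style, text in paragraphs:
        # Unmapped styles fall back to a generic element so no text is lost.
        child = ET.SubElement(root, template.get(style, "para"))
        child.text = text
        if style not in template:
            # Record the original style so it can be mapped later.
            child.set("style", style)
    return root

# Assumed sample input, standing in for a parsed document.
paragraphs = [
    ("Title", "Annual Report"),
    ("Author", "J. Smith"),
    ("Body Text", "Sales grew."),
    ("Quote", "An unmapped style."),
]
xml_text = ET.tostring(convert(paragraphs, TEMPLATE), encoding="unicode")
```

Keeping the mapping in plain data means that supporting a new document type only requires a new template, not new conversion code.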

So I join JimC’s question. Does anyone have experience importing different file types into XML format? Maybe you could suggest some third-party converters?

You can use iFilter technology:

You can extract just the text, build an XML document from it, and store that in Tamino.

With the iFilterdump program you can process .doc, .xls, .pdf, .html, etc.
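
The "build an XML document" step could look roughly like the Python sketch below. It assumes the text has already been extracted into a string; the element names and the `build_search_image` function are made up for illustration, and loading the result into Tamino is left out.

```python
import xml.etree.ElementTree as ET

def build_search_image(source_name, text):
    """Wrap extracted plain text in a small XML document (hypothetical layout)."""
    root = ET.Element("document", {"source": source_name})
    body = ET.SubElement(root, "text")
    # Text extracted from binary formats may contain control characters
    # (form feeds, for example) that XML 1.0 does not allow; drop them.
    body.text = "".join(
        ch for ch in text if ch in "\t\n\r" or ord(ch) >= 0x20
    )
    # ElementTree escapes &, < and > when serializing.
    return ET.tostring(root, encoding="unicode")

# Assumed sample text, standing in for the extractor's output.
xml_doc = build_search_image("report.doc", "Profit & loss < 5%\x0c page 2")
```

Keeping the source file name in an attribute lets a query hit on the XML be traced back to the original document.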