Non-XML Indexer Processing of PDF Files: Guidelines

Hello all,

I have an enquiry.

I tried a few different PDF files to see how the NIXE processing behaves with each of them.

In some instances, the content of the documents was successfully extracted. In other instances, however, it was not.

Are there any guidelines from the experts about which types of PDF files are extractable and which are not?

In general, when a PDF file is created, there is the option of creating either a searchable or a non-searchable file. Is this a general concern for extraction?

It would be good to be able to educate customers about what level of usage to expect in general. Thanks.

Hi,

In the documentation (…\Tamino Non XML Indexer 4.1.4\Documentation\nixe\mapped-props.htm) you will find the following note:
Note:
The Tamino Non-XML Indexer supports the extraction of content and metadata from PDF files. However, in some circumstances, for example if the PDF file contains LZW compressed objects, no content information is extracted.


I hope this has helped you. :wink:

Regards, Ito

Hi Irwin,

LZW compression is one problem.
Another one, unfortunately not yet documented, exists with encrypted documents. For those you get no content and, normally, useless meta information. You can typically check whether a document is encrypted with the Acrobat Reader: load the document and look under File - Document Properties - Summary. There you will find an entry “Security”, which should be set to “None”.
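
If you have many files to check, a rough pre-check can be scripted instead of opening each one in the Acrobat Reader. The following is only a minimal sketch in Python and has nothing to do with how the Non-XML Indexer itself works: it scans the raw bytes of a file for the PDF keys `/Encrypt` (present in encrypted documents) and `/LZWDecode` (the filter name for LZW compressed streams). The function names are made up for this example, and the scan is a heuristic, so treat a hit as a hint rather than proof.

```python
import re

# Markers defined by the PDF file format (not by the indexer):
#   /Encrypt   - trailer key that points to the encryption dictionary
#   /LZWDecode - filter name used for LZW-compressed streams
ENCRYPT_MARKER = re.compile(rb"/Encrypt\b")
LZW_MARKER = re.compile(rb"/LZWDecode\b")


def pdf_extraction_warnings(data: bytes) -> list:
    """Return a list of hints why content extraction might fail.

    Hypothetical helper for illustration; works on the raw file bytes.
    """
    # A PDF header is expected near the start of the file.
    if b"%PDF-" not in data[:1024]:
        return ["not a PDF (missing %PDF- header)"]

    warnings = []
    if ENCRYPT_MARKER.search(data):
        warnings.append("encrypted (/Encrypt found)")
    if LZW_MARKER.search(data):
        warnings.append("LZW compressed objects (/LZWDecode found)")
    return warnings


def check_file(path: str) -> list:
    """Read a file from disk and run the heuristic check on it."""
    with open(path, "rb") as handle:
        return pdf_extraction_warnings(handle.read())
```

Calling `check_file("some.pdf")` returns an empty list when neither marker is found. An empty list does not guarantee that extraction will succeed, since there may be other causes than the two mentioned in this thread.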

Regards,
Michael