Non-XML Indexer Processing of PDF File Guidelines

Irwin · June 16, 2004, 1:35pm

Hello all,

I have an enquiry.

I tried a few different PDF files to see how the NIXE processing behaves with each.

In some instances, the content of the documents was successfully extracted to the blar blar. However, in other instances it was not.

Are there any guidelines from the experts about which types of PDF files are extractable and which are not?

In general, when a PDF file is created, there is the option of creating either a searchable or a non-searchable PDF file. Does this choice affect extraction in general?

It would be good to be able to educate customers about what level of usage to expect in general. Thanks.

Fernando_Ito1 · June 16, 2004, 9:26pm

Hi,

In the documentation (…\Tamino Non XML Indexer 4.1.4\Documentation\nixe\mapped-props.htm) you will find the following note:
Note:
The Tamino Non-XML Indexer supports the extraction of content and metadata from PDF files. However, in some circumstances, for example if the PDF file contains LZW compressed objects, no content information is extracted.

I hope this helps.

Regards, Ito

M_Gesmann · July 6, 2004, 2:42pm

Hi Irwin,

LZW compression is one problem.
Another one, unfortunately not yet documented, exists with encrypted documents. There you get no content and normally useless meta information. Typically you can check whether a document is encrypted with the Acrobat Reader: load the document and check under File - Document Properties - Summary. There you will find an entry “Security”, which should be set to “None”.
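
If there are many files to check, the two problem cases above (LZW compression and encryption) can be roughly pre-screened with a script before handing the files to the indexer. The following is only a minimal sketch and not part of Tamino or the Non-XML Indexer: it assumes that scanning the raw bytes of the file for the standard PDF names /Encrypt and /LZWDecode is good enough as a first check. The function name inspect_pdf_bytes is made up for illustration. The scan is a heuristic: it can miss names stored inside compressed object streams, and it can report a false hit if the same bytes happen to occur in binary data.

```python
import re
import sys

# Assumption: a plain byte scan is an acceptable first check.
# "/Encrypt" in the trailer marks an encrypted document;
# "/LZWDecode" (or the inline-image abbreviation "/LZW") marks LZW data.
ENCRYPT_RE = re.compile(rb"/Encrypt\b")
LZW_RE = re.compile(rb"/(?:LZWDecode|LZW)\b")


def inspect_pdf_bytes(data: bytes) -> dict:
    """Return rough indicators for a PDF given its raw bytes (hypothetical helper)."""
    return {
        # The %PDF- header should appear near the start of the file.
        "looks_like_pdf": b"%PDF-" in data[:1024],
        "encrypted": ENCRYPT_RE.search(data) is not None,
        "lzw": LZW_RE.search(data) is not None,
    }


if __name__ == "__main__":
    # Usage: python check_pdf.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, inspect_pdf_bytes(handle.read()))
```

A file reported with "encrypted" or "lzw" set to True is a candidate for the extraction problems described above. A file reported as False for both may still fail for other reasons, so the Acrobat Reader check remains the more reliable one.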

Regards,
Michael
