Tika / PDFBox alternative for parsing PDF in Solr (any version later than 1.4)

Question

It seems that Solr is not parsing my PDF files correctly. Is there an alternative to Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I get random spaces inside my content when using it. I confirmed the problem by running the PDF through PDFBox directly (latest version), which shows the same behavior.

Some commercial OCR programs, such as Omnifind, work on PDFs, but we cannot integrate them with Solr the same way, and buying one is not an option either.

+4

pdfbox solr apache-tika full-text-indexing document-conversion

Ravish bhagdev Nov 16 '11 at 9:14

3 answers

Xpdf contains pdftotext, which converts documents much better than Tika.
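
A minimal sketch of how pdftotext could be called from a script before the text is handed to Solr, assuming the `pdftotext` binary is on the PATH. The function names `build_pdftotext_command` and `extract_text` are hypothetical helpers, not part of Xpdf or Solr, and the Solr indexing step itself is left out.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the pdftotext argument list; "-" as the output file means stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        command.append("-layout")
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path: str, keep_layout: bool = False, timeout_s: float = 60.0) -> str:
    """Run pdftotext (assumed to be installed) and return the extracted text."""
    completed = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,         # raise CalledProcessError on a non-zero exit code
        timeout=timeout_s,  # give up on documents that take too long
    )
    return completed.stdout.decode("utf-8", errors="replace")
```

Comparing the output with and without `keep_layout` on a problem document is one way to see whether the spacing improves.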

+1

Okke klein Nov 16 '11 at 15:02

I use jpod as a backup library for extracting text from PDFs when PDFBox fails outright (freezes, crashes ...), so at least in some cases it works better than PDFBox for me.
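
The backup-library idea can be sketched independently of any particular PDF library. In the example below, `extract_with_fallback` and the extractor callables are hypothetical; each callable is assumed to take a file path and return the extracted text. The sketch covers crashes (exceptions) and empty output only. Handling a freeze would need the extractor to run in a separate process with a timeout.

```python
from typing import Callable, List, Sequence, Tuple

# An extractor takes a PDF path and returns the extracted text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    pdf_path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each (name, extractor) pair in order; return the first usable result."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # a crash in one library should not stop the run
            errors.append(f"{name}: {exc!r}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Returning the name of the extractor that succeeded makes it possible to log how often the backup is actually needed.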

+1

Persimmonium Nov 16 '11 at 15:05

Tom de leu · Accepted Answer · 2011-11-16T11:00:09+0000

As the answer to this SO question explains, this is caused by the nature of the PDF format itself: a PDF generally stores positioned glyphs rather than a stream of words, so an extractor has to infer word boundaries from the gaps between characters.

OCR tools may do a better job than PDFBox. There are free OCR options such as Tesseract and Ocropus, but I have no idea how well they work or how easily they can be integrated with Solr.
