Tika / PDFBox alternative for parsing PDFs in Solr (any version later than 1.4)

It seems that Solr is not parsing my PDF files correctly. Is there an alternative to Apache Tika (which, as far as I know, uses PDFBox internally) for parsing PDF files? With Tika I get random spaces inside the extracted content. I confirmed the problem by running the same PDF through PDFBox directly (latest version), which produces the same result.

Some commercial OCR programs, such as Omnifind, handle these PDFs, but we cannot integrate them with Solr in the same way, and buying one is not an option either.

3 answers

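As the answer to this SO question explains, this is due to the nature of the PDF format itself: a PDF stores positioned glyphs rather than words, so any extractor has to guess where the spaces belong.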

It's possible that OCR tools do a better job than PDFBox. There are some free ones, such as Tesseract and Ocropus, but I have no idea how well they work or whether they can be easily integrated with Solr.


Xpdf contains pdftotext, which converts documents much better than Tika.
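A minimal sketch of calling pdftotext from a script, assuming the `pdftotext` binary is installed and on the PATH. The function names `build_pdftotext_cmd` and `pdf_to_text` are hypothetical; the `-enc` and `-layout` options and the `-` output argument (write to standard output) are pdftotext's own. The returned text could then be sent to Solr as an ordinary text field instead of going through Tika.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, keep_layout=False):
    """Build the pdftotext command line; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, keep_layout=False, timeout=60):
    """Run pdftotext and return the extracted text.

    Raises CalledProcessError on a non-zero exit code and
    TimeoutExpired if the tool runs longer than `timeout` seconds.
    """
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Whether the default mode or `-layout` gives cleaner spacing depends on the document, so it is worth trying both on a problem file.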


I use jPod as a fallback library for extracting text from PDFs when PDFBox fails outright (freezes, crashes ...), so at least in some cases it works better than PDFBox for me.
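A rough sketch of that fallback idea, under the assumption that each extractor is wrapped as an external command that prints the text to standard output. The function name `extract_with_fallback` and the command lists are placeholders, not the actual command lines of PDFBox or jPod. Running each extractor as a separate process means a freeze can be cut off with a timeout and a crash cannot take the caller down.

```python
import subprocess


def extract_with_fallback(commands, timeout=30):
    """Try each extractor command in order and return (index, text)
    for the first one that succeeds.

    A command counts as failed if it cannot be started, exceeds the
    timeout (freeze), exits non-zero (crash), or prints no text.
    """
    for index, cmd in enumerate(commands):
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            continue  # frozen or missing extractor: try the next one
        text = result.stdout.decode("utf-8", errors="replace")
        if result.returncode == 0 and text.strip():
            return index, text
    raise RuntimeError("all extractors failed")
```

The returned index shows which extractor produced the text, which helps when tracking how often the primary one fails.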


Source: https://habr.com/ru/post/1381447/

