It seems that Solr is not parsing my PDF files correctly: random spaces appear in the extracted content. Is there an alternative to Apache Tika (which I assume uses PDFBox internally) for parsing PDF files? I isolated the problem by running the PDF through PDFBox directly (latest version), and it produces the same random spaces.
Some commercial OCR programs, such as Omnifind, work on PDFs, but we cannot integrate them with Solr in the same way, and buying one is not an option either.