Solr ExtractingRequestHandler gives empty content for PDF documents

I use ExtractingRequestHandler in Solr to extract the contents of a document and index it. It works great for all Microsoft Office documents, but the extracted content is empty for PDF files. I also tried extractOnly=true with curl, and that also returns an empty body.

I ran Tika myself on the same documents and the content was retrieved just fine. The difference is that when I use Tika directly, I use the BodyContentHandler that ships with Tika, instead of the SolrContentHandler that Solr uses. Has anyone seen this?

I'd rather have Solr handle this than use Tika myself to extract the content outside of Solr.

1 answer

I struggled with this problem for several hours before figuring it out: I was not opening my PDF files in binary mode, so they were submitted to Solr only up to the first EOF character in the file. Solr will still extract metadata from the file (since it appears in the PDF header), but will return an empty body tag in the response.
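
For illustration, here is a minimal Python sketch of sending the complete file, assuming the client reads the PDF itself before posting it. Reading with mode "rb" returns every byte, whereas text mode on some platforms can stop at an EOF byte (0x1A) or alter line endings. The URL, the core name "mycore", and the function names are assumptions made for this example, not something from the original post.

```python
import urllib.request

# Assumed endpoint: a local Solr with a hypothetical core named "mycore".
SOLR_EXTRACT_URL = (
    "http://localhost:8983/solr/mycore/update/extract?extractOnly=true"
)


def read_pdf_bytes(path):
    """Return the whole file as raw bytes; "rb" disables text-mode handling."""
    with open(path, "rb") as fh:
        return fh.read()


def build_extract_request(path, url=SOLR_EXTRACT_URL):
    """Build (but do not send) a POST request carrying the full PDF body."""
    return urllib.request.Request(
        url,
        data=read_pdf_bytes(path),
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def extract(path, url=SOLR_EXTRACT_URL):
    """Send the request and return Solr's response body as text."""
    with urllib.request.urlopen(build_extract_request(path, url)) as resp:
        return resp.read().decode("utf-8")
```

A quick sanity check before sending is to compare the length of the bytes read with the file size on disk; if they differ, the file is being truncated on the client side rather than by Solr.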

This may not apply to the original poster, but it may save someone else a lot of time.


Source: https://habr.com/ru/post/1726940/

