I use Solr's ExtractingRequestHandler to extract the contents of documents and index them. It works great for all Microsoft documents, but the extracted content is empty for PDF files. I also tried a request with extractOnly=true using curl, and that also returns an empty body.
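For anyone who wants to reproduce the extractOnly check without curl, here is a minimal sketch in Python using only the standard library. The base URL is an assumption (a local Solr on the default port; newer installs also need a core name in the path), and the helper names `build_extract_url` and `extract` are hypothetical, made up for this illustration. It sends the file as the raw POST body to `/update/extract`, which is roughly what `curl --data-binary` does.

```python
import urllib.parse
import urllib.request

# Assumption: a local Solr instance; adjust host, port, and core name as needed.
SOLR_BASE = "http://localhost:8983/solr"


def build_extract_url(base):
    """Build the ExtractingRequestHandler URL with extractOnly=true."""
    query = urllib.parse.urlencode({"extractOnly": "true"})
    return base.rstrip("/") + "/update/extract?" + query


def extract(path, base=SOLR_BASE, content_type="application/pdf"):
    """POST the raw file bytes to Solr and return the response body as text."""
    with open(path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        build_extract_url(base),
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract("sample.pdf")` (a hypothetical file name) against a running Solr should return the handler's response. In my case the body in that response comes back empty for PDFs.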
I ran Tika myself on the same documents and the content was retrieved just fine. The difference is that when I call Tika directly, I use the BodyContentHandler that ships with Tika, whereas Solr uses its own SolrContentHandler. Has anyone seen this?
I'd rather have Solr handle this than use Tika to extract the content outside of Solr myself.
aseem