Solr TikaEntityProcessor not working

I am trying to get Solr to index a database in which one column is the name of the PDF file I would like to index. My configuration looks like this:

<dataConfig>
 <dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/>
 <dataSource name="ds-file" type="BinFileDataSource"/>
 <document name="documents">
   <entity name="document" dataSource="ds-db" query="select * from documents">
     <entity processor="TikaEntityProcessor" url="/some/path/${document.filename}" dataSource="ds-file" format="text">
       <field column="text" />
     </entity>
   </entity>
 </document>
</dataConfig>

I am using Solr from trunk (as of last week). The import process completes without errors and extracts the columns from the database, but not the contents of the PDF file. It is definitely trying to access the PDF file, because if I give it a wrong path, it complains. However, it does not seem to be parsing the PDF, since the import completes after about 40 ms, whereas importing the PDF through ExtractingRequestHandler takes about 11 seconds.
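One way to narrow this down might be to take the database out of the picture and point TikaEntityProcessor at a single file. The following is only a sketch under assumptions: the file name sample.pdf and the entity name tika-test are hypothetical, and the schema is assumed to have a field called text.

```xml
<dataConfig>
  <!-- Binary file source only; no JDBC data source involved -->
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document>
    <!-- sample.pdf is a hypothetical file name used for illustration -->
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="/some/path/sample.pdf" dataSource="ds-file" format="text">
      <!-- Map Tika's extracted body text to the (assumed) schema field "text" -->
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>
```

If this standalone import also finishes in a few milliseconds and leaves the text field empty, the problem would seem to lie in the Tika step itself rather than in the nested entity or the database query.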

I am new to using Tika with DIH, so I am probably missing something obvious. Any ideas?

This is with Java 1.6.0_20 on OS X 10.6.3.

(I have also posted this to the solr-user mailing list.)


I got an answer on the solr-user mailing list: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p867572.html

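Apparently, the issue comes down to the Apache Tika version: 0.6 works, whereas, for some reason, the 0.8 version that ships with Solr does not. Downloading Tika 0.6 (from http://archive.apache.org/dist/lucene/tika/) and using tika-core-0.6.jar and tika-parsers-0.6.jar in its place solved the problem.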


Source: https://habr.com/ru/post/1748049/

