Getting started with Solr

I'm trying to get started with Apache Solr, but some things are not clear to me. Following the tutorial, I set up a running instance of Solr. What confuses me is that the entire Solr configuration (schemas, etc.) is in XML format. When the tutorial adds sample data, it shows how to add XML documents ( java -jar post.jar solr.xml monitor.xml ). Is this just a poor choice of sample format? I mean, are they uploading data that describes documents, or are the actual documents they add the .xml files themselves?

I am trying to add some books in .txt format. If I run java -jar post.jar mydoc.txt , does that add it? How can I add this document together with metadata about it (author, title)?

I also tried to create a simple HTML page to send documents to Solr:

    <html>
      <head></head>
      <body>
        <form action="http://localhost:8983/solr/update?commit=true" enctype="multipart/form-data" method="post">
          <input type="file">
          <input type="submit" value="Send">
        </form>
      </body>
    </html>

When I submit a file, I get this response:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">26</int>
      </lst>
    </response>

Is this right? Does this mean that I have successfully added the file? If so, I should be able to search for one of the words in the file, for example "montagna" (it is an Italian book; montagna means mountain). If I open the URL

 http://localhost:8983/solr/select/?q=montagna&start=0&rows=10&indent=on 

I expect something to be returned (maybe all the text or some information about the file), but this is what I get:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
          <str name="indent">on</str>
          <str name="start">0</str>
          <str name="q">montagna</str>
          <str name="rows">10</str>
        </lst>
      </lst>
      <result name="response" numFound="0" start="0"/>
    </response>

That does not look like a match to me. Also, according to this answer , I should be able to return the text surrounding the match with hl.fragsize . How do I integrate this into the search query? Thank you.

2 answers

The Solr example adds documents to the index through XML messages. Take a look here . The *.xml files you mentioned are simply such XML messages stored on the file system. They look like this:

    <add>
      <doc>
        <field name="id">UTF8TEST</field>
        <field name="name">Test with some UTF-8 encoded characters</field>
        <field name="manu">Apache Software Foundation</field>
        <field name="cat">software</field>
        <field name="cat">search</field>
        <field name="features">No accents here</field>
        <field name="price">0</field>
        <!-- no popularity, get the default from schema.xml -->
        <field name="inStock">true</field>
      </doc>
    </add>

This is just one way to submit a document for indexing. Each document contains one or more fields. There are other ways to add documents to Solr (for example, it also accepts the CSV format ), but XML is currently the most common.

I think you are not actually indexing anything. You can check with this query: http://localhost:8983/solr/select/?q=*:* , which retrieves all the documents in your index. Another common mistake is forgetting to commit, but I see that you added the commit=true parameter to your URL, so that is not the problem in your case.
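The check described above can also be scripted. The following is a minimal sketch, assuming Solr is reachable at http://localhost:8983/solr as in the question; the helper names select_url and num_found are made up for illustration. It builds the match-all URL and reads numFound from an XML response like the one the question shows.

```python
import urllib.parse
import xml.etree.ElementTree as ET


def select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a /select URL for the given query (hypothetical helper)."""
    params = urllib.parse.urlencode({"q": query, "start": 0, "rows": rows})
    return f"{base_url}/select/?{params}"


def num_found(response_xml: str) -> int:
    """Read the numFound attribute from a Solr XML response."""
    root = ET.fromstring(response_xml)
    result = root.find("result")
    if result is None:
        raise ValueError("no <result> element in response")
    return int(result.attrib["numFound"])


# Response body copied from the question: an index with no matching documents.
SAMPLE_RESPONSE = (
    '<response><lst name="responseHeader"><int name="status">0</int>'
    '<int name="QTime">1</int></lst>'
    '<result name="response" numFound="0" start="0"/></response>'
)

if __name__ == "__main__":
    # The URL is only printed here; fetch it with urllib.request.urlopen
    # when a Solr instance is actually running.
    print(select_url("http://localhost:8983/solr", "*:*"))
    print(num_found(SAMPLE_RESPONSE))
```

If num_found stays at zero for the *:* query, nothing has been indexed, regardless of the status 0 in the update response.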

If you want to index only the contents of a text file, you can, for example, define your schema with two fields:

  • filename
  • content

and use this message to index your document:

    <add>
      <doc>
        <field name="filename">test.txt</field>
        <field name="content">Test with some UTF-8 encoded characters</field>
      </doc>
    </add>
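A message like this does not have to be written by hand. Below is a minimal sketch, assuming a schema with the filename and content fields described above and the update URL from the question; the function names build_add_message and post_update are hypothetical. It serializes one document and posts it to the update handler.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(fields: dict) -> bytes:
    """Serialize one document as an <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="utf-8")


def post_update(message: bytes, update_url: str) -> bytes:
    """POST the message to Solr and return the raw response body."""
    request = urllib.request.Request(
        update_url,
        data=message,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    message = build_add_message(
        {"filename": "test.txt", "content": "Test with some UTF-8 encoded characters"}
    )
    print(message.decode("utf-8"))
    # With a running Solr instance (URL assumed from the question):
    # post_update(message, "http://localhost:8983/solr/update?commit=true")
```

For a real book, the content value would be the text read from the .txt file, and extra fields such as author or title could be added the same way once the schema defines them.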

Understand the terminology:

    Document in Solr    -> Row in an RDBMS
    Field of a document -> Column (cell) of that row

A Solr core, in turn, is like both a database and one giant table, populated in a (potentially) sparse way.

For your (specific) use case, you must create one document per file, consisting of an identifier, the file contents, and so on.


XML is one way to express Solr operations: http://wiki.apache.org/solr/UpdateXmlMessages

It supports the add, delete, commit and optimize operations. An add operation includes one or more documents.

    <add>
      <doc>
        <field name="employeeId">05991</field>
        <field name="office">Bridgewater</field>
        <field name="skills">Perl</field>
        <field name="skills">Java</field>
      </doc>
      [<doc> ... </doc>[<doc> ... </doc>]]
    </add>

There are also CSV (add only), JSON (full functionality), and DIH (the DataImportHandler, for scheduled imports from databases).

There is also the extracting request handler, which can extract content (and metadata) from all kinds of rich documents (DOC, DOCX, PDF). In addition, it accepts literal.<fieldname> parameters for setting your own field values.
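As a rough illustration of the literal parameters, here is a sketch that builds an extraction URL carrying author and title metadata. It assumes the handler is mounted at /update/extract (the path used in Solr's example configuration; yours may differ) and that the schema defines id, author and title fields. The function name and all sample values are hypothetical.

```python
import urllib.parse
import urllib.request


def extract_url(base_url: str, doc_id: str, metadata: dict, commit: bool = True) -> str:
    """Build an extraction URL with literal.<field> metadata (hypothetical helper)."""
    params = {"literal.id": doc_id}
    for field, value in metadata.items():
        params[f"literal.{field}"] = value
    if commit:
        params["commit"] = "true"
    # Assumption: the extracting handler is registered at /update/extract.
    return f"{base_url}/update/extract?{urllib.parse.urlencode(params)}"


def post_file(url: str, path: str, content_type: str = "text/plain") -> bytes:
    """Send the raw file as the request body and return Solr's response."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    url = extract_url(
        "http://localhost:8983/solr",
        "book-1",
        {"author": "Example Author", "title": "Example Title"},
    )
    print(url)
    # With a running Solr instance and a real file:
    # post_file(url, "mydoc.txt")
```

Sending the file as the raw request body keeps the sketch short; a multipart upload would also need a named file input, which the HTML form in the question does not have.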


The extracting request handler stores its output in the text field. The q= query parser and the highlighter both assume the default field, text (yes, this applies to what you did). You can specify the fields for both of them, as well as the fields Solr returns to you in the results.
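To connect this with the hl.fragsize question, the sketch below builds a search URL that names the default field, the returned fields and the highlighted field explicitly. The parameters q, df, fl, hl, hl.fl and hl.fragsize are standard Solr request parameters; the function name and the default values are assumptions for illustration, and highlighting generally needs the highlighted field to be stored.

```python
import urllib.parse


def highlight_search_url(base_url: str, term: str, field: str = "text",
                         returned_fields: str = "*", fragsize: int = 100,
                         rows: int = 10) -> str:
    """Build a /select URL with highlighting enabled (hypothetical helper)."""
    params = {
        "q": term,
        "df": field,                # default field searched by the query parser
        "fl": returned_fields,      # fields returned for each matching document
        "hl": "true",               # turn highlighting on
        "hl.fl": field,             # field to take highlighted snippets from
        "hl.fragsize": fragsize,    # approximate snippet size in characters
        "start": 0,
        "rows": rows,
        "indent": "on",
    }
    return f"{base_url}/select/?{urllib.parse.urlencode(params)}"


if __name__ == "__main__":
    print(highlight_search_url("http://localhost:8983/solr", "montagna", fragsize=200))
```

The snippets come back in a separate highlighting section of the response, alongside the normal result list.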


Source: https://habr.com/ru/post/907753/

