Solr search by part number

I'm having some difficulty figuring out how to build a Solr request, or how to configure the schema, to improve search in our online store.

First configuration (Solr 4.2.1)

 <field name="mfgpartno" type="text_en_splitting_tight" indexed="true" stored="true" />
 <field name="mfgpartno_sort" type="string" indexed="true" stored="false" />
 <field name="mfgpartno_search" type="sku_partial" indexed="true" stored="true" />
 <copyField source="mfgpartno" dest="mfgpartno_sort" />
 <copyField source="mfgpartno" dest="mfgpartno_search" />

 <fieldType name="sku_partial" class="solr.TextField" omitTermFreqAndPositions="true">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1"
             generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="1"
             splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.TrimFilterFactory" />
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
     <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="100" side="front" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.TrimFilterFactory" />
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
   </analyzer>
 </fieldType>

Let me break this down into several stages (the earlier stages are only there to reproduce the problem; they do not use edismax, which is what we actually decided to use on our website):

  • q=DV\-5PBRP <- this query returns 18 results, but not the one I'm looking for (it most likely falls back to the default search field df, productname - fine)
  • q=mfgpartno_search:DV\-5PBRP <- this returns the 1 result I'm looking for, but because of how queries are built on the website it would be much easier if I could use just the q parameter, as in step 1.
  • q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search <- this also returns the 1 result I'm looking for, but again, the website search needs qf to cover more fields. Since more precise searches require searching across more fields (the actual qf is: productname_search shortdesc_search fulldesc_search mfgpartno_search productname shortdesc fulldesc keywords), I tried step 4.
  • q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search&q.op=AND <- with this test I get 0 results - even though q.op=AND works great for most other queries on our site.
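The q=DV\-5PBRP queries above backslash-escape the dash so the Lucene query parser treats it literally. As a minimal sketch (not code from the question), this is roughly how one could escape a part number before building such a query; the class name and the exact set of special characters are my assumptions:

```java
public class LuceneEscape {
    // Characters with special meaning in the Lucene query syntax; backslash-escape
    // them so a part number like DV-5PBRP is searched literally, matching the
    // q=DV\-5PBRP form used in the stages above.
    private static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\/";

    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("DV-5PBRP")); // DV\-5PBRP
    }
}
```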

My biggest search problem is special characters like dashes, which sometimes have to be treated literally and sometimes act as delimiters, as in product names or descriptions. People sometimes even drop the dash, or replace it with a space, when searching for a part number, and the search should still return relevant results.

I've gone around in circles on how to handle these special characters, especially in the mfgpartno_search field. How can I tweak the schema or the query (or both) to make this work?

+6
3 answers

Well, I think I've figured out what the problem was.

I assumed (based on my configuration) that a part number would be indexed roughly as follows:

 DV-5PBRP -> {DV 5PBRP, DV5PBRP, DV-5PBRP} + NGrams 

I also assumed that searching for "DV-5PBRP" (with a literal dash) would match that third variant (using a query like #4 in my question).

Yesterday the same user reported this problem to me again, and I thought: let's try removing the delimiter from the search. So the query became:

 q=DV5PBRP&defType=edismax&qf=mfgpartno_search&q.op=AND 

I got the result I was looking for, which means my Solr configuration does at least index the second variant (DV5PBRP).

So now I strip delimiter characters from the user's input before submitting the search to Solr. It seems to work great!
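A minimal sketch of that stripping step (the class name and the exact delimiter set are assumptions; the answer doesn't show code):

```java
public class PartNumberNormalizer {
    // Strips common delimiter characters (dash, slash, dot, space, underscore)
    // from user input before the part-number search is sent to Solr, so that
    // "DV-5PBRP" and "DV 5PBRP" both become "DV5PBRP".
    static String stripDelimiters(String input) {
        return input.replaceAll("[-/. _]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDelimiters("DV-5PBRP")); // DV5PBRP
        System.out.println(stripDelimiters("DV 5PBRP")); // DV5PBRP
    }
}
```

This only works if the index contains a concatenated variant of the part number (which catenateAll="1" in the schema above produces).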

0

Perhaps you could try the pattern tokenizer (solr.PatternTokenizerFactory) and write a regular expression suited to your part numbers. Lucene (on which Solr is built) is very much geared toward tokenizing prose.

What you probably want here is an N-gram split that also includes 1-grams, possibly with dashes replaced by spaces, something like:

 DV-5PBRP -> {DV 5PBRP, DV, 5P, BR, PB, RP, D, V, 5, P, B, R}

As you can see, the index will be quite large even for very small fields. Make sure result ranking weights the larger n-grams heavily.
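To get a feel for how fast the index grows, here is a rough simulation of what an n-gram filter produces at index time (a sketch, not Solr's actual implementation):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class NGramDemo {
    // Generates all n-grams of the given term, from minSize up to the full term
    // length, roughly what solr.NGramFilterFactory emits for a single token.
    static Set<String> ngrams(String term, int minSize) {
        Set<String> grams = new LinkedHashSet<>();
        for (int n = minSize; n <= term.length(); n++) {
            for (int i = 0; i + n <= term.length(); i++) {
                grams.add(term.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With 1-grams included, even a 7-character SKU yields 27 distinct tokens.
        System.out.println(ngrams("dv5pbrp", 1).size()); // 27
    }
}
```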

I also really think you should remove the stop word list for the part number field.

The minimum N-gram size should be 1 or 2.

Just make sure your analyzers do not:

  • strip the dash
  • delete single characters (these often appear in stop-word lists)
  • delete numbers
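Putting that checklist together, a minimal sketch of such a field type might look like this (the field type name and gram sizes are my assumptions, not tested configuration):

```
 <fieldType name="sku_ngram" class="solr.TextField" omitTermFreqAndPositions="true">
   <analyzer type="index">
     <!-- WhitespaceTokenizer keeps the dash intact -->
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- no StopFilterFactory: single letters and numbers must survive -->
     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="100"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
```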
+1

If you use the HTTP GET method, encode the query before sending it:

 URLEncoder.encode(searchWord,"UTF-8") 

That's for Java; if you're not using Java, use your language's equivalent encoding function. This will help you avoid problems with characters like spaces and "/".
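For example (a small wrapper around the call above; the class name is mine):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
    // URL-encodes a search term for use in a GET request to Solr.
    static String encode(String term) {
        try {
            return URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // Spaces become '+' and '/' becomes %2F, so both survive the GET request.
        System.out.println(encode("DV 5PBRP/2")); // DV+5PBRP%2F2
    }
}
```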

0

Source: https://habr.com/ru/post/986325/

