Hyphen / dash problem in Solr / Lucene

I'm trying to get Solr to extract only the second, 7-digit part of a ticket number formatted as n-nnnnnnn.

Initially, I was hoping to keep the full ticket together. According to the documentation, numbers joined to numbers should be kept together, but after hammering on this problem for a while and looking at the code, I don't think that is the case. Solr always generates two terms. So, rather than getting a large number of matches on the first digit n, I think I can get the best query results from the second part alone. Substituting A for the dash:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b\d[A](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" maxBlockChars="20000"/> 

will parse 1A1234567 fine. But the same charFilter with a hyphen in place of the A in the pattern will not parse 1-1234567.

So this seems to be a hyphen problem. I tried \- (escaped), [-], \u002D, \x{45} and \x045, all with no success.

I also tried putting mapping char filters around it:

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b\d[-](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" maxBlockChars="20000"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping2.txt"/>

with mappings:

"-" => "z"

and then

"z" => "-"

It looks to me as though the hyphen is eaten by the Flex tokenizer and is never even available to the char filter.

Has anyone had more success with a hyphen / dash in Solr / Lucene? Thanks

1 answer

If your Solr uses a recent Lucene (3.x or later, I think), you will want to use ClassicAnalyzer rather than StandardAnalyzer, since StandardAnalyzer now always treats hyphens as a delimiter.
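
As a rough sketch of what that could look like in schema.xml, assuming the advice is applied through the tokenizer factory (solr.ClassicTokenizerFactory) rather than the analyzer class itself: the field type name text_ticket is hypothetical, and the lowercase filter is only there to make the chain look like a typical text field.

```xml
<!-- Hypothetical field type; the name "text_ticket" is an assumption, not from the question. -->
<fieldType name="text_ticket" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Classic tokenizer: the pre-3.1 StandardTokenizer behavior, kept under a new name. -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The classic tokenizer is documented as keeping a hyphenated token whole when it contains a number, so 1-1234567 would be expected to come through as a single term. It is worth confirming this on the Solr analysis page for your version before reindexing.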


Source: https://habr.com/ru/post/1447572/

