Solr 3.6.1: tokenizer splits words on a dash

We have a problem with a ticket number format in which the digits are separated by a dash, i.e. n-nnnnnnn.

The link http://lucidworks.lucidimagination.com/display/solr/Tokenizers (in the sections "Standard Tokenizer" and "Classic Tokenizer") implies that, both before and after support for the Unicode standard annex UAX#29 was added:

Words are split at hyphens unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

Our Solr setup uses only the StandardTokenizerFactory, yet this ticket format is being split into separate tokens at the dash. I am new to Solr/Lucene. I downloaded the source for 3.6.1, and the comments imply the opposite (unless a dashed number is still considered a number). I could not follow the lexer processing:

The tokens produced are of the following types:

  • <ALPHANUM>: a sequence of alphanumeric characters
  • <NUM>: a number
  • <SOUTHEAST_ASIAN>: a sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar and Khmer
  • <IDEOGRAPHIC>: a single CJKV ideographic character
  • <HIRAGANA>: a single hiragana character

Can anyone clarify this? Thanks.

1 answer

You need the Regular Expression Pattern Tokenizer. This tokenizer uses a Java regular expression to split the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or as a pattern matching the tokens that should be extracted from the text.

For more information on Java regular expression syntax, see the Javadocs for java.util.regex.Pattern.
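
To make the two interpretations of the pattern concrete, here is a minimal sketch in plain Java. It uses only java.util.regex and does not call Solr or Lucene, so it only imitates what the tokenizer would do. The class name, the sample sentence and the two expressions (whitespace as delimiter, and `\d-\d{7}` for the n-nnnnnnn format) are assumptions chosen for illustration, not taken from the question's actual schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DashTicketTokens {

    // Delimiter interpretation: the regex describes what separates tokens.
    static List<String> splitOn(String delimiterRegex, String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(delimiterRegex).split(text)) {
            if (!piece.isEmpty()) { // drop an empty leading piece, if any
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Match interpretation: the regex describes the tokens themselves.
    static List<String> extract(String tokenRegex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input containing one ticket number in n-nnnnnnn form.
        String text = "Ticket 1-2345678 was reopened";

        // Splitting on whitespace keeps the dashed number as one token.
        System.out.println(splitOn("\\s+", text));   // [Ticket, 1-2345678, was, reopened]

        // Matching only the ticket format extracts just the ticket number.
        System.out.println(extract("\\d-\\d{7}", text)); // [1-2345678]
    }
}
```

In a Solr field type, the same choice is made through the `pattern` and `group` attributes of solr.PatternTokenizerFactory: a `group` of -1 treats the pattern as a delimiter, while 0 or higher extracts the matching group as the token. Whether whitespace splitting alone is adequate depends on what else appears in the indexed text.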


Source: https://habr.com/ru/post/1447570/

