We have a problem with a ticket number format that consists of digits separated by a dash, i.e. n-nnnnnnn.
The documentation at http://lucidworks.lucidimagination.com/display/solr/Tokenizers (in the sections "Standard Tokenizer" and "Classic Tokenizer") implies that, both before and after support for the Unicode standard annex UAX#29 was added:
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
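To make the mismatch concrete, here is a rough sketch in plain Python of the rule as I read it from the documentation, next to the behaviour we actually see. This is not Lucene code: the function names and the sample ticket number are made up for illustration, and the "documented" function only reflects my reading of the sentence above.

```python
import re


def split_per_documented_rule(text):
    """Split each word at hyphens unless the word contains a digit."""
    tokens = []
    for word in text.split():
        if any(ch.isdigit() for ch in word):
            # A digit is present: keep the word, hyphens included.
            tokens.append(word)
        else:
            # No digit: break the word at every hyphen.
            tokens.extend(part for part in word.split("-") if part)
    return tokens


def split_as_observed(text):
    """Always split at hyphens, which is what our index appears to do."""
    return [part for part in re.split(r"[\s-]+", text) if part]


ticket = "1-1234567"  # made-up ticket number in the n-nnnnnnn format
print(split_per_documented_rule(ticket))        # ['1-1234567']
print(split_as_observed(ticket))                # ['1', '1234567']
print(split_per_documented_rule("well-known"))  # ['well', 'known']
```

The first result is what I expected from the documentation; the second is what we get.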
Our Solr setup uses only the StandardTokenizerFactory, yet this ticket format is being split into separate tokens at the dash. I am new to Solr/Lucene. I downloaded the code for 3.6.1, and the comments imply the opposite (unless a dotted number is still considered a number). I could not follow the lex processing:
- Tokens produced are of the following types:
- <ALPHANUM>: a sequence of alphabetic and numeric characters
- <NUM>: a number
- <SOUTHEAST_ASIAN>: a sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar and Khmer
- <IDEOGRAPHIC>: a single CJKV ideographic character
- <HIRAGANA>: a single hiragana character
Can anyone clarify this? Thanks.