Difference Between WhitespaceTokenizerFactory and StandardTokenizerFactory

I am new to Solr. After reading the Solr wiki, I still don't understand the difference between WhitespaceTokenizerFactory and StandardTokenizerFactory. What actually distinguishes them?

1 answer

They differ in how they divide the analyzed text into tokens.

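StandardTokenizer does this according to the following rules (taken from the Lucene javadoc; note that since Lucene 3.1 this grammar lives on as ClassicTokenizer, while StandardTokenizer itself follows the Unicode UAX #29 word-break rules):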

  • Splits words at punctuation characters, removing the punctuation. However, a dot that is not followed by whitespace is considered part of the token.
  • Splits words at hyphens, unless there is a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and Internet host names as one token.
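To make the three rules concrete, here is a rough Python approximation of them. It is only a sketch: the function name `classic_like_tokenize`, the regular expressions, and the punctuation set are my own assumptions, not Lucene's grammar, and the real tokenizer handles many more cases.

```python
import re

# Hypothetical, simplified patterns; Lucene's real grammar is more elaborate.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
HOSTNAME = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")
EDGE_PUNCTUATION = ".,;:!?\"'()[]{}"


def classic_like_tokenize(text):
    """Approximate the three rules listed above (illustration only)."""
    tokens = []
    for chunk in text.split():
        # Drop punctuation at the edges; a trailing dot is followed by
        # whitespace (or the end of the text), so it is not part of the token.
        chunk = chunk.strip(EDGE_PUNCTUATION)
        if not chunk:
            continue
        if EMAIL.match(chunk) or HOSTNAME.match(chunk):
            # Rule 3: email addresses and host names stay whole.
            tokens.append(chunk)
        elif "-" in chunk and any(ch.isdigit() for ch in chunk):
            # Rule 2: a hyphenated token containing a digit is kept whole,
            # like a product number.
            tokens.append(chunk)
        else:
            # Rules 1 and 2: split at hyphens and other punctuation,
            # but keep dots that sit inside a token.
            for piece in re.split(r"[^\w.]+", chunk):
                piece = piece.strip(".")
                if piece:
                    tokens.append(piece)
    return tokens


print(classic_like_tokenize(
    "Send wi-fi specs for XL-5000 to john.doe@example.com, please."
))
# ['Send', 'wi', 'fi', 'specs', 'for', 'XL-5000', 'to', 'john.doe@example.com', 'please']
```

Note how "wi-fi" is split at the hyphen while "XL-5000" survives intact, and how the comma after the email address disappears.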

WhitespaceTokenizer splits on whitespace characters only:

A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
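The whitespace behaviour is simple enough to mimic in one line of Python. This is a sketch, not Lucene code: the name `whitespace_tokenize` is made up, and Python's notion of whitespace may differ slightly from Java's.

```python
def whitespace_tokenize(text):
    # str.split() with no arguments splits on runs of whitespace
    # and never yields empty strings.
    return text.split()


print(whitespace_tokenize("Send wi-fi specs  to john.doe@example.com, please."))
# ['Send', 'wi-fi', 'specs', 'to', 'john.doe@example.com,', 'please.']
```

Punctuation stays glued to the tokens ("please." keeps its dot, the email keeps its trailing comma), and hyphenated words are never split.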

Choose the tokenizer that best suits your application. In any case, you need to use the same analyzers/tokenizers for indexing and searching!
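Why index-time and query-time analysis have to match can be shown with a toy inverted index in Python. Everything here is hypothetical (the two tokenizer functions, `build_index`, `search`, and the sample documents); it only illustrates the principle and is not how Solr is implemented.

```python
import re
from collections import defaultdict


def whitespace_tokenize(text):
    # Split on whitespace only; punctuation and hyphens stay in the tokens.
    return text.split()


def punctuation_tokenize(text):
    # Hypothetical stand-in for a punctuation-aware tokenizer:
    # keep runs of word characters, drop everything else.
    return re.findall(r"\w+", text)


def build_index(docs, tokenize):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query, tokenize):
    # Return ids of documents that contain every token of the query.
    hits = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*hits) if hits else set()


docs = {1: "wi-fi router", 2: "cable modem"}
index = build_index(docs, punctuation_tokenize)  # indexed tokens: wi, fi, router, ...

# Query analyzed differently from the index: the token "wi-fi" was never indexed.
print(search(index, "wi-fi", whitespace_tokenize))   # set()

# Query analyzed the same way as the index: "wi" and "fi" both hit document 1.
print(search(index, "wi-fi", punctuation_tokenize))  # {1}
```

The document is in the index in both cases; only the mismatch between the two tokenizers makes the first search come back empty.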


Source: https://habr.com/ru/post/918907/

