Difference between text_general and text_en in Solr?

I see that I can use a different tokenizer / analyzer for different languages with the text_general field type.
But there is also text_en.

Why do we need two?

Suppose we have a sentence in an Asian language that also contains some English words. Is text_general used for the Asian words in the sentence and text_en for the English words?
How will Solr index / query such sentences?

2 answers

text_en uses stemming, so if you search for fakes, you can match fake, fake's, faking, etc. With a field that is not stemmed, fakes will match only fakes.

Each field type uses a different "chain" of analyzers. text_en uses a filter chain that is better suited to indexing English. See the tokenizer and filter entries.

Schema excerpt for text_general:

    <!-- A general text field that has reasonable, generic cross-language
         defaults: it tokenizes with StandardTokenizer, removes stop words
         from case-insensitive "stopwords.txt" ... -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>

Schema excerpt for text_en:

    <!-- A text field with defaults appropriate for English: it tokenizes
         with StandardTokenizer, removes English stop words
         (lang/stopwords_en.txt), down cases, protects words from
         protwords.txt, and finally applies Porter stemming. The query time
         analyzer also applies synonyms from synonyms.txt. -->
    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>

Why do we need two?

So that you can analyze different content in different ways. You can even analyze the same content in different ways (with copyField) if you want. This gives you more choice at query time about which field to search.
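
As a rough sketch of the copyField approach, the schema fragment below indexes the same content twice, once with text_general and once with text_en. The field names title and title_en are hypothetical and chosen only for illustration, and the indexed/stored attributes are typical values rather than anything taken from the original schema.

```xml
<!-- Hypothetical field names; each field is bound to exactly one fieldType. -->
<field name="title"    type="text_general" indexed="true" stored="true"/>
<field name="title_en" type="text_en"      indexed="true" stored="false"/>

<!-- Copy the raw input of "title" into "title_en" so that the same text
     is also analyzed with the English (stemming) chain. -->
<copyField source="title" dest="title_en"/>
```

A query can then target title for matching on the unstemmed form, or title_en for stemmed English matching.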

Is text_general used for the Asian words in the sentence and text_en for the English words?

No, each field can have only one fieldType, much like a column in a database has a single type.

If you want different analysis for different languages within the same field, look at SmartChineseAnalyzer as an example.
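
A minimal sketch of how that analyzer could be wired into a schema, assuming the Lucene smartcn analysis jar is available on Solr's classpath. The fieldType name text_zh_mixed and the field name body are hypothetical; only the analyzer class name comes from Lucene.

```xml
<!-- Hypothetical fieldType name; uses a complete Lucene analyzer class
     instead of a tokenizer/filter chain. Requires the smartcn jar. -->
<fieldType name="text_zh_mixed" class="solr.TextField" positionIncrementGap="100">
    <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>

<!-- Hypothetical field that holds mixed Chinese-English text. -->
<field name="body" type="text_zh_mixed" indexed="true" stored="true"/>
```

The field still has a single fieldType; it is the analyzer itself that is designed to cope with Chinese text that also contains English words.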

Also see http://docs.lucidworks.com/display/LWEUG/Multilingual+Indexing+and+Search


Source: https://habr.com/ru/post/1484946/

