Correct word splitting with apostrophes in Solr

I am trying to configure Lucene/Solr to correctly index words coming from news articles. As of now, out of the box with Solr 3.4, WordDelimiterFilterFactory indexes "she'll" as the tokens: she, ll, she'll, shell

So "she'll" ends up matching a search for "shell". I absolutely do not want it to match "shell". That was not the intended word.

Currently, my text goes through a WhitespaceTokenizerFactory, then StopFilterFactory, and then WordDelimiterFilterFactory with: generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1" types="specialtypes.txt"
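
Written out as a schema.xml field type, the chain described above would look roughly like the sketch below. The field type name text_news and the StopFilterFactory attributes (ignoreCase, words="stopwords.txt") are assumptions for illustration. They are not stated in the question.

```xml
<!-- Sketch of the analyzer chain from the question.
     "text_news", ignoreCase and stopwords.txt are assumed, not from the original config. -->
<fieldType name="text_news" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "she'll" reaches the filters as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- catenateWords="1" is what joins the parts "she" + "ll" into "shell" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0" preserveOriginal="1"
            types="specialtypes.txt"/>
  </analyzer>
</fieldType>
```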

I tried to declare the apostrophe as an alpha character in specialtypes.txt, with: ' => ALPHA
but then the whole token, apostrophe included, is kept as a single word (for example "pony's" or "company's"), and this is not what I want either.

Any ideas?

+4
2 answers

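Use the protected="protwords.txt" attribute of solr.WordDelimiterFilterFactory: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Tokens listed in the protected words file are passed through the filter unchanged, so they are neither split nor catenated.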
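
A minimal sketch of how this could be wired up, assuming the filter attributes from the question are kept as they are. The entries shown for protwords.txt are hypothetical examples, not a complete list.

```xml
<!-- Sketch only: the filter from the question with a protected words file added. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="1"
        types="specialtypes.txt"
        protected="protwords.txt"/>

<!-- Hypothetical contents of protwords.txt, one token per line:
     she'll
     he'll
     we'll
-->
```

As far as I can tell, the protected list is compared against the whole token, so a token that still carries trailing punctuation from the whitespace tokenizer (such as "she'll,") would probably not be protected. The list also has to be maintained by hand.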

+1

Use solr.ApostropheFilterFactory.

It strips all characters after an apostrophe (including the apostrophe itself).

Example: "Türkiye'de", "2003'te" ==> "Türkiye", "2003".
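
A minimal sketch of an analyzer using this filter. The field type name text_apostrophe and the choice of tokenizer are assumptions for illustration. Note that this factory is not part of Solr 3.4. To my knowledge it was added later, in the 4.x line, so this only applies after an upgrade.

```xml
<!-- Sketch only: "text_apostrophe" and the tokenizer choice are assumed. -->
<fieldType name="text_apostrophe" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Drops the apostrophe and everything after it: "she'll" becomes "she" -->
    <filter class="solr.ApostropheFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place "she'll" is indexed as "she", so it can no longer match "shell". The trade-off is that the suffix is discarded entirely, and a possessive such as "company's" is reduced to "company".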

+2

Source: https://habr.com/ru/post/1541164/

