I am trying to configure Lucene/Solr to correctly index words from news articles. Out of the box with Solr 3.4, WordDelimiterFilterFactory indexes "she'll" as:
she
ll
shell
she'll
I absolutely do not want this to match "shell". That was never the intended word.
Currently, my text goes through a WhitespaceTokenizerFactory, then a StopFilterFactory, and then a WordDelimiterFilterFactory with: generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1" types="specialtypes.txt"
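For reference, the chain described above would look roughly like the following schema.xml fragment. This is only a sketch: the field type name "text_news" is made up for illustration, and the StopFilterFactory attributes are left out because they are not relevant here.

```xml
<!-- Sketch only: the fieldType name "text_news" is hypothetical. -->
<fieldType name="text_news" class="solr.TextField">
  <analyzer>
    <!-- Split on whitespace only, so "she'll" reaches the filters intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Stop word settings omitted; they are not part of the problem -->
    <filter class="solr.StopFilterFactory"/>
    <!-- The apostrophe is treated as a delimiter here, and catenateWords="1"
         then joins the parts back together, which yields "shell" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"
            types="specialtypes.txt"/>
  </analyzer>
</fieldType>
```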
I tried declaring the apostrophe an alpha character in specialtypes.txt with:
' => ALPHA
but then possessives such as "pony's" and "company's" are kept as single words, and that is not what I want either.
Any ideas?