Correct word splitting with apostrophes in Solr

Question

I am trying to configure Lucene / Solr to correctly index words coming from news articles. As of now, out of the box with Solr 3.4, WordDelimiterFilterFactory splits "she'll" at the apostrophe and also indexes the catenated form "shell".

But "she'll" should be treated the same as "she will", so it should match "she", "she'll", and "she will". I absolutely do not want it to match "shell". That is not an intended word.

Currently, my text goes through a WhitespaceTokenizerFactory, then a StopFilterFactory, and then a WordDelimiterFilterFactory configured with: generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1" types="specialtypes.txt"
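
For reference, the chain described above might look roughly like this in schema.xml. This is only a sketch: the field type name "text_news" and the stop word file "stopwords.txt" are assumptions for illustration, while the WordDelimiterFilterFactory attributes are the ones listed in the question.

```xml
<!-- Sketch only: "text_news" and "stopwords.txt" are assumed names, not taken from the question. -->
<fieldType name="text_news" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "she'll" reaches the filters as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <!-- Attributes as given in the question -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"
            types="specialtypes.txt"/>
  </analyzer>
</fieldType>
```

With catenateWords="1" the filter joins adjacent word parts after splitting, which is where the unwanted "shell" token comes from.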

I tried declaring the apostrophe an alpha character in specialtypes.txt with: ' => ALPHA
but then possessives such as "pony's" and "company's" are kept as single words, and this is not what I want.

Any ideas?

+4

lucene solr

Uberbug May 19, '14 at 19:10

2 answers

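You could use solr.ApostropheFilterFactory.

From the documentation:

Strips all characters after an apostrophe (including the apostrophe itself).
Example: "Türkiye'de", "2003'te" ==> "Türkiye", "2003".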
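
A minimal sketch of how this filter could be wired into an analyzer, assuming a Solr release that ships it (as far as I know it was added in the 4.x line, so it would not be available in the 3.4 mentioned in the question). The field type name "text_apostrophe" is hypothetical.

```xml
<!-- Sketch only: "text_apostrophe" is an assumed name; requires a release that includes ApostropheFilterFactory. -->
<fieldType name="text_apostrophe" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Drops the apostrophe and everything after it within each token -->
    <filter class="solr.ApostropheFilterFactory"/>
  </analyzer>
</fieldType>
```

Under this setup "she'll" would be indexed as "she" and never as "shell", but "pony's" would likewise be reduced to "pony", so it only helps if losing the suffix is acceptable.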

+2

notdang 20 '14 2:28

John Petrone · Accepted Answer · 2014-05-20T18:36:55+0000

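Use the protected="protwords.txt" attribute of solr.WordDelimiterFilterFactory: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

List the contractions you want kept intact in protwords.txt; tokens listed in that file pass through the filter unchanged.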
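
As a rough sketch, the protected attribute could be added to the filter from the question like this. The entries shown for protwords.txt are only examples of what one might list, not a complete set.

```xml
<!-- Sketch only: same attributes as in the question, plus protected="protwords.txt". -->
<filter class="solr.WordDelimiterFilterFactory"
        protected="protwords.txt"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="0"
        splitOnCaseChange="0"
        preserveOriginal="1"
        types="specialtypes.txt"/>
<!--
  protwords.txt is a plain text file with one token per line, for example:
    she'll
    he'll
    we'll
-->
```

To my understanding the lookup is against the whole token as it arrives at the filter, so variants with different casing or trailing punctuation (for example "She'll" or "she'll,") would need to be normalized earlier in the chain or listed separately.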
