One of my clients is a photo agency specializing in photojournalism (and gossip), which is why many searches by their customers revolve around specific people.
We index about 1.5 million documents, with full-text search on the caption and headline, and unstemmed full-text search on the tags. We have a decent list of stop words, and the client provides a list of protected words that, in their opinion, do not stem correctly. We use Dismax to search across captions, headlines and tags with various boosts. All this works great.
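For context, the Dismax setup looks roughly like the sketch below. The handler name and the `mm` value are assumptions made for illustration; only the three field names and the boost of 100 on tags come from the parsed query shown further down.

```xml
<!-- solrconfig.xml sketch; the handler name "search" is hypothetical -->
<requestHandler name="search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- fields searched; only the tags boost is known from the parsed query -->
    <str name="qf">caption_text headline_text tags_text^100.0</str>
    <!-- assumed value: the ~1 in the parsed query only shows that one
         clause was required once "al" had been dropped -->
    <str name="mm">1</str>
  </lst>
</requestHandler>
```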
However, a few people turn out to be tricky to get right. For example, Al Gore. In Italian, "al" is a stop word, so a simple query for "al gore" (without quotes) becomes:
+((DisjunctionMaxQuery((caption_text:gor | tags_text:gore^100.0 | headline_text:gor)))~1) ()
This brings back hits for the former VP, but of course also for Leslie Gore and Tipper Gore, as well as, because of stemming, hits for "Burn", etc. Leaving aside ordering for a second, it clutters up the results, and I would like to do better.
Wrapping the search terms in quotation marks does not help: "al" is still stripped. Marking "gore" as a protected word gets me halfway there by limiting the number of false positives. I also tried playing with SynonymFilterFactory, but did not get far: I have SynonymFilterFactory as the first filter, and "al" is stripped anyway.
Ideally, "al gore" would be treated as a single token. Is there a way to do this without losing "al"?
Or is this the wrong approach? Should I be using solr.CommonGramsFilterFactory instead?
Version: Solr 1.4.0.
Relevant parts of schema.xml:
<!-- used for headline and caption -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="tagsText" class="solr.TextField" sortMissingLast="true" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
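To make the CommonGrams idea concrete, this is roughly the variant of the `text` field type I have in mind. It is an untested sketch: the field type name `textCommonGrams` is made up, and the filter order is my assumption. The common-grams filter sits after WordDelimiterFilterFactory (which would otherwise split the joined `al_gore` token on the underscore) and before StopFilterFactory (so "al" is still present when the bigram is built).

```xml
<!-- hypothetical variant of the "text" field type; untested sketch -->
<fieldType name="textCommonGrams" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits bigrams such as al_gore alongside the single words -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
    <!-- drops the bare stop word "al" but keeps the al_gore bigram -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- query-side counterpart: prefers the bigram where one can be formed -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

One caveat I am aware of: as far as I can tell, the query parser splits unquoted input on whitespace before the analyzer runs, so the bigram would only be formed for a quoted "al gore" (or through a phrase field such as `pf`), not for the bare two-word query.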