I crawl our large site with Nutch and then index it with Solr, and the results are pretty good. However, the site has several menu structures that get indexed and pollute the query results.
Each of these menus is clearly delimited by a DIV, for example <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and there are several others.
I need to strip the content of these DIVs at some point in the pipeline.
I assume the right place is during indexing by Solr, but I cannot work out how to do it.
The pattern would look something like (<div id="calendar">).*?(<\/div>), but I cannot get it to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />, and I'm not sure where to put it in schema.xml.
When I put this pattern in schema.xml, the file fails to parse.
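One likely reason for the parse failure is that the pattern attribute contains raw double quotes and a raw < character, neither of which is allowed unescaped inside a double-quoted XML attribute value. Below is a rough sketch of a well-formed variant, not a confirmed solution. It assumes that a char filter (solr.PatternReplaceCharFilterFactory) is a better fit than a tokenizer for deleting text, and that the raw HTML actually reaches Solr, which may not be the case if Nutch has already reduced the page to plain text. The field type name text_stripped is hypothetical.

```xml
<!-- Sketch only: "text_stripped" is a made-up name for illustration. -->
<fieldType name="text_stripped" class="solr.TextField">
  <analyzer>
    <!-- In an attribute value, < is written &lt; and " is written &quot;. -->
    <!-- (?s) lets . match newlines so a div spanning several lines is covered. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;div id=&quot;calendar&quot;>.*?&lt;/div>"
                replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Note that the non-greedy .*? stops at the first closing </div>, so a menu that contains nested DIVs would only be partly removed. A char filter also only changes what is tokenized and indexed, not the stored field value.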