Removing menus from HTML while crawling or indexing with Nutch and Solr

Question

I crawl our large site with Nutch and then index it with Solr, and the results are pretty good. However, the site has several menu structures that get indexed and spoil the query results.

Each of these menus is clearly delimited by a div, such as <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and several others.

At some point in the pipeline I need to delete the content of these divs.

I assume the right place is during indexing by Solr, but I cannot work out how to do this.

The pattern would look something like (<div id="calendar">).*?(<\/div>), but I cannot get it to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />, and I'm not sure where to put it in schema.xml.

When I put this pattern in schema.xml, the file does not parse.

+4

design-patterns solr nutch

hayres Apr 11 '11 at 6:06

4 answers

I think you have several options:

1. Extend the Nutch HTML parser and add logic to strip out the header. (There may be better places for this, for example when you have the raw data but before the DOM is parsed.)
2. Make your site smart enough not to render the header when the crawler visits. This is quite simple to do by checking the User-Agent value in the request header. You may need to seed your crawl better, since the links in the header will no longer be there to help Nutch find other pages.
3. Somehow get Solr to strip the header from the Nutch data. I'm not sure how you would do this, and I think it means you lose some of the synergies of Nutch and Solr.
4. Somehow edit the Nutch index (it is just a Lucene index). In theory, you could iterate over all the documents in the index and trim the relevant field of each document.

I would think the easiest way is #2, if you have a consistent way of rendering the header (such as a skin or a shared layout). After that, maybe #1 and #4. I think #3 would be the most difficult, but I could be wrong.
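
The User-Agent check from the second option could look roughly like the sketch below. It is framework-neutral Python, not Nutch or Solr code. The names is_crawler and render_page and the token "mynutchcrawler" are hypothetical; the token is assumed to match whatever you set as http.agent.name in your Nutch configuration.

```python
# Hypothetical token: replace with the value of http.agent.name from your Nutch config.
CRAWLER_AGENT_TOKENS = ("mynutchcrawler",)


def is_crawler(user_agent):
    """Return True if the User-Agent header looks like our own crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_AGENT_TOKENS)


def render_page(body_html, menu_html, user_agent):
    """Leave the menu markup out when the request comes from the crawler."""
    if is_crawler(user_agent):
        return body_html
    return menu_html + body_html
```

Because the crawler no longer sees the menu links, the seed list has to cover the pages that were previously reachable only through the menus.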

+1

mlathe Sep 26 '11 at 17:16

A new feature was introduced in Nutch 1.12 that uses the Apache Tika parser together with the Boilerpipe algorithm to strip header and footer content from HTML pages at the parsing stage.

We can set the following properties in nutch-site.xml to enable it:

 <!-- parse-tika plugin properties -->
 <property>
   <name>tika.extractor</name>
   <value>boilerpipe</value>
   <description>
     Which text extraction algorithm to use. Valid values are: boilerpipe or none.
   </description>
 </property>
 <property>
   <name>tika.extractor.boilerpipe.algorithm</name>
   <value>DefaultExtractor</value>
   <description>
     Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
     or CanolaExtractor.
   </description>
 </property>

It works for me. Hope this works for others too ... :)

For a detailed review, you can refer to this ticket: https://issues.apache.org/jira/browse/NUTCH-961

+1

Techguy Aug 16 '16 at 19:01

If you want to do this, I believe you should write a custom parser in Nutch, so that the menu content never reaches the index. Basically, once the page has been parsed, the text data is just raw text without any structure, so the divs can no longer be identified.
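
The filtering logic such a parser would need can be sketched as follows. This is plain Python using only the standard library, not a Nutch parse plugin (which would be written in Java); the class MenuStrippingExtractor and the function extract_text are hypothetical names, and the div ids are the ones from the question. It tracks div nesting so that nested divs inside a menu are dropped as well, and it makes no attempt to skip script or style content.

```python
from html.parser import HTMLParser


class MenuStrippingExtractor(HTMLParser):
    """Collect page text while skipping everything inside selected divs."""

    def __init__(self, skip_ids):
        super().__init__(convert_charrefs=True)
        self.skip_ids = set(skip_ids)
        self.skip_depth = 0  # > 0 while we are inside a skipped div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside a skipped div
        elif tag == "div" and dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1  # entering a menu div

    def handle_endtag(self, tag):
        if self.skip_depth and tag == "div":
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def extract_text(html, skip_ids=("RHBOX", "calendar")):
    """Return whitespace-normalized text with the listed divs removed."""
    parser = MenuStrippingExtractor(skip_ids)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Unlike the non-greedy regex in the question, the depth counter does not stop at the first closing div, so a menu that contains inner divs is removed completely.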

0

millebii Apr 11 '11 at 19:19

Mike Sokolov · Accepted Answer · 2011-09-26T17:52:31+0000

Here is a patch for Solr that you can slot into your indexing configuration to ignore the content of custom tags. However, it only works with XML, so it will work if you can tidy your HTML into well-formed markup or you know that it is XHTML, but it will not work with arbitrary HTML.
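
To illustrate the XML-only constraint (this is not the patch itself, just a standard-library Python sketch with a hypothetical function name, strip_elements_by_id), removing elements by id is straightforward once the input is well-formed, and fails outright when it is not:

```python
import xml.etree.ElementTree as ET


def strip_elements_by_id(xhtml, skip_ids):
    """Remove elements whose id is in skip_ids and return the remaining text.

    Raises ET.ParseError if the input is not well-formed XML.
    """
    root = ET.fromstring(xhtml)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("id") in skip_ids:
                # Text following the removed element lives in child.tail; keep it.
                tail = child.tail or ""
                index = list(parent).index(child)
                if index == 0:
                    parent.text = (parent.text or "") + tail
                else:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
    return " ".join("".join(root.itertext()).split())
```

Typical real-world HTML with unclosed tags such as <br> makes the parse step raise an error, which is why the markup has to be tidied first.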
