Removing menus from HTML while crawling or indexing with Nutch and Solr

I crawl our large site with Nutch and then index it with Solr, and the results are pretty good. However, the site has several menu structures that get indexed and pollute the query results.

Each of these menus is clearly delimited by a div, for example <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and there are several others.

At some point in the pipeline I need to remove the content of these divs.

I assume the right place is during indexing by Solr, but I cannot work out how to do it.

The pattern would look something like (<div id="calendar">).*?(<\/div>), but I cannot get it to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />, and I'm not sure where to put it in schema.xml.

When I put this pattern in schema.xml, the file fails to parse.

+4
4 answers

Here is a patch for Solr that you can put in your indexing configuration to ignore the content of custom tags. However, it only works with XML, so it will work if you can tidy your HTML into well-formed markup or know that it is XHTML, but it will not work with arbitrary HTML.

+1

I think you have several options:

  • Extend the Nutch HTML parser and add logic to strip the header. (There may be better places for this, for example when you have the raw data but before the DOM is parsed.)
  • Make your site smart enough not to render the header when the crawler visits. This is quite simple to do by checking the User-Agent value in the request headers. You may need to seed your crawl better, since the links in the header will no longer be there to help Nutch discover other pages.
  • Somehow get Solr to remove the header from the Nutch data. I'm not sure how you would do this, and I think it means losing some of the synergy between Nutch and Solr.
  • Somehow edit the Nutch index (which is just a Lucene index). In theory, you could walk through all the documents in the index and trim the relevant field of each document.

I would think the easiest approach is the second option, provided you have a consistent way of rendering the header (such as a skin or a common include). After that, perhaps the first and fourth options. I think the third will be the most difficult, but I could be wrong.
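To make the User-Agent option a bit more concrete, here is a minimal sketch in Python. It is not tied to any particular web framework: is_crawler and render_page are hypothetical helper names, and the "nutch" token is an assumption that has to match whatever you set in Nutch's http.agent.name property.

```python
def is_crawler(user_agent, crawler_tokens=("nutch",)):
    """Return True if the User-Agent header looks like our own crawler.

    crawler_tokens is an assumption: use whatever string you configured
    as http.agent.name in nutch-site.xml.
    """
    ua = (user_agent or "").lower()
    return any(token in ua for token in crawler_tokens)


def render_page(body_html, menu_html, user_agent):
    """Hypothetical page renderer: leave the menus out for the crawler."""
    if is_crawler(user_agent):
        return body_html
    return menu_html + body_html
```

Because the crawler no longer sees the menu links, pages that are reachable only through those menus have to be supplied some other way, for example in the seed list.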

+1

Nutch 1.12 introduced a feature, built on the Apache Tika parser, that uses the Boilerpipe algorithm to strip header and footer content from HTML pages at the parsing stage itself.

Setting the following properties in nutch-site.xml enables it:

    <!-- parse-tika plugin properties -->
    <property>
      <name>tika.extractor</name>
      <value>boilerpipe</value>
      <description>
        Which text extraction algorithm to use. Valid values are: boilerpipe or none.
      </description>
    </property>
    <property>
      <name>tika.extractor.boilerpipe.algorithm</name>
      <value>DefaultExtractor</value>
      <description>
        Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
        or CanolaExtractor.
      </description>
    </property>

It works for me. Hope it works for others too ... :)

For details, see this ticket: https://issues.apache.org/jira/browse/NUTCH-961

+1

If you want to do this, I believe you should write a custom parser in Nutch so that the data sent to the index never contains the menu content. Once the page has been parsed, the text is just raw text without any structure, so the divs can no longer be identified.
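As a rough illustration of what such a parser step has to do, here is a standalone Python sketch. It is not Nutch code (a real Nutch parse plugin would be written in Java); MenuStripper and strip_menus are hypothetical names, and the ids come from the question. The point is that the divs are dropped while the markup is still available, and that nested divs are counted so the skip ends at the matching closing tag.

```python
from html.parser import HTMLParser


class MenuStripper(HTMLParser):
    """Collect page text, skipping everything inside divs with given ids."""

    def __init__(self, skip_ids):
        super().__init__(convert_charrefs=True)
        self.skip_ids = set(skip_ids)
        self.depth = 0    # div nesting depth inside a skipped div, 0 = not skipping
        self.chunks = []  # text fragments that are kept

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a skipped block
        elif dict(attrs).get("id") in self.skip_ids:
            self.depth = 1   # entering a block we want to drop

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def strip_menus(html, skip_ids):
    parser = MenuStripper(skip_ids)
    parser.feed(html)
    parser.close()
    # Fragments are joined with spaces and whitespace is collapsed,
    # which is good enough for text that is going into an index.
    return " ".join(" ".join(parser.chunks).split())


page = (
    '<p>Intro</p>'
    '<div id="calendar"><div class="m">May</div><div>June</div></div>'
    '<div id="main">Real <b>content</b></div>'
    '<div id="RHBOX">links</div>'
)
print(strip_menus(page, {"calendar", "RHBOX"}))  # Intro Real content
```

A non-greedy pattern such as (<div id="calendar">).*?(<\/div>) stops at the first closing div, so on this page it would leave "June" behind. That is one reason to do the removal with a parser rather than a regex.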

0

Source: https://habr.com/ru/post/1347498/

