How to analyze content located in certain HTML tags using a Nutch plugin?

Question

I use Nutch to crawl websites, and I want to parse specific sections of the HTML pages it crawls. For instance:

    <h><title> title to search </title></h>
    <div id="abc"> content to search </div>
    <div class="efg"> other content to search </div>

I want to parse the div elements with id="abc", class="efg", and so on.

I know that I need to create a plugin for custom parsing, because the htmlparser plugin provided by Nutch removes all HTML tags, CSS, and JavaScript content and leaves only the text. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html , but it parses by HTML tag, whereas I want to parse HTML tags whose attribute has a specific value. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.

I need some guidance on a strategy for parsing HTML pages based on tags whose attribute has a specific value.

nutch

abhijeet Jul 31 '13 at 14:02

1 answer

tahagh · Answer 1 · 2013-12-18T12:08:42+0000

You can use this plugin to extract data from your pages based on CSS rules:

https://github.com/BayanGroup/nutch-custom-search

In your example, you can configure it as follows:

    <config>
      <fields>
        <field name="custom_content" />
      </fields>
      <documents>
        <document url=".+" engine="css">
          <extract-to field="custom_content">
            <text>
              <expr value="#abc" />
            </text>
            <text>
              <expr value=".efg" />
            </text>
          </extract-to>
        </document>
      </documents>
    </config>
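To get a feel for what the two rules `#abc` and `.efg` select, here is a minimal standalone sketch that does not use Nutch or the plugin at all. It relies only on the JDK's DOM and XPath classes, so it assumes the markup is well-formed XML; real crawled HTML would need a lenient parser first. The class name `AttributeSelectSketch`, the constants, and the sample page are made up for illustration, and the XPath expressions are rough counterparts of the CSS selectors, not what the plugin runs internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or nutch-custom-search.
final class AttributeSelectSketch {

    // Rough XPath counterpart of the CSS selector "#abc".
    static final String BY_ID_ABC = "//*[@id='abc']";

    // Rough XPath counterpart of ".efg": matches "efg" as one of the
    // whitespace-separated tokens in the class attribute.
    static final String BY_CLASS_EFG =
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' efg ')]";

    // Returns the trimmed text of every element matched by the XPath expression.
    // Assumes the input is well-formed XML; malformed HTML will throw.
    static List<String> select(String wellFormedMarkup, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Sample page modelled on the question, wrapped so it is well-formed.
        String page = "<html><head><title> title to search </title></head><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "</body></html>";
        System.out.println(select(page, BY_ID_ABC));
        System.out.println(select(page, BY_CLASS_EFG));
    }
}
```

With the plugin configured as above you should not need code like this; the sketch only shows which elements the two selectors pick out of the sample page.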