I use Nutch to crawl websites, and I want to extract specific sections of the HTML pages it crawls. For instance, given:
<h><title> title to search </title></h> <div id="abc"> content to search </div> <div class="efg"> other content to search </div>
I want to extract the div element with id="abc", the div element with class="efg", and so on.
I know that I need to create a plugin for custom parsing, because the htmlparser plugin provided by Nutch removes all HTML tags, CSS and JavaScript and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html , but it selects content by HTML tag name, whereas I want to select HTML tags that have an attribute with a specific value. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.
I need some guidance on how to develop a strategy for parsing html pages based on tags with an attribute that has a specific value.
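To make the goal concrete, here is a minimal sketch of the kind of attribute-based selection I have in mind, written against the standard org.w3c.dom API only. The class name AttributeExtractor and its methods are hypothetical names made up for illustration. I am assuming that a Nutch HtmlParseFilter plugin receives the parsed page as a DOM fragment, so that a walk like this could be called from inside it; the parseXhtml helper below exists only so the sketch runs on its own, and it only accepts well-formed markup. Note that it compares the attribute value exactly, so an element with several classes (class="efg xyz") would need the value split on whitespace first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch or Jericho.
public class AttributeExtractor {

    /** Returns the trimmed text of every element under root whose attribute attrName equals attrValue. */
    public static List<String> textByAttribute(Node root, String attrName, String attrValue) {
        List<String> matches = new ArrayList<>();
        collect(root, attrName, attrValue, matches);
        return matches;
    }

    // Depth-first walk over the DOM tree, in document order.
    private static void collect(Node node, String attrName, String attrValue, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent, so this is null-safe.
            if (attrValue.equals(el.getAttribute(attrName))) {
                out.add(el.getTextContent().trim());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), attrName, attrValue, out);
        }
    }

    /** Stand-in for the DOM a crawler would supply; only handles well-formed (XHTML-like) input. */
    public static Node parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title> title to search </title></head><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "</body></html>";
        Node root = parseXhtml(page);
        System.out.println(textByAttribute(root, "id", "abc"));    // prints [content to search]
        System.out.println(textByAttribute(root, "class", "efg")); // prints [other content to search]
    }
}
```

What I do not know is how to wire something like this (or the Jericho equivalent) into a Nutch parse plugin so that the extracted text ends up in the parse metadata.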