How to parse content inside specific HTML tags using a Nutch plugin?

I use Nutch to crawl websites, and I want to parse specific sections of the HTML pages it crawls. For instance:

<h><title> title to search </title></h>
<div id="abc"> content to search </div>
<div class="efg"> other content to search </div>

I want to extract the div elements with id="abc" and class="efg", and so on.

I know that I need to write a plugin for custom parsing, because the HTML parser plugin provided by Nutch strips all HTML tags, CSS, and JavaScript and keeps only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html , but it selects content by HTML tag name, whereas I want to select tags whose attribute has a specific value. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.

I need some guidance on a strategy for parsing HTML pages based on tags whose attribute has a specific value.

1 answer

You can use this plugin to extract data from your pages based on CSS selectors:

https://github.com/BayanGroup/nutch-custom-search

In your example, you can configure it as follows:

<config>
  <fields>
    <field name="custom_content" />
  </fields>
  <documents>
    <document url=".+" engine="css">
      <extract-to field="custom_content">
        <text>
          <expr value="#abc" />
        </text>
        <text>
          <expr value=".efg" />
        </text>
      </extract-to>
    </document>
  </documents>
</config>
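
If you would rather write the matching logic yourself inside a custom parse filter, the core of it is a DOM walk that checks an attribute value. The following is only a minimal sketch using the JDK's W3C DOM classes. It assumes your plugin is handed a DOM node for the page (Nutch HTML parse filters receive one as far as I know); the class name AttributeTextExtractor and the extract helper are hypothetical and are not part of Nutch or of the plugin above. The sample input is well-formed XHTML so that the JDK parser can read it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Nutch class): collects the text of every element
// whose attribute `attr` contains `value` as a whitespace-separated token.
class AttributeTextExtractor {

    static List<String> extract(Node root, String attr, String value) {
        List<String> out = new ArrayList<>();
        collect(root, attr, value, out);
        return out;
    }

    private static void collect(Node node, String attr, String value, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            String[] tokens = el.getAttribute(attr).trim().split("\\s+");
            if (Arrays.asList(tokens).contains(value)) {
                out.add(el.getTextContent().trim());
                return; // text of nested elements is already included
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), attr, value, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed stand-in for a crawled page (assumption for this demo).
        String xhtml = "<html><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg other\"> other content to search </div>"
                + "<div class=\"xyz\"> ignored </div>"
                + "</body></html>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(extract(root, "id", "abc"));    // [content to search]
        System.out.println(extract(root, "class", "efg")); // [other content to search]
    }
}
```

Matching on whitespace-separated tokens means class="efg other" still matches "efg", which mirrors how the .efg selector behaves. Real crawled pages are rarely well-formed, so in a plugin you would apply the same walk to the DOM that Nutch's HTML parser has already built rather than re-parsing the raw bytes.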

Source: https://habr.com/ru/post/950720/

