How to parse HTML with Nutch and index a specific tag in Solr?

I installed Nutch and Solr to crawl a site and search it. As you know, we can index the meta tags of web pages in Solr using the metatags plugins (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is a way (a plugin or anything else) to index another HTML tag that is not a meta tag, such as the following:

<div id=something> me specific tag </div> 

That is, I want to add a field to Solr (something) whose value for this page is "me specific tag".

Any ideas?

+4
4 answers

I wrote my own plugin for something like this. The configuration file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml. You can add your own fields there, but you still have to populate those fields somewhere.

Here are some tips for the plugin:

  • read http://wiki.apache.org/nutch/WritingPluginExample , which explains how to write a simple plugin
  • in your plugin, extend the ParseFilter and IndexingFilter extension points
  • in YourParseFilter you can use NodeWalker to find your specific div
  • put the parsed data into the page metadata like this:

    page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

  • in YourIndexingFilter, read the metadata from the page (page.getMetadata) and add it to the NutchDocument:

    doc.add("your_specific_tag", value);

  • most important: declare your_specific_tag in these files:

    • Solr schema.xml configuration file (and restart Solr)

    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

    • Nutch schema.xml configuration file (I don't know if this is really necessary)
    • Nutch solrindex-mapping.xml configuration file

    <field dest="your_specific_tag" source="your_specific_tag"/>
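The parse-side steps above can be sketched with the JDK alone. This is only an illustration, not the Nutch API: it uses a plain recursive DOM walk in place of Nutch's NodeWalker, and the class name DivExtractor and its helper methods are hypothetical. It also assumes the input is well-formed XHTML with a quoted id attribute, because the JDK XML parser rejects the unquoted id=something from the question. The ByteBuffer helpers mirror the kind of value passed to page.putToMetadata.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; none of these names come from Nutch.
class DivExtractor {

    // Parse a well-formed XHTML string into a DOM tree.
    // Assumption: inside a real parse filter a DOM would already be available.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Depth-first walk: trimmed text of the first <div> with the given id, or null.
    static String findDivText(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName()) && id.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findDivText(child, id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Encode the extracted text as a ByteBuffer, the shape of value stored in page metadata.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a stored value back to a String without moving the buffer's position.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\"> me specific tag </div></body></html>";
        String text = findDivText(parse(page), "something");
        ByteBuffer stored = toMetadataValue(text);
        System.out.println(fromMetadataValue(stored)); // prints "me specific tag"
    }
}
```

In an actual plugin, the walk would live in the parse filter and the decoding step in the indexing filter, just before doc.add("your_specific_tag", value).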

+3

Just try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html . The tutorial shows how to extract the img tag and covers all the steps involved.

+2

You can use one of these custom plugins to extract content from pages with XPath expressions (or CSS selectors):

+1

You can check out Nutch Plugin, which should let you extract an element from a web page.

0

Source: https://habr.com/ru/post/1433220/

