Best open-source spider for covering an entire site

I want to crawl a large number of websites. The most important requirement is that the spider reaches as much of each site as possible. One key feature that most spiders lack is the ability to execute JavaScript, which is necessary for crawling sites that load content via AJAX. I prefer open source, and I will need to modify the code for my project.

Currently, I believe that Solr, which is different from Lucene, is a very good solution: http://lucene.apache.org/solr/features.html

Has anyone used Solr or Lucene? My biggest problem with Solr is that it cannot run JavaScript. However, its rich feature set and scalability make it attractive.

+3
5 answers

Solr is not a crawler but a search engine (it searches an index to return results).

However, I really like Heritrix for its flexibility. Most crawlers will not run JavaScript (though some, like Heritrix, will try to extract links from it), since executing it rarely makes sense for a crawler, even today. The point is that Heritrix lets you hook in your own classes, so you can implement whatever workarounds you need.
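To illustrate what "extracting links from JavaScript" without running it can look like, here is a minimal sketch in Python. It is not Heritrix's actual extractor. It assumes that string literals starting with http://, https://, / or ./ are candidate links, and the function name extract_links_from_js is made up for this example.

```python
import re
from urllib.parse import urljoin

# Quoted string literals that look like an absolute URL or a path reference.
# This is a heuristic (an assumption of this sketch), so false positives are expected.
_JS_STRING = re.compile(r"""(["'])((?:https?://|/|\.{1,2}/)[^"'\s<>]+)\1""")


def extract_links_from_js(script_source, base_url):
    """Return candidate URLs found in a script's string literals.

    Results are resolved against base_url, kept in order of first
    appearance, and deduplicated. No JavaScript is executed.
    """
    found = []
    for match in _JS_STRING.finditer(script_source):
        url = urljoin(base_url, match.group(2))
        if url not in found:
            found.append(url)
    return found


if __name__ == "__main__":
    script = 'var next = "/api/items?page=2"; load(\'http://example.com/x.html\');'
    for link in extract_links_from_js(script, "http://example.com/app/index.html"):
        print(link)
```

A heuristic like this finds URLs that appear literally in the script, but it misses anything assembled at runtime (for example by string concatenation), which is why it only partly replaces real JavaScript execution.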

+4

Solr is a search server built on Lucene, not a crawler. For crawling, see Apache Nutch. Crawling JavaScript is a separate matter.

+2

Watir may be useful to you. It drives a real browser, so the JavaScript on each page actually runs.

+1

For pages that build their DOM from JavaScript templates, you really need full JavaScript execution in your spider. Check out https://github.com/mikeal/spider for Node.js.
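The shape of such a spider can be sketched in Python, independent of any particular library. This is only an illustration and does not reflect the API of the project linked above. The crawl function and its fetch_rendered_html parameter are hypothetical names: the parameter stands for whatever returns a page's HTML after its scripts have run (in practice, something backed by a real browser), and the crawl loop itself does not care how that HTML was produced.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch_rendered_html, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch_rendered_html(url) is a hypothetical callable supplied by the
    caller. It should return the page's HTML after JavaScript has run,
    or None if the page could not be loaded.
    Returns the list of successfully fetched URLs in visit order.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch_rendered_html(url)
        if html is None:
            continue  # unreachable page: skip it, keep crawling
        visited.append(url)
        parser = _LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Keeping the fetch step pluggable means the same loop can be tried first with a plain HTTP fetcher and later with a browser-backed one, which makes it easy to compare how much more of a site becomes reachable once scripts run.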

0

Source: https://habr.com/ru/post/1728261/

