I want to crawl a large number of websites. The most important consideration is that the spider can reach as much of each site as possible. A key feature that most spiders lack is the ability to execute JavaScript, which is necessary to crawl sites that load content via AJAX. I strongly prefer open source, and I will need to modify the code for my project.
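To illustrate the JavaScript limitation, here is a minimal sketch in Python using only the standard library. The page content and the names `LinkCollector`, `extract_static_links` and `SAMPLE_PAGE` are made up for this example and do not come from any particular crawler. It shows that a parser reading only the static markup never sees a link that a script would add at load time.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the static markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_static_links(html):
    """Return the links a non-JavaScript spider would find in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one link is in the markup, the other is only
# created when a browser runs the script.
SAMPLE_PAGE = """
<html><body>
<a href="/about">About</a>
<div id="menu"></div>
<script>
document.getElementById("menu").innerHTML = '<a href="/products">Products</a>';
</script>
</body></html>
"""

print(extract_static_links(SAMPLE_PAGE))  # ['/about']
```

The `/products` link is missed because the script body is treated as plain text and never executed. That is the gap I would need the spider to close.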
Currently, I believe that Solr, which is different from Lucene, is a very good solution.
http://lucene.apache.org/solr/features.html
Has anyone used Solr or Lucene? My biggest problem with Solr is that it cannot run JavaScript; however, its rich feature set and scalability make it attractive.