We are at the initial stage of the project, and we are currently trying to decide which crawler is suitable for us.
Our project:
Basically, we are going to set up Hadoop and crawl the Internet for images. Then we will run our own software for indexing the images stored in HDFS, built on Hadoop's MapReduce framework. We will not use any indexing other than our own.
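To make this concrete, here is a rough sketch of the kind of MapReduce indexing job we have in mind, written for Hadoop Streaming so it can stay in Python. The tab-separated record format and the grouping by host are only assumptions for the example, not the final design; a reducer would then aggregate the values per key.

```python
#!/usr/bin/env python3
# Toy Hadoop Streaming mapper: each input line is assumed to be
# "<image_url>\t<hdfs_path>" (a made-up record format for this sketch).
# It emits "<host>\t<hdfs_path>" so a reducer can group stored images by site.
import sys
from urllib.parse import urlsplit

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        url, hdfs_path = line.split("\t", 1)
    except ValueError:
        continue  # skip malformed records
    host = urlsplit(url).netloc.lower()
    print(f"{host}\t{hdfs_path}")
```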
Some specific questions:
- Which crawler is best suited to crawling for images?
- Which crawler adapts best to a distributed setup, in which many servers crawl together?
Right now these look like the three best options:
- Nutch: Known to scale. It doesn't seem like the best option, because it appears to be tightly coupled to its text search software.
- Heritrix: Also scales. Currently, this looks like the best option.
- Scrapy: Not proven at large scale (not sure, though). I don't know if it has basics like URL canonicalization. I would like to use it because it is a Python framework (I like Python more than Java), but I don't know whether it implements the more advanced crawler features; see the sketch below for how I imagine using it.
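For reference, here is a quick Scrapy sketch of what I would try if we go the Python route: a spider that follows links and hands every <img> URL to Scrapy's built-in ImagesPipeline, which downloads the files. The seed URL and the storage path are placeholders, and getting the downloaded files into HDFS would still be a separate step.

```python
# Requires: pip install scrapy Pillow  (Pillow is needed by the ImagesPipeline)
import scrapy


class ImageSpider(scrapy.Spider):
    name = "images"
    start_urls = ["https://example.com/"]  # placeholder seed

    custom_settings = {
        # built-in pipeline that downloads every URL listed under "image_urls"
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_STORE": "/data/images",  # local path; pushing to HDFS comes later
    }

    def parse(self, response):
        # hand every <img> on the page to the ImagesPipeline
        img_urls = [response.urljoin(u)
                    for u in response.css("img::attr(src)").getall()]
        if img_urls:
            yield {"image_urls": img_urls, "page": response.url}
        # keep crawling by following links on the page
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Running this with `scrapy runspider` works for a single machine; whether this approach holds up when spread across many servers is exactly what I am unsure about.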
Summary:
We need to get as many images from the Internet as possible. Which existing crawler is scalable and efficient, and which would be easiest to modify to fetch only images?
Thanks!