Web Crawler / Scraper - Build or Buy?

It seems to me that by now one tool would have risen to dominance, because the process seems quite generic: specify the start URL, interact with its forms and scripts, follow the links, download the data, rinse, repeat. Although I've always gotten a certain satisfaction from writing one-off applications that jump through hoops to get a few hundred gigabytes of documents onto my hard drive, I wonder if I'm just reinventing the wheel.
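
To make that concrete, here is a rough sketch of the kind of one-off crawler I keep writing (Python; it assumes the third-party requests and beautifulsoup4 packages, and is_document stands in for whatever site-specific rule decides which pages to save). It works fine right up until the navigation is generated by JavaScript:

    # Rough sketch of the "generic" process: start URL, follow links,
    # download documents, rinse, repeat.
    import os
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, is_document, out_dir="docs", max_pages=1000):
        os.makedirs(out_dir, exist_ok=True)
        seen, queue = {start_url}, deque([start_url])
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            resp = requests.get(url, timeout=30)
            fetched += 1
            if is_document(url, resp):
                # Keep the raw bytes; the naming scheme here is arbitrary.
                name = urlparse(url).path.strip("/").replace("/", "_") or "index"
                with open(os.path.join(out_dir, name), "wb") as f:
                    f.write(resp.content)
                continue
            # Otherwise treat it as a navigation page and queue its links.
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append(link)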

I admit that I haven't used some of the commercial products, such as Automation Anywhere, but since I'm trying to spend my working time doing what I actually enjoy, analyzing the data rather than fetching it, I'm hoping the wisdom of the crowd here can point me toward a definitive answer. Or are there simply too many site-specific quirks for one tool to handle nearly every situation?

And let me clarify, or complicate, this: I've looked at several browser macro tools like iRobot and iOpus and found them slow. For serious document collection I'd want to run the scrapers on a cluster or in the cloud, and I just don't know how those tools would behave in that environment. For my use case, let's say I want to:

  • fetch roughly a million documents
  • from a site that does not require a login but uses JavaScript heavily for navigation, and
  • use Amazon or Azure servers to get the job done.

An example is this site from the US Census (there are more efficient ways to get data from them, but the site's style is a good example of the volume of data and navigation involved):

http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?ref=addr&refresh=t

2 answers

Since this tends to be something of a grey area in the software world, such tools seem to be appearing only slowly.

There is considerable work being done in related areas, such as automated site testing with headless browsers (along the lines of the iRobot and iOpus you mention). I see Selenium mentioned a lot, and there are some interesting tools built on Apple's WebKit, such as PhantomJS, but I can't comment on their speed or "cloud-ability".
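
For what it's worth, driving a JavaScript-heavy page from Python with Selenium looks roughly like this (a minimal sketch with a recent Selenium and headless Chrome; the CSS selector is a placeholder you would adapt to the target site):

    # Load a JavaScript-heavy page in headless Chrome, let its scripts run,
    # then pull the rendered links out of the DOM.  Requires the selenium
    # package and a matching chromedriver on the PATH.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")   # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("http://factfinder2.census.gov/faces/nav/jsf/pages/"
                   "searchresults.xhtml?ref=addr&refresh=t")
        driver.implicitly_wait(10)       # wait up to 10 s for elements to appear
        # Placeholder selector; target the site's real result links instead.
        for link in driver.find_elements(By.CSS_SELECTOR, "a"):
            print(link.get_attribute("href"))
        html = driver.page_source        # the fully rendered DOM, ready to parse
    finally:
        driver.quit()

The cost is that every page load runs a full browser engine, which is exactly why these tools feel slow compared with plain HTTP fetching.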

An interesting option that has been gaining significant traction recently may be the node.js JavaScript runtime. Last I checked (about 6 months ago), there were some projects using node for scraping with a very lightweight JavaScript-interpreting browser.... And I believe there are already options for running node in the cloud.

However, AFAIK the fastest scrapers are still those that don't interpret JavaScript at all and instead rely on the developer dissecting the site's HTTP requests up front, the old-fashioned way.
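
In other words, open the browser's developer tools, watch which requests the page's scripts actually fire while you navigate, and then replay those requests yourself. A hedged sketch using the requests package; the endpoint and parameters below are hypothetical, the real ones come from inspecting the site's network traffic:

    # Replay the HTTP requests the page's own JavaScript would have made,
    # skipping the browser entirely.  Endpoint and parameters are made up.
    import requests

    session = requests.Session()
    session.headers["User-Agent"] = "my-crawler/0.1 (contact@example.com)"

    for page in range(1, 101):
        resp = session.get(
            "http://example.com/api/search",           # hypothetical JSON endpoint
            params={"q": "population", "page": page},  # hypothetical parameters
            timeout=30,
        )
        resp.raise_for_status()
        for record in resp.json().get("results", []):
            print(record)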


Apache Nutch is a very powerful crawler.

Granted, it's written in Java, but if you're familiar with C#, Java shouldn't feel too foreign. Some people are put off by Nutch's complexity, but for anyone serious about this, it will be far easier to learn Nutch's quirks and caveats than to build a comparable web crawler from scratch.

