Web Crawler / Scraper - Build or Buy?

It seems to me that by now one tool would have risen to dominance, because the process seems quite generic: specify the start URL, interact with its forms and scripts, follow the links, download the data, rinse, repeat. Although I've always gotten a certain satisfaction from writing one-off applications that jump through hoops to get a few hundred gigabytes of documents onto my hard drive, I wonder if I'm just reinventing the wheel.
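
To make that concrete, here is a rough sketch of the kind of one-off crawler I keep writing (Python; it assumes the third-party requests and beautifulsoup4 packages, and is_document stands in for whatever site-specific rule decides which pages to save). It works fine right up until the navigation is generated by JavaScript:

    # Rough sketch of the "generic" process: start URL, follow links,
    # download documents, rinse, repeat.
    import os
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, is_document, out_dir="docs", max_pages=1000):
        os.makedirs(out_dir, exist_ok=True)
        seen, queue = {start_url}, deque([start_url])
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            resp = requests.get(url, timeout=30)
            fetched += 1
            if is_document(url, resp):
                # Keep the raw bytes; the naming scheme here is arbitrary.
                name = urlparse(url).path.strip("/").replace("/", "_") or "index"
                with open(os.path.join(out_dir, name), "wb") as f:
                    f.write(resp.content)
                continue
            # Otherwise treat it as a navigation page and queue its links.
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append(link)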

I admit that I haven't used some of the commercial products, such as Automation Anywhere, but since I'm trying to spend my working time doing what I actually enjoy, analyzing the data rather than fetching it, I'm hoping the wisdom of the crowd here can point me toward a definitive answer. Or are there simply too many site-specific quirks for one tool to handle nearly every situation?

And let me clarify, or complicate, this: I've looked at several browser macro tools like iRobot and iOpus and found them slow. For serious document collection I'd want to run the scrapers on a cluster or in the cloud, and I just don't know how those tools would behave in that environment. For my use case, let's say I want to:

  • fetch roughly a million documents
  • from a site that does not require a login but uses JavaScript heavily for navigation, and
  • use Amazon or Azure servers to get the job done.

An example is this site from the US Census (there are more efficient ways to get data from them, but the site's style is a good example of the volume of data and navigation involved):

http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?ref=addr&refresh=t

2 answers

Since this tends to be something of a grey area in the software world, such tools seem to be appearing only slowly.

There is considerable work being done in related areas, such as automated site testing with headless browsers (along the lines of the iRobot and iOpus you mention). I see Selenium mentioned a lot, and there are some interesting tools built on Apple's WebKit, such as PhantomJS, but I can't comment on their speed or "cloud-ability".
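
For what it's worth, driving a JavaScript-heavy page from Python with Selenium looks roughly like this (a minimal sketch with a recent Selenium and headless Chrome; the CSS selector is a placeholder you would adapt to the target site):

    # Load a JavaScript-heavy page in headless Chrome, let its scripts run,
    # then pull the rendered links out of the DOM.  Requires the selenium
    # package and a matching chromedriver on the PATH.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")   # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("http://factfinder2.census.gov/faces/nav/jsf/pages/"
                   "searchresults.xhtml?ref=addr&refresh=t")
        driver.implicitly_wait(10)       # wait up to 10 s for elements to appear
        # Placeholder selector; target the site's real result links instead.
        for link in driver.find_elements(By.CSS_SELECTOR, "a"):
            print(link.get_attribute("href"))
        html = driver.page_source        # the fully rendered DOM, ready to parse
    finally:
        driver.quit()

The cost is that every page load runs a full browser engine, which is exactly why these tools feel slow compared with plain HTTP fetching.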

An interesting option that has been gaining significant traction recently may be the node.js JavaScript runtime. Last I checked (about 6 months ago), there were some projects using node for scraping with a very lightweight JavaScript-interpreting browser.... And I believe there are already options for running node in the cloud.

However, AFAIK the fastest scrapers are still those that don't interpret JavaScript at all and instead rely on the developer dissecting the site's HTTP requests up front, the old-fashioned way.
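
In other words, open the browser's developer tools, watch which requests the page's scripts actually fire while you navigate, and then replay those requests yourself. A hedged sketch using the requests package; the endpoint and parameters below are hypothetical, the real ones come from inspecting the site's network traffic:

    # Replay the HTTP requests the page's own JavaScript would have made,
    # skipping the browser entirely.  Endpoint and parameters are made up.
    import requests

    session = requests.Session()
    session.headers["User-Agent"] = "my-crawler/0.1 (contact@example.com)"

    for page in range(1, 101):
        resp = session.get(
            "http://example.com/api/search",           # hypothetical JSON endpoint
            params={"q": "population", "page": page},  # hypothetical parameters
            timeout=30,
        )
        resp.raise_for_status()
        for record in resp.json().get("results", []):
            print(record)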


Apache Nutch is a very powerful crawler.

Granted, it's written in Java, but if you're familiar with C#, Java shouldn't feel too foreign. Some people are put off by Nutch's complexity, but for anyone serious about this, it will be far easier to learn Nutch's quirks and caveats than to build a comparable web crawler from scratch.

