It seems to me that by now one tool would have risen to dominance, because the process seems quite generic: specify a start URL, interact with the site's forms and scripts, follow the links, download the data, rinse, repeat. Although I've always gotten a certain satisfaction out of writing one-off applications that jump through hoops to get a few hundred gigabytes of documents onto my hard drive, I wonder whether I'm just reinventing the wheel.
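To make "quite generic" concrete, here is roughly the loop I keep rewriting, as a minimal sketch in Python with requests and BeautifulSoup (the library choice, the `looks_like_document` test, and the save-to-disk helper are all illustrative, not any particular tool's API):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def save(url, content):
    # Illustrative sink: a real run would write to disk or object storage.
    name = url.rstrip("/").rsplit("/", 1)[-1] or "index.html"
    with open(name, "wb") as f:
        f.write(content)

def crawl(start_url, looks_like_document, max_pages=1000):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=30)
        if looks_like_document(url, resp):
            save(url, resp.content)   # it's a document: keep it
            continue
        # It's a navigation page: enqueue every outgoing link.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
```

Every site then needs its own `looks_like_document` test, throttling, and error handling, which is exactly the per-site quirk work I keep redoing.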
I admit that I have not tried some of the commercial products, such as Automation Anywhere, but since I'm trying to spend my working time on what I actually enjoy, analyzing data rather than fetching it, I'm hoping the wisdom of the crowd here can point me to a definitive answer. Or are there simply too many site-specific quirks for any one tool to handle nearly every situation?
And let me clarify, or perhaps complicate, this: I have looked at several browser-macro tools such as iRobot and iOpus and found them slow. For serious document collection I'd want to run scrapers on a cluster or in the cloud, and I don't know how those tools would behave in that environment. For my use case, say I want to:

- retrieve about a million documents
- from a site that does not require a login but uses JavaScript heavily for navigation (see the sketch after this list)
- using Amazon or Azure servers to get the job done.
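For the JavaScript-heavy part, the closest I've gotten to something general is driving a headless browser from each cloud worker. Below is a minimal sketch using Selenium with headless Chrome against the Census page linked further down; the 30-second wait and the one-slice-per-worker idea are my own assumptions, not anything a specific tool prescribes:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

START = ("http://factfinder2.census.gov/faces/nav/jsf/pages/"
         "searchresults.xhtml?ref=addr&refresh=t")

opts = Options()
opts.add_argument("--headless")   # no display needed on an Amazon/Azure worker
driver = webdriver.Chrome(options=opts)
try:
    driver.get(START)
    # The page builds its result links with JavaScript, so wait until
    # anchors actually exist in the DOM before reading them.
    WebDriverWait(driver, 30).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "a")))
    hrefs = [a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")]
    print(len(hrefs), "links found on the first results page")
finally:
    driver.quit()
```

Each worker would run one of these against its own slice of the URL space. That works, but a full browser per page is heavy, which is part of why the macro tools felt too slow for a million documents.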
An example is this US Census site (there are more efficient ways to get their data, but the style of the site is a good illustration of the volume of data and the kind of navigation involved):
http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?ref=addr&refresh=t