I haven't seen a solution for crawling and indexing entire sites that is as powerful as Scrapy in Python, so I personally use Scrapy to crawl websites.
But for scraping data from pages in Node.js there is CasperJS. It is a very nice solution that also works on AJAX websites, for example AngularJS pages. Scrapy on its own cannot parse AJAX pages, because it does not execute the page's JavaScript. So for scraping data from one or a few pages, I prefer CasperJS.
Cheerio is faster than CasperJS, but it does not work with AJAX pages, and it does not give you as good a code structure as CasperJS. So I prefer CasperJS, even in cases where the cheerio package would do.
CoffeeScript example:
    casper.start 'https://reports.something.com/login', ->
      this.fill 'form',
        username: params.username
        password: params.password
      , true

    casper.thenOpen queryUrl, {method:'POST', data:queryData}, ->
      this.click 'input'

    casper.then ->
      get = (number) =>
        value = this.fetchText("tr[bgcolor= '#AFC5E4'] > td:nth-of-type(#{number})").trim()
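The `get` helper in the snippet above builds one selector per column number and passes it to `fetchText`. As a rough sketch of that idea in plain JavaScript, the selector strings could be generated up front for a whole row. The names `cellSelector` and `rowSelectors` and the column names in the usage note are hypothetical, not part of CasperJS; only the selector itself comes from the example.

```javascript
// Hypothetical helper (not a CasperJS API): CSS selector for the n-th cell
// (1-based) of the highlighted row used in the example above.
function cellSelector(n) {
  // Reject anything that is not a positive whole number.
  if (typeof n !== 'number' || n % 1 !== 0 || n < 1) {
    throw new RangeError('column number must be a positive integer');
  }
  return "tr[bgcolor= '#AFC5E4'] > td:nth-of-type(" + n + ")";
}

// Hypothetical helper: map column names to selectors, in column order.
function rowSelectors(columnNames) {
  var selectors = {};
  columnNames.forEach(function (name, index) {
    selectors[name] = cellSelector(index + 1);
  });
  return selectors;
}
```

With made-up column names, `rowSelectors(['date', 'amount'])` gives an object whose `amount` entry is the selector for the second cell. Inside a `casper.then` step, each of those strings could then be passed to `this.fetchText(...)`.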