Scrapy as a tool for Node.js?

I would like to know if there is something like Scrapy for Node.js. If not, what do you think of simply loading pages and analyzing them with cheerio? Is there a better way?
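(For reference, the "simple page loading plus cheerio" approach the question refers to looks roughly like this minimal sketch; axios, the example URL, and the selector are placeholder choices, not part of the question:)

```js
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch a page over plain HTTP (no JavaScript execution) and parse the HTML.
async function scrape(url) {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);
  // Example: collect the text of every <h2> on the page.
  return $('h2').map((i, el) => $(el).text()).get();
}

scrape('https://example.com/').then(headings => console.log(headings));
```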

+7
4 answers

I have not seen a solution for crawling / indexing entire sites as powerful as Scrapy in Python, so I personally still use Scrapy when I need to crawl whole websites.

But for scraping data from pages in Node.js there is CasperJS. It is a very nice solution, and it also works on AJAX-heavy websites, for example pages built with AngularJS. Scrapy cannot render AJAX pages on its own. So for scraping data from one or a few pages, I prefer to use CasperJS.

Cheerio is faster than CasperJS, but it does not work with AJAX pages, and in my opinion it does not lead to as nicely structured code as CasperJS. So I prefer CasperJS, even though you could use the cheerio package instead.

CoffeeScript example:

```coffee
casper.start 'https://reports.something.com/login', ->
  # Log in by filling and submitting the form
  this.fill 'form',
    username: params.username
    password: params.password
  , true

casper.thenOpen queryUrl, { method: 'POST', data: queryData }, ->
  this.click 'input'

casper.then ->
  # Read the text of the n-th cell in the highlighted table row
  get = (number) =>
    value = this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(#{number})").trim()
```
+2

Scrapy is a library that adds asynchronous I/O to Python. The reason we don't have something similar for Node is that in Node all I/O is already asynchronous by default (unless you opt into the synchronous APIs).

Here's what a Scrapy-style script might look like in Node. Note that the URLs are processed concurrently.

```js
const cheerio = require('cheerio');
const axios = require('axios');

const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']

// this might be called a "middleware" in scrapy.
const get = async url => {
  const response = await axios.get(url)
  return cheerio.load(response.data)
}

// this too.
const output = item => {
  console.log(item)
}

// here is parse which is the initial scrapy callback
const parse = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
}

// and here is the main execution. We wrap it in an async function to allow await.
;(async function(){
  await Promise.all(
    startUrls.map(url => parse(url))
  )
})()
```
+1

Just in case you still need an answer: https://www.npmjs.org/package/scrapy. I have never tested it, but I think it can help. Happy scraping.

0

Some crawling features can be achieved with Google's Puppeteer; see the sketch after the list below. According to the documentation:

Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:

  • Generate screenshots and PDFs of pages.
  • Crawl a SPA (single-page application) and generate pre-rendered content (i.e. "SSR", server-side rendering).
  • Automate form submission, UI testing, keyboard input, etc.
  • Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
  • Capture a timeline trace of your site to help diagnose performance issues.
  • Test Chrome extensions.
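As a rough illustration (a minimal sketch; the URL, selector, and output file are placeholders, not from the original answer), crawling a JavaScript-rendered page with Puppeteer might look like this:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until network activity settles so client-side rendering has finished.
  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });

  // Read data out of the rendered DOM.
  const title = await page.title();
  const links = await page.$$eval('a', anchors => anchors.map(a => a.href));
  console.log(title, links.slice(0, 10));

  // Puppeteer can also produce screenshots (or PDFs) of the rendered page.
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();
```

Because the page is loaded in a real (headless) Chrome instance, this handles AJAX/SPA content that plain HTTP fetching plus cheerio cannot render.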
0
