Can a lot of data exceed the stack size in Node.js?

I am not very familiar with the internal workings of Node.js, but as far as I know, you get "Maximum call stack size exceeded" errors when you make too many nested function calls.

I'm writing a spider that follows links, and I started getting these errors after a random number of fetched pages. Node doesn't give you a stack trace when this happens, but I'm fairly sure I don't have any runaway recursion.

I use request to retrieve URLs, and I was using cheerio to parse the fetched HTML and discover new links. The stack overflows always occurred inside cheerio. When I switched from cheerio to htmlparser2, the errors disappeared. htmlparser2 is much lighter, since it simply emits an event for every open tag instead of parsing the whole document and building a tree.
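For comparison, here is a minimal sketch of that event-based style (the handler shape follows htmlparser2's Parser API; extractLinks is just an illustrative helper name):

var htmlparser = require('htmlparser2');

// Collect every href without ever building a DOM tree.
function extractLinks(html) {
    var hrefs = [];
    var parser = new htmlparser.Parser({
        // Called once per opening tag as the parser streams through the input.
        onopentag: function (name, attribs) {
            if (name === 'a' && attribs.href) hrefs.push(attribs.href);
        }
    });
    parser.write(html);
    parser.end();
    return hrefs;
}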

My theory is that cheerio was eating up the stack, but I'm not sure that's even possible?

Here is a simplified version of my code (it's for reading only, it won't run as-is):

var _ = require('underscore');
var fs = require('fs');
var urllib = require('url');
var request = require('request');
var cheerio = require('cheerio');

var mongo = "This is a global connection to mongodb.";
var maxConc = 7;

var crawler = {
    concurrent: 0,
    queue: [],
    fetched: {},

    fetch: function (url) {
        var self = this;

        self.concurrent += 1;
        self.fetched[url] = 0;

        request.get(url, { timeout: 10000, pool: { maxSockets: maxConc } },
            function (err, response, body) {
                self.concurrent -= 1;
                self.fetched[url] = 1;
                self.extract(url, body);
            });
    },

    extract: function (referrer, data) {
        var self = this;
        var urls = [];

        mongo.pages.insert({ _id: referrer, html: data, time: +(new Date) });

        /**
         * THE ERROR HAPPENS HERE, AFTER A RANDOM NUMBER OF FETCHED PAGES
         **/
        cheerio.load(data)('a').each(function () {
            var href = resolve(this.attribs.href, referrer); // resolves relative urls, not important

            // Save the href only if it hasn't been fetched, is not already
            // in the queue, and is not already on this page
            if (href && !_.has(self.fetched, href) && !_.contains(self.queue, href)
                    && !_.contains(urls, href))
                urls.push(href);
        });

        // Check the database to see if we already visited some of the urls.
        mongo.pages.find({ _id: { $in: urls } }, { _id: 1 }).toArray(function (err, results) {
            if (err) results = [];
            else results = _.pluck(results, '_id');

            urls = urls.filter(function (url) {
                return !_.contains(results, url);
            });

            self.push(urls);
        });
    },

    push: function (urls) {
        Array.prototype.push.apply(this.queue, urls);

        var url, self = this;

        while ((url = self.queue.shift()) && this.concurrent < maxConc) {
            self.fetch(url);
        }
    }
};

crawler.fetch('http://some.test.url.com/');
2 answers

It looks like you have some recursion going on there. Recursive calls eventually exceed the stack size, because that's where the call frames are stored.

So, here's how it goes:

  • fetch calls extract in the request.get callback
  • extract calls push in the mongo.pages.find callback
  • push calls fetch inside the while loop

This cycle seems to repeat until you run out of stack.
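To see the mechanism in isolation, here is a minimal sketch (illustrative functions named after your methods, not the actual crawler) of what such a synchronous call cycle does:

function fetch()   { extract(); }
function extract() { push(); }
function push()    { fetch(); }  // every call adds a frame; none ever returns

fetch(); // RangeError: Maximum call stack size exceeded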

In your case, the stack happens to be running very low by the time you call cheerio.load, so it runs out right then and there.

Although you most likely want to check whether this is a bug or behaviour you actually intended, to get the same effect in Node.js without using direct recursion you should use:

process.nextTick(functionToCall)

It lets the enclosing function return, which pops its frame off the stack, while scheduling functionToCall to run on the next tick.

You can try it in the node REPL:

process.nextTick(function () { console.log('hello'); })

will print 'hello' immediately.

It is roughly equivalent to setTimeout(functionToCall, 0), but preferable.
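You can see the scheduling difference directly; nextTick callbacks run before timers, so the output order below follows from Node's documented event-loop behaviour:

setTimeout(function () { console.log('timeout'); }, 0);
process.nextTick(function () { console.log('nextTick'); });
console.log('sync');

// prints:
//   sync
//   nextTick
//   timeout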

Regarding your code: you can replace self.fetch(url) with process.nextTick(function () { self.fetch(url); }) and you should no longer run out of stack space.
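Applied to the push method from the question, the change would look roughly like this (a sketch; note the extra closure, which is needed because url is a var that keeps changing while the loop runs):

push: function (urls) {
    Array.prototype.push.apply(this.queue, urls);

    var url, self = this;

    while ((url = self.queue.shift()) && this.concurrent < maxConc) {
        // Defer the call so the current call chain can unwind first.
        (function (u) {
            process.nextTick(function () { self.fetch(u); });
        })(url);
    }
}

Keep in mind that self.concurrent is now incremented one tick later, so the loop may schedule slightly more than maxConc fetches.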

That said, as noted above, it is more likely that there is a bug in your code, so look into that first.


You are decrementing self.concurrent -= 1; too soon; you should decrement it in extract, after all the asynchronous work is done. That's one thing that sticks out. Not sure whether it will fix the problem on its own, though.
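A sketch of that change, shown as fragments of the two methods (elided parts unchanged):

// fetch: stop decrementing in the request.get callback
request.get(url, { timeout: 10000, pool: { maxSockets: maxConc } },
    function (err, response, body) {
        self.fetched[url] = 1;
        self.extract(url, body); // self.concurrent -= 1 removed from here
    });

// extract: decrement once all the async work has finished
mongo.pages.find({ _id: { $in: urls } }, { _id: 1 }).toArray(function (err, results) {
    // ... filter urls as before ...
    self.concurrent -= 1; // the concurrency slot is only freed now
    self.push(urls);
});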


Source: https://habr.com/ru/post/1436029/

