Promises fail to complete due to running out of memory

I have a script to scrape ~1000 web pages. I use Promise.all to scrape them all, and it resolves when all the pages are done:

```js
Promise.all(urls.map(url => scrap(url)))
  .then(results => console.log('all done!', results));
```

This is nice and correct, with one exception: the machine runs out of memory because of the parallel requests. I use jsdom for scraping, and it quickly takes up several GB of memory, which is understandable given that it creates hundreds of `window` objects.

I have an idea for a fix, but I don't like it. That is, change the control flow so as not to use Promise.all, but chain my promises instead:

```js
let results = {};
urls.reduce((prev, cur) =>
  prev
    .then(() => scrap(cur))
    .then(result => results[cur] = result) // ^ not so nice
  , Promise.resolve())
  .then(() => console.log('all done!', results));
```

This is not as good as Promise.all ... it's inefficient because the requests are chained (run strictly one at a time), and the return values must be stored for later processing.

Any suggestions? Should I improve the control flow, reduce memory usage in scrap(), or is there a way to get Node to throttle its memory allocation?

1 answer

You are trying to run 1000 web scrapes in parallel. You will need to pick a number N significantly less than 1000 and run only N at a time so that you consume less memory. You can still use a promise to keep track of when everything is done.

Bluebird's Promise.map() can do this for you: just pass the concurrency value as an option. Or you could write it yourself.

I have an idea for a fix, but I don't like it. That is, change the control flow so as not to use Promise.all, but chain my promises instead:

What you want is N operations in flight at the same time. Sequencing is the special case where N = 1, which will often be much slower than running some of them in parallel (perhaps with N = 10).

This is not as good as Promise.all ... it's inefficient because it's chained, and the return values must be stored for later processing.

If the stored values are part of the memory problem, you may have to store them somewhere other than in memory. You will need to analyze how much memory the stored results consume.

Any suggestions? Should I improve the control flow, reduce memory usage in scrap(), or is there a way to get Node to throttle its memory allocation?

Use Bluebird's Promise.map() or write something similar yourself. Writing something that runs up to N operations in parallel and keeps all the results in order is not rocket science, but it is a little work to get right. I've presented it before in a different answer, but I can't seem to find it right now. I'll keep looking.
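A hand-rolled version might look like the sketch below. The function name `mapConcurrent` and its shape are my own; it starts up to `limit` promises, starts another each time one settles, and stores each result at its original index so the output order matches the input order.

```javascript
// Sketch of a concurrency-limited map: runs fn(item) for every item,
// with at most `limit` promises in flight at once. Resolves with the
// results in the original input order; rejects on the first error.
function mapConcurrent(items, limit, fn) {
  return new Promise((resolve, reject) => {
    const results = new Array(items.length);
    let next = 0;       // index of the next item to start
    let inFlight = 0;   // number of unsettled promises
    let failed = false;

    if (items.length === 0) {
      resolve(results);
      return;
    }

    function runNext() {
      // Top up to the concurrency limit.
      while (inFlight < limit && next < items.length) {
        const i = next++;
        inFlight++;
        Promise.resolve(fn(items[i])).then(result => {
          results[i] = result;
          inFlight--;
          if (next >= items.length && inFlight === 0) {
            resolve(results);
          } else {
            runNext();
          }
        }, err => {
          if (!failed) {
            failed = true;
            reject(err);
          }
        });
      }
    }
    runNext();
  });
}
```

With this, the original code becomes `mapConcurrent(urls, 10, scrap).then(results => console.log('all done!', results));`, so only 10 jsdom windows ever exist at the same time.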

Found my previous answer: Make several requests to an API that can only handle 20 requests per minute


Source: https://habr.com/ru/post/1272470/

