What is the best way to download a very large number of pages from a list of URLs?

I have >100,000 URLs (on different domains) in a list that I want to download and save to a database for further processing.

Would it be prudent to use Scrapy instead of Python multiprocessing / multithreading? If so, how can I write a standalone script that does this?

Also, feel free to suggest any other good approaches that come to mind.

+4
3 answers

Scrapy does not seem relevant here, since you already know exactly which URLs to fetch (there is no crawling to be done).

The simplest approach that comes to mind is to use Requests. However, requesting each URL in sequence and blocking while waiting for each response will not be efficient, so you could consider GRequests to send batches of requests asynchronously (a sketch follows below).
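
A minimal sketch of that approach, assuming GRequests is installed (pip install grequests) and that the URLs live in a plain-text file, one per line; the file name, batch size, and timeout are illustrative and not part of the original answer:

    # Fetch a large URL list in asynchronous batches with GRequests.
    import grequests

    with open("urls.txt") as f:                      # assumed format: one URL per line
        urls = [line.strip() for line in f if line.strip()]

    pending = (grequests.get(u, timeout=10) for u in urls)
    for resp in grequests.map(pending, size=50):     # at most 50 requests in flight
        if resp is not None and resp.ok:
            # store resp.url and resp.text in your database here
            print(resp.url, len(resp.text))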

+2

Most site owners will try to block you if you suddenly generate a high load on their servers.

So even if you have a fixed list of links, you will still need to control timeouts, handle HTTP response codes, rotate proxies, and so on, which is where Scrapy or Grab can help (a few relevant Scrapy settings are sketched below).
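
For illustration, a handful of Scrapy settings that address those concerns; the setting names are real Scrapy options, but the values are illustrative assumptions rather than recommendations from this answer:

    # Typical Scrapy settings for a large fixed-URL download job
    # (goes in settings.py or a spider's custom_settings).
    CONCURRENT_REQUESTS = 64              # overall concurrency
    CONCURRENT_REQUESTS_PER_DOMAIN = 8    # be gentle with each individual site
    DOWNLOAD_TIMEOUT = 15                 # seconds before giving up on a response
    RETRY_ENABLED = True
    RETRY_TIMES = 2                       # retry failed requests twice
    RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
    AUTOTHROTTLE_ENABLED = True           # back off automatically when servers slow down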

0

Scrapy is still an option.

  • Speed / Performance / Efficiency

    Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.

  • Database pipelining

    You mentioned that you want your data to end up in a database; as you may know, Scrapy has an Item Pipeline feature:

    After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

    Thus, each page can be written to the database immediately after it is downloaded (a standalone sketch combining a spider and such a pipeline follows after this list).

  • Code Organization

    Scrapy gives you a clean, well-organized project structure with logical places for settings, spiders, items, pipelines, and so on. This alone makes your code simpler and clearer.

  • Development time

    Scrapy does a lot of work for you behind the scenes. That lets you focus on your own code and logic rather than on the plumbing: spawning processes, managing threads, and so on.
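
To tie the points above together, here is a rough standalone sketch (not code from the original answer): a spider fed from a plain-text URL list, run via CrawlerProcess, with an item pipeline that writes each downloaded page into SQLite. The file names, table schema, and settings values are assumptions.

    # Standalone Scrapy script: download every URL in urls.txt and
    # store the page bodies in a SQLite database via an item pipeline.
    import sqlite3

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class PageItem(scrapy.Item):
        url = scrapy.Field()
        body = scrapy.Field()


    class SQLitePipeline:
        """Writes each scraped item into pages.db as soon as it arrives."""

        def open_spider(self, spider):
            self.conn = sqlite3.connect("pages.db")
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
            )

        def process_item(self, item, spider):
            self.conn.execute(
                "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
                (item["url"], item["body"]),
            )
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()


    class PageSpider(scrapy.Spider):
        name = "pages"

        def start_requests(self):
            with open("urls.txt") as f:          # assumed format: one URL per line
                for line in f:
                    url = line.strip()
                    if url:
                        yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Assumes text/HTML responses; binary content would need response.body instead.
            yield PageItem(url=response.url, body=response.text)


    if __name__ == "__main__":
        process = CrawlerProcess(
            settings={
                "ITEM_PIPELINES": {"__main__.SQLitePipeline": 300},
                "CONCURRENT_REQUESTS": 64,
                "DOWNLOAD_TIMEOUT": 15,
            }
        )
        process.crawl(PageSpider)
        process.start()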

But at the same time, Scrapy may be overkill. Remember that Scrapy was designed for (and is great at) crawling and scraping data from web pages. If you just want to download a bunch of pages without looking inside them, then yes, GRequests is a good alternative.

0

Source: https://habr.com/ru/post/1484782/

