What is the best way to download a very large number of pages from a list of URLs?

I have >100,000 URLs (on different domains) in a list that I want to download and save to a database for further processing.

Would it be prudent to use Scrapy instead of Python multiprocessing / multithreading? If so, how can I write a standalone script that does this?

Also, feel free to suggest any other good approaches that come to mind.

+4
3 answers

Scrapy does not seem relevant here, since you already know exactly which URLs to fetch (there is no crawling to be done).

The simplest approach that comes to mind is to use Requests. However, requesting each URL in sequence and blocking while waiting for each response will not be efficient, so you could consider GRequests to send batches of requests asynchronously (a sketch follows below).
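
A minimal sketch of that approach, assuming GRequests is installed (pip install grequests) and that the URLs live in a plain-text file, one per line; the file name, batch size, and timeout are illustrative and not part of the original answer:

    # Fetch a large URL list in asynchronous batches with GRequests.
    import grequests

    with open("urls.txt") as f:                      # assumed format: one URL per line
        urls = [line.strip() for line in f if line.strip()]

    pending = (grequests.get(u, timeout=10) for u in urls)
    for resp in grequests.map(pending, size=50):     # at most 50 requests in flight
        if resp is not None and resp.ok:
            # store resp.url and resp.text in your database here
            print(resp.url, len(resp.text))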

+2

Most site owners will try to block you if you suddenly generate a high load on their servers.

So even if you have a fixed list of links, you will still need to control timeouts, handle HTTP response codes, rotate proxies, and so on, which is where Scrapy or Grab can help (a few relevant Scrapy settings are sketched below).
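
For illustration, a handful of Scrapy settings that address those concerns; the setting names are real Scrapy options, but the values are illustrative assumptions rather than recommendations from this answer:

    # Typical Scrapy settings for a large fixed-URL download job
    # (goes in settings.py or a spider's custom_settings).
    CONCURRENT_REQUESTS = 64              # overall concurrency
    CONCURRENT_REQUESTS_PER_DOMAIN = 8    # be gentle with each individual site
    DOWNLOAD_TIMEOUT = 15                 # seconds before giving up on a response
    RETRY_ENABLED = True
    RETRY_TIMES = 2                       # retry failed requests twice
    RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
    AUTOTHROTTLE_ENABLED = True           # back off automatically when servers slow down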

0

Scrapy is still an option.

  • Speed / Performance / Efficiency

    Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.

  • Database pipelining

    You mentioned that you want your data to end up in a database; as you may know, Scrapy has an Item Pipeline feature:

    After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

    Thus, each page can be written to the database immediately after it is downloaded (a standalone sketch combining a spider and such a pipeline follows after this list).

  • Code Organization

    Scrapy gives you a clean, well-organized project structure with logical places for settings, spiders, items, pipelines, and so on. This alone makes your code simpler and clearer.

  • Development time

    Scrapy does a lot of work for you behind the scenes. That lets you focus on your own code and logic rather than on the plumbing: spawning processes, managing threads, and so on.
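
To tie the points above together, here is a rough standalone sketch (not code from the original answer): a spider fed from a plain-text URL list, run via CrawlerProcess, with an item pipeline that writes each downloaded page into SQLite. The file names, table schema, and settings values are assumptions.

    # Standalone Scrapy script: download every URL in urls.txt and
    # store the page bodies in a SQLite database via an item pipeline.
    import sqlite3

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class PageItem(scrapy.Item):
        url = scrapy.Field()
        body = scrapy.Field()


    class SQLitePipeline:
        """Writes each scraped item into pages.db as soon as it arrives."""

        def open_spider(self, spider):
            self.conn = sqlite3.connect("pages.db")
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
            )

        def process_item(self, item, spider):
            self.conn.execute(
                "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
                (item["url"], item["body"]),
            )
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()


    class PageSpider(scrapy.Spider):
        name = "pages"

        def start_requests(self):
            with open("urls.txt") as f:          # assumed format: one URL per line
                for line in f:
                    url = line.strip()
                    if url:
                        yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Assumes text/HTML responses; binary content would need response.body instead.
            yield PageItem(url=response.url, body=response.text)


    if __name__ == "__main__":
        process = CrawlerProcess(
            settings={
                "ITEM_PIPELINES": {"__main__.SQLitePipeline": 300},
                "CONCURRENT_REQUESTS": 64,
                "DOWNLOAD_TIMEOUT": 15,
            }
        )
        process.crawl(PageSpider)
        process.start()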

But at the same time, Scrapy may be overkill. Remember that Scrapy was designed for (and is great at) crawling and scraping data from web pages. If you just want to download a bunch of pages without looking inside them, then yes, GRequests is a good alternative.

0

Source: https://habr.com/ru/post/1484782/

