The Most Effective Way To Download Thousands Of Web Pages

I have several thousand items. For each item, I need to download a web page and then process it. The processing itself is not CPU-intensive.

Right now I am doing this synchronously with the WebClient class, but it takes too long. I am sure it can easily be parallelized or made asynchronous, but I'm looking for the most resource-efficient way to do it. There may be limits on the number of active web requests, so I don't like the idea of creating thousands of WebClient instances and starting an asynchronous operation on each of them, unless that isn't actually a problem.

Is it possible to use the Parallel Extensions and the Task class in C# 4?

Edit: Thanks for the answers. I was hoping for something based on asynchronous operations, because running a synchronous operation in parallel just blocks that thread.
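For illustration, a rough sketch of the kind of asynchronous approach I mean, using WebClient's event-based pattern with a cap on concurrent requests (the URL list, the cap of 10, and the Process method are placeholders):

using System;
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

class AsyncDownloader
{
    static ConcurrentQueue<string> urls;
    static CountdownEvent pending = new CountdownEvent(1);

    static void Main()
    {
        urls = new ConcurrentQueue<string>(new[] { "http://example.com/a", "http://example.com/b" });
        const int maxConcurrent = 10;
        // .NET allows only 2 concurrent HTTP connections per host by default;
        // raise the limit so the cap below is actually reachable.
        ServicePointManager.DefaultConnectionLimit = maxConcurrent;

        for (int i = 0; i < maxConcurrent; i++)
            StartNext();

        pending.Signal(); // release the initial count
        pending.Wait();   // block until every download has finished
    }

    static void StartNext()
    {
        string url;
        if (!urls.TryDequeue(out url))
            return;

        pending.AddCount();
        var client = new WebClient(); // one WebClient per in-flight request
        client.DownloadStringCompleted += (s, e) =>
        {
            if (e.Error == null)
                Process(e.Result); // the cheap per-page processing
            client.Dispose();
            StartNext();      // chain the next download before releasing this one
            pending.Signal();
        };
        client.DownloadStringAsync(new Uri(url));
    }

    static void Process(string html) { /* placeholder */ }
}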

+3
3 answers

You want to use a pattern called a producer/consumer queue. You enqueue all your URLs for processing and have consumer threads dequeue each URL (with appropriate locking), then download and process it.

As for how many consumers to use: try somewhere between 5 and 20 threads and experiment to find the sweet spot. The right number depends heavily on your hardware and connection; a P4 on dialup will saturate with far fewer threads than a modern machine on broadband.
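A minimal sketch of the pattern, assuming .NET 4's BlockingCollection as the queue (the URL list, the consumer count, and Process are placeholders; error handling is omitted):

using System;
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

class ProducerConsumer
{
    static void Main()
    {
        string[] allUrls = { "http://example.com/a", "http://example.com/b" };
        var queue = new BlockingCollection<string>();
        const int consumerCount = 10; // start in the 5-20 range and tune

        var consumers = new Thread[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = new Thread(() =>
            {
                using (var client = new WebClient()) // one client per consumer thread
                {
                    // Blocks until an item is available; exits once the
                    // queue is marked complete and drained.
                    foreach (string url in queue.GetConsumingEnumerable())
                        Process(client.DownloadString(url));
                }
            });
            consumers[i].Start();
        }

        // Producer: enqueue every URL, then signal that no more are coming.
        foreach (string url in allUrls)
            queue.Add(url);
        queue.CompleteAdding();

        foreach (var t in consumers)
            t.Join();
    }

    static void Process(string html) { /* placeholder */ }
}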

+1

Parallel.ForEach(yourUrlList, x => YourDownloadFunction(x));

It handles the concurrency and thread management for you.
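A minimal sketch of this approach; the URL list is a placeholder, and the explicit MaxDegreeOfParallelism cap is an assumption for bounding concurrent requests rather than something the answer specifies:

using System;
using System.Net;
using System.Threading.Tasks;

class ParallelDownloader
{
    static void Main()
    {
        string[] urls = { "http://example.com/a", "http://example.com/b" };

        var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
        Parallel.ForEach(urls, options, url =>
        {
            // WebClient is not thread-safe, so create one per iteration.
            using (var client = new WebClient())
                Process(client.DownloadString(url));
        });
    }

    static void Process(string html) { /* placeholder */ }
}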

+1

Use threads. Parallel.ForEach caps the number of threads based on how many cores/processors you have, but fetching a web page does not keep a thread fully busy for its whole duration: there are delays while waiting on requests (images, static content, etc.). So use threads to maximize speed. Start with 50 threads, then adjust from there to find out how many your machine can handle.
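A rough sketch of this suggestion: 50 worker threads draining a shared queue, each spending most of its time blocked on the network (the URLs and Process are placeholders):

using System;
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

class ThreadDownloader
{
    static void Main()
    {
        var urls = new ConcurrentQueue<string>(new[] { "http://example.com/a", "http://example.com/b" });
        const int threadCount = 50; // the suggested starting point; tune from here

        var workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                using (var client = new WebClient())
                {
                    string url;
                    // Each worker pulls URLs until the queue runs dry.
                    while (urls.TryDequeue(out url))
                        Process(client.DownloadString(url));
                }
            });
            workers[i].Start();
        }

        foreach (var t in workers)
            t.Join();
    }

    static void Process(string html) { /* placeholder */ }
}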

+1

Source: https://habr.com/ru/post/1793954/

