Parallel scraping in .NET

The company I work for has several hundred very dynamic websites. It decided to build a search engine, and I was tasked with writing a scraper. Some of the sites run on old hardware and cannot take much punishment, while others can handle a huge number of concurrent users.

What I need is a way to say: use 5 parallel requests for site A, 2 for site B, and 1 for site C.

I know I could use threads, mutexes, semaphores, etc. for this, but it would get quite complicated. Are any of the higher-level frameworks, like TPL async/await or TPL Dataflow, powerful enough to do this in an easier way?

2 answers

I recommend using HttpClient with Task.WhenAll, with a SemaphoreSlim for simple throttling:

private readonly SemaphoreSlim _mutex = new SemaphoreSlim(5);
private readonly HttpClient _client = new HttpClient();

private async Task<string> DownloadStringAsync(string url)
{
  // SemaphoreSlim's async acquire is WaitAsync (there is no TakeAsync)
  await _mutex.WaitAsync();
  try
  {
    return await _client.GetStringAsync(url);
  }
  finally
  {
    _mutex.Release();
  }
}

IEnumerable<string> urls = ...;
var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));

Alternatively, you can use TPL Dataflow and set MaxDegreeOfParallelism for throttling.
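Since the question asks for different limits per site, the single semaphore above can be extended to one SemaphoreSlim per host. A minimal sketch, assuming the three hostnames below stand in for sites A, B and C (they are placeholders, not from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class PerSiteThrottler
{
    // One semaphore per host, initialized with the per-site limits
    // the question mentions (5 / 2 / 1). Hostnames are hypothetical.
    private static readonly Dictionary<string, SemaphoreSlim> _limits =
        new Dictionary<string, SemaphoreSlim>
        {
            ["site-a.example.com"] = new SemaphoreSlim(5),
            ["site-b.example.com"] = new SemaphoreSlim(2),
            ["site-c.example.com"] = new SemaphoreSlim(1),
        };

    private static readonly HttpClient _client = new HttpClient();

    public static async Task<string> DownloadThrottledAsync(string url)
    {
        // Pick the limit for this URL's host before downloading.
        var limit = _limits[new Uri(url).Host];
        await limit.WaitAsync();
        try
        {
            return await _client.GetStringAsync(url);
        }
        finally
        {
            limit.Release();
        }
    }
}
```

Every URL still goes through the same Task.WhenAll pipeline; only the semaphore it waits on differs by host, so each site gets its own concurrency cap.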


TPL Dataflow and async-await are really powerful yet simple, and can do just what you need:

async Task<IEnumerable<string>> GetAllStringsAsync(IEnumerable<string> urls)
{
    var client = new HttpClient();
    var bag = new ConcurrentBag<string>();
    var block = new ActionBlock<string>(
        async url => bag.Add(await client.GetStringAsync(url)),
        new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 5});
    foreach (var url in urls)
    {
        block.Post(url);
    }
    block.Complete();
    await block.Completion;
    return bag;
}
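A variation on the block above, replacing the ActionBlock + ConcurrentBag pair with a TransformBlock so the downloaded strings flow out of the block itself rather than through shared mutable state. This is a sketch, not part of the original answer:

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class DataflowScraper
{
    // Downloads all URLs with at most 2 concurrent requests,
    // collecting results from the block's output queue.
    public static async Task<string[]> GetAllStringsViaTransformBlockAsync(
        IEnumerable<string> urls)
    {
        var client = new HttpClient();
        var block = new TransformBlock<string, string>(
            url => client.GetStringAsync(url),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

        foreach (var url in urls)
            block.Post(url);
        block.Complete();

        // Drain the output side until the block signals completion.
        var results = new List<string>();
        while (await block.OutputAvailableAsync())
            while (block.TryReceive(out var s))
                results.Add(s);

        await block.Completion;
        return results.ToArray();
    }
}
```

One block per site, each with its own MaxDegreeOfParallelism, would give the per-site limits the question asks about.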

Source: https://habr.com/ru/post/1530476/
