Faster way to download multiple files

I need to download about 2 million files from the SEC website. Each file has a unique URL and averages 10 KB. This is my current implementation:

    List<string> urls = new List<string>();
    // ... initialize urls ...
    WebBrowser browser = new WebBrowser();
    foreach (string url in urls)
    {
        browser.Navigate(url);
        while (browser.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();
        StreamReader sr = new StreamReader(browser.DocumentStream);
        // Save under the file-name portion of the URL.
        StreamWriter sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1));
        sw.Write(sr.ReadToEnd());
        sr.Close();
        sw.Close();
    }

The predicted time is about 12 days ... Is there a faster way?

Edit: BTW, processing a local file takes only 7% of the time.

Edit: This is my final implementation:

    void Main()
    {
        ServicePointManager.DefaultConnectionLimit = 10000;
        List<string> urls = new List<string>();
        // ... initialize urls ...
        int retries = urls.AsParallel().WithDegreeOfParallelism(8).Sum(arg => downloadFile(arg));
    }

    public int downloadFile(string url)
    {
        int retries = 0;
    retry:
        try
        {
            HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
            webrequest.Timeout = 10000;
            webrequest.ReadWriteTimeout = 10000;
            webrequest.Proxy = null;
            webrequest.KeepAlive = false;
            // Issue the request once and stream the body straight to disk.
            HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse();
            using (Stream sr = webresponse.GetResponseStream())
            using (FileStream sw = File.Create(url.Substring(url.LastIndexOf('/') + 1)))
            {
                sr.CopyTo(sw);
            }
        }
        catch (Exception ee)
        {
            // 404 and 403 are permanent failures: skip the file.
            if (ee.Message != "The remote server returned an error: (404) Not Found." &&
                ee.Message != "The remote server returned an error: (403) Forbidden.")
            {
                // Timeouts and dropped connections are transient: count and retry.
                if (ee.Message.StartsWith("The operation has timed out") ||
                    ee.Message == "Unable to connect to the remote server" ||
                    ee.Message.StartsWith("The request was aborted: ") ||
                    ee.Message.StartsWith("Unable to read data from the transport connection: ") ||
                    ee.Message == "The remote server returned an error: (408) Request Timeout.")
                    retries++;
                else
                    MessageBox.Show(ee.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
                goto retry;
            }
        }
        return retries;
    }
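Note: the retry test above compares exception message strings, which vary by locale and framework version. A minimal sketch of the same classification keyed off WebException status codes instead; the helper name and the exact mapping are assumptions, not part of the original post:

    // Requires: using System.Net;
    // Sketch only: mirrors the checks above (404/403 = give up,
    // timeouts and connection failures = retry).
    public bool ShouldRetry(WebException ex)
    {
        var response = ex.Response as HttpWebResponse;
        if (response != null)
        {
            if (response.StatusCode == HttpStatusCode.NotFound ||
                response.StatusCode == HttpStatusCode.Forbidden)
                return false;                                   // permanent failure: skip
            if (response.StatusCode == HttpStatusCode.RequestTimeout)
                return true;                                    // transient: retry
        }
        // Timeouts and dropped connections produce no HTTP response at all.
        return ex.Status == WebExceptionStatus.Timeout ||
               ex.Status == WebExceptionStatus.ConnectFailure ||
               ex.Status == WebExceptionStatus.ReceiveFailure ||
               ex.Status == WebExceptionStatus.RequestCanceled;
    }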
3 answers

Run the downloads concurrently rather than sequentially, and set a reasonable MaxDegreeOfParallelism; otherwise you will issue so many simultaneous requests that it looks like a DoS attack:

    public static void Main(string[] args)
    {
        var urls = new List<string>();
        Parallel.ForEach(
            urls,
            new ParallelOptions { MaxDegreeOfParallelism = 10 },
            DownloadFile);
    }

    public static void DownloadFile(string url)
    {
        using (var sr = new StreamReader(HttpWebRequest.Create(url).GetResponse().GetResponseStream()))
        // Save under the file-name portion of the URL.
        using (var sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1)))
        {
            sw.Write(sr.ReadToEnd());
        }
    }

Download the files on multiple threads. The number of threads depends on your bandwidth. Also, take a look at the WebClient and HttpWebRequest classes. A simple example:

    var list = new[]
    {
        "http://google.com",
        "http://yahoo.com",
        "http://stackoverflow.com"
    };

    // Parallel.ForEach blocks until all iterations finish and
    // returns a ParallelLoopResult.
    var loopResult = Parallel.ForEach(list, s =>
    {
        using (var client = new WebClient())
        {
            Console.WriteLine("starting to download {0}", s);
            string result = client.DownloadString(s);
            Console.WriteLine("finished downloading {0}", s);
        }
    });

I would run multiple WebClient downloads in parallel. I recommend setting the maximum degree of parallelism explicitly, since an unbounded degree of parallelism does not work well for long-running blocking tasks. I have used 50 concurrent downloads in one of my projects without problems, but depending on the speed of an individual download, a much lower number may suffice.
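A minimal sketch of that approach (the degree of parallelism of 50 mirrors the figure above; the file-naming scheme is an assumption carried over from the question):

    // Requires: using System.Collections.Generic; using System.Net;
    //           using System.Threading.Tasks;
    public static void DownloadAll(List<string> urls)
    {
        Parallel.ForEach(
            urls,
            new ParallelOptions { MaxDegreeOfParallelism = 50 }, // tune to your bandwidth
            url =>
            {
                using (var client = new WebClient())
                {
                    // Save under the file-name portion of the URL, as in the question.
                    string fileName = url.Substring(url.LastIndexOf('/') + 1);
                    client.DownloadFile(url, fileName);
                }
            });
    }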

If you download multiple files in parallel from the same server, you are by default limited to a small number (2 or 4) of parallel connections per host. Although the HTTP standard specifies such a low limit, many servers do not enforce it. Use ServicePointManager.DefaultConnectionLimit = 10000; to raise the limit.
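For example, set once before any request is issued (a sketch; 10000 is the value from the question, and the sec.gov URL is only a stand-in for whichever host you are downloading from):

    // Requires: using System; using System.Net;
    static void ConfigureConnectionLimits()
    {
        // Raise the global per-host cap (the default of 2 throttles parallel downloads).
        ServicePointManager.DefaultConnectionLimit = 10000;

        // Alternatively, raise it only for the one host being hit:
        ServicePointManager
            .FindServicePoint(new Uri("https://www.sec.gov/"))
            .ConnectionLimit = 50;
    }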

