# Loading a website into a string using C# WebClient or HttpWebRequest

I am trying to download the contents of a website. However, for one specific web page, the returned string contains jumbled data with many garbled characters.

Here is the code I originally used.

    HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
    req.Method = "GET";
    req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
    string source;
    using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
    {
        source = reader.ReadToEnd();
    }
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(source);

I also tried alternative implementations with WebClient, but still the same result:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    using (WebClient client = new WebClient())
    using (var read = client.OpenRead(url))
    {
        doc.Load(read, true);
    }

From searching around, I assume this may be an encoding problem, so I tried the solutions posted below, but still can't get it to work.

The offending page that I cannot download is the United_States article on the English Wikipedia (en.wikipedia.org/wiki/United_States). I tried a number of other Wikipedia articles and did not see this problem on any of them.

+6
3 answers

The response is gzip-encoded. To decode the stream, do the following:

UPDATE

Based on the comments by BrokenGlass, setting the following properties should solve your problem (worked for me):

    req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
    req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

Old/manual solution:

    string source;
    var response = req.GetResponse();
    var stream = response.GetResponseStream();
    try
    {
        if (response.Headers.AllKeys.Contains("Content-Encoding")
            && response.Headers["Content-Encoding"].Contains("gzip"))
        {
            stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
        }
        using (StreamReader reader = new StreamReader(stream))
        {
            source = reader.ReadToEnd();
        }
    }
    finally
    {
        if (stream != null)
            stream.Dispose();
    }
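The manual branch above relies on `GZipStream` to inflate the response body. A minimal, network-free round-trip sketch (the sample HTML string is made up for illustration) shows that same decompress step in isolation:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GzipRoundTrip
{
    static void Main()
    {
        string original = "<html><body>Hello, Wikipedia!</body></html>";

        // Compress the text the way a gzip-encoded HTTP body would arrive.
        byte[] compressed;
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            using (var writer = new StreamWriter(gzip, Encoding.UTF8))
            {
                writer.Write(original);
            }
            compressed = buffer.ToArray();
        }

        // Decompress exactly as in the manual solution: wrap the raw
        // stream in a GZipStream and read it with a StreamReader.
        string decompressed;
        using (var raw = new MemoryStream(compressed))
        using (var gzip = new GZipStream(raw, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            decompressed = reader.ReadToEnd();
        }

        Console.WriteLine(decompressed == original ? "round-trip OK" : "MISMATCH");
    }
}
```

If you skip the `GZipStream` wrapper and read the compressed bytes directly, you get exactly the kind of jumbled characters described in the question.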
+2

Using the built-in loader in HtmlAgilityPack worked for me:

    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
    string html = doc.DocumentNode.OuterHtml; // I don't see no jumbled data here

Edit:

Using the standard WebClient with your user-agent string fails with an HTTP 403 (forbidden) - using this one instead worked for me:

    using (WebClient wc = new WebClient())
    {
        wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
        string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
    }

Also see this SO thread: WebClient forbids opening Wikipedia page?
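Putting the two fixes from these answers together - a browser-like user-agent (to avoid Wikipedia's 403) plus automatic gzip/deflate decompression (to avoid the jumbled data) - a combined sketch could look like this. The helper method name is my own; no request is actually sent here:

```csharp
using System;
using System.Net;

class RequestSetup
{
    // Builds a request with both fixes applied: a browser-like
    // user-agent and automatic gzip/deflate decompression.
    // The URL is just a default taken from the question.
    static HttpWebRequest BuildRequest(string url = "http://en.wikipedia.org/wiki/United_States")
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        req.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
        req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return req;
    }

    static void Main()
    {
        var req = BuildRequest();
        Console.WriteLine(req.AutomaticDecompression);
        // To actually fetch the page:
        //   using (var resp = req.GetResponse())
        //   using (var reader = new System.IO.StreamReader(resp.GetResponseStream()))
        //       { string html = reader.ReadToEnd(); }
    }
}
```

With AutomaticDecompression set, the runtime adds the Accept-Encoding header and inflates the body for you, so the manual GZipStream handling from the accepted answer is no longer needed.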

+3

This is how I usually grab a page into a string (it's VB, but should translate easily):

    Dim req As Net.WebRequest = Net.WebRequest.Create("http://www.cnn.com")
    Dim resp As Net.HttpWebResponse = CType(req.GetResponse(), Net.HttpWebResponse)
    Dim sr As New IO.StreamReader(resp.GetResponseStream())
    Dim lcResults As String = sr.ReadToEnd()

and I've never had problems.

+1

Source: https://habr.com/ru/post/897843/
