I am trying to download the contents of a website. However, for one specific page, the returned string comes back garbled, full of strange characters.
Here is the code I originally used:
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";

string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    source = reader.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
I also tried an alternative implementation with WebClient, but got the same result:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
    doc.Load(read, true);
}
From searching around, I assume this may be an encoding problem, so I tried both of the encoding-related solutions I found, but still can't get it to work.
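The exact code came from the answers I found, so the following is only a rough sketch of one such attempt: the same request as above, but with the StreamReader forced to decode the response as UTF-8 instead of letting it detect the encoding (this assumes System.IO, System.Net and System.Text are in scope).

// Rough sketch of an encoding-related attempt: force UTF-8 when reading
// the response stream rather than relying on StreamReader's detection.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.Method = "GET";

string source;
using (WebResponse resp = req.GetResponse())
using (Stream stream = resp.GetResponseStream())
using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
{
    source = reader.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);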
The offending page that I cannot download is the United_States article on the English Wikipedia (en.wikipedia.org/wiki/United_States). I tried a number of other Wikipedia articles and did not see this problem with any of them.