As noted in the comments, the web browser approach looks difficult and will be subject to other environmental restrictions. Your best approach is to create a separate test repository for data cleaning, populated either on demand or, if you really need to (and the target data does not change), in advance using the spider approach.
Yes, different browsers will have problems with it if you try to implement it as an ActiveX control, and security settings may not allow it at all. There are a lot of factors; if your environment is not controlled, this is not a great option.
Assuming you go the on-demand route, I would strongly suggest creating a web service or class that you can reference. Then you can use an open-source parser on the server, for example:
- CsQuery, if the document is poorly formed, or
- Fizzler, if you can trust the integrity of the document.
Basically, you need to authenticate, save the authentication cookie, and then download the target document with a second request that carries that cookie. Feed the resulting page to your parser (CsQuery or Fizzler).
A login example might look like this:
    // Requires: using System; using System.Net; using System.Text;
    private HttpWebRequest PerformLoginRequest(CookieContainer container)
    {
        var request = (HttpWebRequest)WebRequest.Create(YOUR_POST_URL);
        request.Method = "POST";
        request.CookieContainer = container;
        request.ContentType = "application/x-www-form-urlencoded";

        _logger.DebugFormat("Attempting login for '{0}'", _username);

        // Assumes the un/pw is stored in a field. The values are URL-encoded so that
        // characters such as '&' or '=' cannot corrupt the form body.
        var credentials = string.Format("username={0}&password={1}",
            Uri.EscapeDataString(_username), Uri.EscapeDataString(_password));
        byte[] data = Encoding.ASCII.GetBytes(credentials);
        request.ContentLength = data.Length;

        try
        {
            // The using block closes the stream, so no explicit Close() is needed.
            using (var requestStream = request.GetRequestStream())
            {
                requestStream.Write(data, 0, data.Length);
            }
        }
        catch (Exception e)
        {
            _logger.Error("Error in login attempt.", e);
        }

        return request;
    }
A cookie will be set in the response that comes back, which you will need to parse into the cookie container so that subsequent requests carry the authentication bits. I had to do this myself and adapted code I found somewhere here on SO, though I can no longer find the link. It might look something like this (the header to look at is Set-Cookie):
    private static CookieContainer ProcessCookieContainer(HttpWebRequest request, CookieContainer container)
    {
        var response = (HttpWebResponse) request.GetResponse();
        var cookierReader = new StreamReader(response.GetResponseStream());
        string htmldoc = cookierReader.ReadToEnd();
        var cookieHeader = response.GetResponseHeader("Set-Cookie");
        response.Close();

        container = new CookieContainer();
        foreach (var cookie in cookieHeader.Split(','))
        {
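As an aside on this step: because the login request was handed the CookieContainer, HttpWebRequest should already add the response cookies to that container once GetResponse() is called, so manual parsing may not be needed at all. If you do have to work from the raw Set-Cookie value, CookieContainer.SetCookies can do the parsing for you. A minimal sketch, assuming the header value has the usual `name=value; attribute` form (CookieSketch, FromSetCookieHeader, and the example.com address are made-up names for illustration, not part of the code above):

```csharp
using System;
using System.Net;

public static class CookieSketch
{
    // Hypothetical helper: builds a fresh container from a raw Set-Cookie
    // header value, scoped to the site that issued the cookies.
    public static CookieContainer FromSetCookieHeader(Uri site, string setCookieHeader)
    {
        var container = new CookieContainer();

        // A response without a Set-Cookie header yields an empty string,
        // in which case there is nothing to add.
        if (!string.IsNullOrEmpty(setCookieHeader))
        {
            // SetCookies parses the header and stores the cookies for 'site'.
            container.SetCookies(site, setCookieHeader);
        }

        return container;
    }
}
```

You would call it with something like `CookieSketch.FromSetCookieHeader(new Uri("http://example.com/"), cookieHeader)` and then assign the result to the CookieContainer of the next request.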
And to download a document for analysis, you can do something like:
    public string GetValueFromSomePage(int first, string second)
    {
        var container = new CookieContainer();

        // do login
        var loginRequest = PerformLoginRequest(container);

        // chew on cookies
        container = ProcessCookieContainer(loginRequest, container);

        var result = string.Empty;

        // Build the URL from the method arguments (the host is a placeholder).
        var requestUrl = string.Format("http://YourUrlWithParams.com/?first={0}&second={1}",
            first, Uri.EscapeDataString(second));

        // A separate variable name avoids redeclaring the login request.
        var pageRequest = (HttpWebRequest)WebRequest.Create(requestUrl);
        pageRequest.CookieContainer = container;

        try
        {
            // The using blocks dispose the response and reader, so no Close() is needed.
            using (var serverResponse = (HttpWebResponse)pageRequest.GetResponse())
            using (var reader = new StreamReader(serverResponse.GetResponseStream()))
            {
                var responseDoc = new CQ(reader);

                // do something with CSS selectors...
                result = responseDoc["input[name=name]"].FirstElement().Value;
            }
        }
        catch (Exception e)
        {
            _logger.Error("Error fetching data.", e);
        }

        return result;
    }
Hope this helps. There are several moving parts here, but you probably expected that, given the nature of your task.
Greetings.