Building a list of 404s with Selenium

Is it possible to have Selenium crawl a TLD and incrementally export a list of the 404s it finds?

I'm stuck on a Windows machine for several hours and want to run some tests before returning to the comfort of *nix...

1 answer

I don't know Python or its commonly used libraries very well, but I would probably do something like this (using C# code as an example; the concept should translate):

```csharp
// WARNING! Untested code here. May not completely work, and
// is not guaranteed to even compile.

// Assume "driver" is a validly instantiated WebDriver instance
// (browser used is irrelevant). This API is driver.get in Python,
// I think.
driver.Url = "http://my.top.level.domain/";

// Get all the links on the page and loop through them,
// grabbing the href attribute of each link along the way.
// (Python would be driver.find_elements_by_tag_name)
List<string> linkUrls = new List<string>();
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
foreach (IWebElement link in links)
{
    // Nice side effect of getting the href attribute using GetAttribute()
    // is that it returns the full URL, not relative ones.
    linkUrls.Add(link.GetAttribute("href"));
}

// Now that we have all of the link hrefs, we can test to
// see if they're valid.
List<string> validUrls = new List<string>();
List<string> invalidUrls = new List<string>();
foreach (string linkUrl in linkUrls)
{
    HttpWebRequest request = WebRequest.Create(linkUrl) as HttpWebRequest;
    request.Method = "GET";

    // For actual .NET code, you'd probably want to wrap this in a
    // try-catch, and use a null check, in case GetResponse() throws,
    // or returns a type other than HttpWebResponse. For Python, you
    // would use whatever HTTP request library is common.
    // Note also that this is an extremely naive algorithm for determining
    // validity. You could just as easily check for the NotFound (404)
    // status code.
    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    if (response.StatusCode == HttpStatusCode.OK)
    {
        validUrls.Add(linkUrl);
    }
    else
    {
        invalidUrls.Add(linkUrl);
    }
}

foreach (string invalidUrl in invalidUrls)
{
    // Here is where you'd log out your invalid URLs
}
```
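
Since the question is about Python, here is a rough, untested translation of that sketch. It assumes the `selenium` package and an already-created `driver` instance; the URL checking part uses only the standard library, and the function names are my own:

```python
# Rough Python translation of the C# sketch above (a sketch, not a
# definitive implementation). `driver` is assumed to be a live
# Selenium WebDriver instance; URL checking uses only the stdlib.
from urllib.request import Request, urlopen
from urllib.error import HTTPError


def collect_link_urls(driver, page_url):
    """Load a page and return the absolute href of every <a> element."""
    driver.get(page_url)
    # Selenium 4 spells this driver.find_elements(By.TAG_NAME, "a").
    links = driver.find_elements_by_tag_name("a")
    # get_attribute("href") returns full URLs, not relative ones.
    return [a.get_attribute("href") for a in links if a.get_attribute("href")]


def check_url(url):
    """Return the HTTP status code for a GET of `url` (e.g. 200, 404)."""
    try:
        with urlopen(Request(url, method="GET")) as response:
            return response.getcode()
    except HTTPError as err:
        return err.code  # 4xx/5xx responses raise HTTPError


def partition_urls(urls, status_of=check_url):
    """Split `urls` into (valid, invalid); anything non-200 counts as invalid."""
    valid, invalid = [], []
    for url in urls:
        (valid if status_of(url) == 200 else invalid).append(url)
    return valid, invalid
```

Passing the status checker as a parameter keeps the classification logic testable without hitting the network.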

You now have lists of valid and invalid URLs. You could wrap all of this in a method that takes the TLD URL and call it recursively with each of the valid URLs. The key point is that you are not using Selenium to determine the validity of the links. And you don't want to "click" the links to navigate to the next page if you are doing a recursive crawl; rather, you want to navigate directly to the links found on the page.
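The recursive combination described above might look like the following sketch. The link-gathering and status-checking steps are injected as callables (hypothetical signatures of my own), which keeps the traversal logic independent of Selenium and of any HTTP library:

```python
# Sketch of the recursive crawl described above (my own structure,
# not from the original answer). get_links(url) yields the absolute
# link URLs found on a page; get_status(url) returns the HTTP status
# code for a GET of that URL.
def crawl(url, get_links, get_status, seen=None):
    """Recursively visit `url`; return the set of linked URLs that 404."""
    if seen is None:
        seen = set()
    not_found = set()
    for link in get_links(url):
        if link in seen:
            continue  # avoid revisiting pages and infinite loops
        seen.add(link)
        status = get_status(link)
        if status == 404:
            not_found.add(link)
        elif status == 200:
            # Navigate directly to valid links rather than "clicking" them.
            not_found |= crawl(link, get_links, get_status, seen)
    return not_found
```

The `seen` set is essential: without it, any pair of pages that link to each other would recurse forever.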

There are other approaches you could take, such as running everything through a proxy and capturing the response codes that way. It depends a bit on how you plan to structure your solution.


Source: https://habr.com/ru/post/1439514/

