The SEO aspect is usually found in the words of the URL, so you can probably ignore any parts that are numeric. Typically, SEO is applied to a group of similar content that shares a common base URL, for example:
Base www.domain.ext/articles, with full URL examples:
- www.domain.ext/articles/2011/06/15/man-bites-dog
- www.domain.ext/articles/2010/12/01/beauty-not-only-shallow
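As a rough sketch of how the common base could be separated from the rest of each URL, the snippet below compares the path "folders" of a group of URLs position by position. The helper names `path_segments` and `common_base` are made up for this illustration.

```python
from urllib.parse import urlsplit

def path_segments(url):
    """Return the non-empty path segments ("folders") of a URL."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    if "//" not in url:
        url = "//" + url
    return [s for s in urlsplit(url).path.split("/") if s]

def common_base(urls):
    """Longest run of leading folders shared by every URL in the group."""
    base = []
    for column in zip(*(path_segments(u) for u in urls)):
        if len(set(column)) != 1:
            break
        base.append(column[0])
    return base

urls = [
    "www.domain.ext/articles/2011/06/15/man-bites-dog",
    "www.domain.ext/articles/2010/12/01/beauty-not-only-shallow",
]
base = common_base(urls)
suffixes = [path_segments(u)[len(base):] for u in urls]
print(base)         # ['articles']
print(suffixes[0])  # ['2011', '06', '15', 'man-bites-dog']
```

Everything after the shared folders is what gets examined further.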
So the SEO aspect of the URL is a suffix. The algorithm to apply types each "folder" after the common base, assigning it a "data type" (numeric, text, or alphanumeric), and then scores it as follows:
- The HTTP response code is 200: this should be obvious, but a 404 could redirect you to www.domain.ext/errors/file-not-found, which would pass the other checks listed.
- Non-numeric, delimited, spell-checked: the delimiters are usually hyphens, underscores, or spaces. Take each word and run a spell check on it. The check passes if the words are valid, including proper names.
- Spell-checked text appears on the page: if the text passes the spell check, analyze the contents of the page to see whether it appears there.
- Spell-checked URL text appears on the page inside a tag: if the previous check is true, check additionally whether the entire text sits inside an HTML tag.
- The tag is important: if the previous check is true and the tag is <title> or <h#>.
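The checks above could be sketched roughly as follows. This is one interpretation, not a definitive implementation: `KNOWN_WORDS` is a tiny stand-in for a real spell checker, `score_folder` and `TextByTag` are names invented here, and the status code and HTML are passed in rather than fetched.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for a real spell checker's dictionary.
KNOWN_WORDS = {"man", "bites", "dog", "beauty", "not", "only", "shallow"}
IMPORTANT_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class TextByTag(HTMLParser):
    """Collects (innermost open tag, normalised text) pairs."""
    def __init__(self):
        super().__init__()
        self.stack, self.chunks = [], []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.lower().split())
        if text:
            self.chunks.append((self.stack[-1] if self.stack else None, text))

def score_folder(folder, status_code, html):
    """Score one URL folder from 0 to 5, one point per check passed in order."""
    if status_code != 200:                        # check 1: HTTP 200
        return 0
    score = 1
    words = [w for w in re.split(r"[-_ ]+", folder.lower()) if w]
    if not words or not all(w in KNOWN_WORDS for w in words):
        return score                              # check 2: spell check
    score += 1
    parser = TextByTag()
    parser.feed(html)
    page_text = " ".join(text for _, text in parser.chunks)
    if not all(re.search(rf"\b{re.escape(w)}\b", page_text) for w in words):
        return score                              # check 3: words on the page
    score += 1
    phrase = " ".join(words)
    tags = [tag for tag, text in parser.chunks if phrase in text]
    if not tags:
        return score                              # check 4: whole text in one tag
    score += 1
    if any(tag in IMPORTANT_TAGS for tag in tags):
        score += 1                                # check 5: <title> or <h#>
    return score

html = "<html><head><title>Man bites dog</title></head><body><p>A man was bitten.</p></body></html>"
print(score_folder("man-bites-dog", 200, html))  # 5
print(score_folder("2011", 200, html))           # 1
```

Numeric folders fail the spell check here simply because digits are not in the word list, which matches the idea of ignoring numeric parts.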
Typically, with this approach you will get a maximum score of 5, unless more than one folder in the URL meets the criteria, with higher values being better. Now, you can probably improve on this with a Bayesian probability approach that uses the checks above as features (i.e., each detects the occurrence of some phenomenon) of a URL, plus some other clever features you come up with. But then you have to train the algorithm, which may not be worth it.
Now, based on your example, you also want to capture situations where the URL was designed so that a crawler will index it, because the query parameters are now part of the URL path. In that case you can still type the suffix folders to obtain data type patterns (in your example, that the common prefix is always followed by an integer) and score those URLs as SEO-friendly too.
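A minimal sketch of that typing step, assuming the common base is already known: each folder after the base is reduced to a data type, and the dominant pattern in the group is reported. The paths below are hypothetical, as are the function names.

```python
import re
from collections import Counter

def segment_type(segment):
    """Classify one folder as 'numeric', 'text', or 'alphanumeric'."""
    if segment.isdigit():
        return "numeric"
    if re.fullmatch(r"[A-Za-z]+(?:[-_][A-Za-z]+)*", segment):
        return "text"
    return "alphanumeric"

def type_pattern(path, base):
    """Type signature of the folders that follow the common base."""
    segments = [s for s in path.split("/") if s]
    return tuple(segment_type(s) for s in segments[len(base):])

def dominant_pattern(paths, base):
    """Most common type pattern in a group and the share of paths matching it."""
    counts = Counter(type_pattern(p, base) for p in paths)
    pattern, hits = counts.most_common(1)[0]
    return pattern, hits / len(paths)

# Hypothetical paths where a former query parameter (an id) became a folder.
paths = ["/products/1042", "/products/77", "/products/90210"]
pattern, share = dominant_pattern(paths, base=["products"])
print(pattern, share)  # ('numeric',) 1.0
```

If nearly every URL in the group shares a pattern such as a single integer after the prefix, that group can be treated as designed for crawlers even though it contains no words to spell check.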