URL rewrite detection (SEO URL)

How can a client determine whether a server uses search engine optimization techniques, such as mod_rewrite, to implement "SEO-friendly URLs"?

For instance:

Normal URL: http://somedomain.com/index.php?type=pic&id=1

SEO-friendly URL: http://somedomain.com/pic/1

+4
7 answers

Since mod_rewrite works on the server side, the client cannot detect it with certainty.

The only thing you can do on the client side is look for hints:

  • Is the HTML generated dynamically, i.e. does it change between calls? Then /pic/1 has to be handled by some script and is most likely not a real file path.
  • As other answers mention: are there <link rel="canonical"> tags? Then the website wants to tell search engines which of several URLs with the same content they should use.
  • Change parts of the URL and see whether you get a 404. In /pic/1 I would change the "1".
    If there is no mod_rewrite, the server itself returns a 404. If there is, the error is handled by the server-side script, which may return a 404 but in most cases returns an error page with status 200.
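
A minimal sketch of how these three hints could be collected, using only the Python standard library. The helper names (`find_canonical`, `mutate_last_segment`, `collect_hints`) and the bogus segment value are made up for illustration, and the result is only a set of hints, not proof of rewriting:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit
import urllib.error
import urllib.request


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attrs.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def mutate_last_segment(url, replacement="no-such-item-zz9"):
    """Swap the last path segment for a value that should not exist."""
    parts = urlsplit(url)
    segments = parts.path.rstrip("/").split("/")
    segments[-1] = replacement
    return urlunsplit(parts._replace(path="/".join(segments)))


def fetch(url):
    """Return (status, body) without raising on HTTP error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def collect_hints(url, fetcher=fetch):
    """Gather the three client-side hints for one URL."""
    _, body = fetcher(url)
    _, body_again = fetcher(url)
    bogus_status, _ = fetcher(mutate_last_segment(url))
    return {
        "changes_between_calls": body != body_again,
        "canonical": find_canonical(body),
        # A 200 for a nonsense segment suggests a script handled the URL.
        "bogus_status": bogus_status,
    }
```

Calling `collect_hints("http://somedomain.com/pic/1")` would issue three requests. The `fetcher` parameter exists only so the logic can be tried without network access.
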
+5

The SEO aspect is usually found in the words in the URL, so you can probably ignore any parts that are numeric. Typically, SEO is applied to a group of similar content that shares a common base URL, for example:

Base www.domain.ext/articles, with full URL examples:

  • www.domain.ext/articles/2011/06/15/man-bites-dog
  • www.domain.ext/articles/2010/12/01/beauty-not-only-shallow

So the SEO aspect of the URL is the suffix. The algorithm to apply classifies each "folder" after the common base, assigning it a "data type" (numeric, text, alphanumeric), and then scores it as follows:

  • HTTP response code is 200: should be obvious, but a 404 could end up at www.domain.ext/errors/file-not-found, which would pass the other checks listed.
  • Non-numeric, delimited, spell-checked: delimiters are usually hyphens, underscores, or spaces. Take each word and spell-check it to see whether the words are valid, including proper names.
  • Spell-checked URL text appears on the page: if the text passes the spell check, analyze the page content to see whether it appears there.
  • Spell-checked URL text appears on the page inside a tag: if the previous check is true, check further whether the entire text is inside a single HTML tag.
  • The tag is important: if the previous check is true and the tag is <title> or <h#>.

Typically, with this approach you get a maximum score of 5, unless more than one folder in the URL meets the criteria, with higher scores being better. You could probably improve this with a Bayesian approach that uses the checks above as features (i.e., to detect the occurrence of the phenomenon) for URLs, plus some other clever features you come up with. But then you would need to train the algorithm, which may not be worth it.
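
The five checks could be sketched roughly like this for a single folder. Everything here is an assumption made for illustration: the function names, one point per check, the small word set standing in for a real spell checker, and the choice to let a single undelimited word count as text:

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTextCollector(HTMLParser):
    """Records (enclosing tag, normalized text) for every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop everything up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.lower().split())
        if text:
            self.chunks.append((self.stack[-1] if self.stack else "", text))


def split_words(segment):
    """Split a folder name on hyphens, underscores, spaces, or plus signs."""
    return [w for w in re.split(r"[-_ +]+", segment.lower()) if w]


def score_segment(segment, status, page_html, dictionary):
    """Return 0..5 points for one folder of the URL suffix."""
    score = 1 if status == 200 else 0                      # check 1
    words = split_words(segment)
    if segment.isdigit() or not words or not all(w in dictionary for w in words):
        return score
    score += 1                                             # check 2
    collector = TagTextCollector()
    collector.feed(page_html)
    all_text = " ".join(text for _, text in collector.chunks)
    page_words = set(re.findall(r"[a-z']+", all_text))
    if not all(w in page_words for w in words):
        return score
    score += 1                                             # check 3
    phrase = " ".join(words)
    tags = [tag for tag, text in collector.chunks if phrase in text]
    if not tags:
        return score
    score += 1                                             # check 4
    if any(re.fullmatch(r"title|h[1-6]", tag) for tag in tags):
        score += 1                                         # check 5
    return score
```

Check 4 here only matches when the phrase sits in one uninterrupted text node, so a heading broken up by inline tags such as <b> would be missed.
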

Now, based on your example, you also want to catch cases where the URL was designed so that a crawler will index it, because the query parameters are now part of the path. In that case you can still classify the suffix folders to obtain a pattern of data types (in your example, a common prefix that is always followed by an integer) and score those URLs as SEO-friendly too.
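
A rough sketch of that data-type pattern idea. The function names and the `base_depth` parameter (how many leading folders count as the common base) are hypothetical, and the URLs must include a scheme so that `urlsplit` separates the host from the path:

```python
import re
from urllib.parse import urlsplit


def segment_type(segment):
    """Classify one folder as numeric, text, or alphanumeric."""
    if segment.isdigit():
        return "numeric"
    if re.fullmatch(r"[A-Za-z]+(?:[-_][A-Za-z]+)*", segment):
        return "text"
    return "alphanumeric"


def url_pattern(url, base_depth=1):
    """Keep the base folders literally, replace the rest by their types."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    base = tuple(segments[:base_depth])
    return base + tuple(segment_type(s) for s in segments[base_depth:])


def shared_pattern(urls, base_depth=1):
    """Return the pattern if every URL has the same one, else None."""
    patterns = {url_pattern(u, base_depth) for u in urls}
    return patterns.pop() if len(patterns) == 1 else None
```

For http://somedomain.com/pic/1 and similar URLs this yields ("pic", "numeric"), which matches the "common prefix followed by an integer" case.
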

+3

I assume you will be using some variant of curl.

You can try sending the same request with different User-Agent values.

That is, send the request once with a Mozilla/5.0 user agent and a second time with the Googlebot user agent. If the server does something special for web crawlers, the responses should differ.
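
The same comparison sketched in Python's standard library instead of curl. The function names are made up for this example, and the Googlebot string is the commonly published one, which may not match what the crawler sends today:

```python
import urllib.error
import urllib.request

BROWSER_UA = "Mozilla/5.0"
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def fetch_as(url, user_agent):
    """Return (status, final URL, body bytes) for one User-Agent value."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return resp.status, resp.geturl(), resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.geturl(), err.read()


def compare_user_agents(url, fetcher=fetch_as):
    """Request the URL as a browser and as Googlebot and report differences."""
    browser = fetcher(url, BROWSER_UA)
    crawler = fetcher(url, GOOGLEBOT_UA)
    return {
        "status_differs": browser[0] != crawler[0],
        "final_url_differs": browser[1] != crawler[1],
        "body_differs": browser[2] != crawler[2],
    }
```

A differing body alone proves little, since dynamic pages can change between any two calls. A different status or redirect target is the stronger hint.
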

+1

With today's frameworks and their URL routing, mod_rewrite is not even required to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would serve such URLs to all visitors, crawlers or not. Perhaps you could fake some bot headers to pretend you're a well-known crawler and see whether anything changes. Dunno how legal that is, tbh.

0

For a dynamic URL pattern, it is better to use <link rel="canonical" href="..." /> on the other duplicates.

0

It is better to use a "canonical" tag on all URLs to avoid any duplication.

0

Source: https://habr.com/ru/post/1337524/

