Block Googlebot from URLs containing a specific word

My client has some pages that they don't want Google to index. They are all of the form

http://example.com/page-xxx

so they are /page-123 or /page-2 or /page-25, etc.

Is there any way to stop Google from indexing any page that starts with /page-xxx using robots.txt?

Will something like this work?

Disallow: /post-*

thanks

+6
3 answers

First, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line rather than "post"?

Disallow says, in effect, "disallow URLs that begin with this text." So your example line will block any URL that begins with "/post-" (that is, a file in the root directory whose name begins with "post-"). The asterisk in this case is redundant; it's implied.

Your question isn't clear about where the pages live. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, it's a bit harder.
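
For the simple root-directory case, a minimal robots.txt sketch (my addition, assuming the pages really do all sit at the root and you want the rule applied to every crawler) would be:

# Blocks /page-123, /page-2, /page-25, etc.
User-agent: *
Disallow: /page-

This is plain prefix matching, so it doesn't depend on wildcard support and should behave the same in any crawler that honors robots.txt.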

As @user728345 noted, the easiest way (from a robots.txt standpoint) is to gather all the pages you don't want crawled into one directory and disallow access to that. But I understand that you can't move all these pages.

For Googlebot and other bots that support the same wildcard semantics (a surprising number of them do, including mine), the following should work:

Disallow: /*page-

This will match anything that contains "page-" anywhere. However, it will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:

Disallow: */page-
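
Putting the wildcard version together with a user-agent line, a hedged sketch (assuming you only want this rule for Googlebot, since it relies on Google's extended pattern matching) would be:

User-agent: Googlebot
# Matches any path containing "page-", e.g. /page-123,
# but also /test/thispage-123.html as noted above
Disallow: /*page-

Crawlers that don't support the * extension will most likely treat /*page- as a literal prefix that matches nothing, so for them the line effectively does nothing.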

+13

It looks like the * will work as a wildcard for Google, so your Disallow will keep Google from crawling those pages; however, wildcards are not supported by all other spiders. You can search Google for robots.txt wildcards for more information. I would look at http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.
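
Since wildcard support varies between spiders, one possible arrangement (my sketch, not part of the original answer) is to give Googlebot the wildcard rule and give everyone else a plain prefix rule:

# Googlebot understands the * extension
User-agent: Googlebot
Disallow: /*page-

# Other crawlers get simple prefix matching
# (this only catches pages directly under the root)
User-agent: *
Disallow: /page-

As I understand it, a crawler obeys only the group whose User-agent line matches it most specifically, so Googlebot would follow the first block and ignore the second.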

Then I pulled this from Google's documentation:

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.

To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?

To indicate a match for the end of the URL, use $. For example, to block any URLs that end in .xls:

User-agent: Googlebot
Disallow: /*.xls$

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
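
To make the interaction concrete, here is an annotated sketch of how those two rules play out (the example URLs are hypothetical ones of my own):

User-agent: *
# http://example.com/page?       -> allowed (ends with ?, matches Allow: /*?$)
Allow: /*?$
# http://example.com/page?sid=7  -> blocked (characters after the ?, so only Disallow: /*? matches)
Disallow: /*?

As I understand it, Googlebot resolves conflicts by applying the most specific (longest) matching rule, which is why the Allow wins for URLs that end in a bare ?.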

Save your robots.txt file by downloading the file or copying the contents into a text file and saving it as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.

Note: From what I've read, this is a Google-only approach. Officially, there are no wildcards allowed in robots.txt for disallowing.

+1

You could put all the pages that you don't want visited into a folder, and then use Disallow to tell the bots not to visit pages in that folder.

Disallow: /private/
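
For completeness, a full file for that approach (assuming the pages could actually be moved under a /private/ directory, which the asker said may not be possible) would just be:

User-agent: *
# Nothing under /private/ will be crawled
Disallow: /private/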

I'm not very experienced with robots.txt, so I'm not sure how to use wildcards like that. It says, "you cannot use wildcard patterns or regular expressions in either the User-agent or Disallow lines." http://www.robotstxt.org/faq/robotstxt.html

0

Source: https://habr.com/ru/post/893774/

