How do I block malicious bots from crawling my site?

How can I stop unidentified bad bots from crawling my site? Some bad bots, whose names do not appear in cPanel's Apache listing, are eating up my site's bandwidth.

I tried robots.txt (batgap.com/robots.txt) and also tried blocking them with .htaccess, but bandwidth usage has not improved. I do not know the IP addresses of these bots, so I cannot block them by IP. They consume so much of the site's bandwidth that I have to keep increasing the bandwidth allocation on the server.

+6
4 answers

I am from Incapsula and we deal with bad bots on a regular basis.

We recently released a bot study that provides insight into the extent of the problem (http://www.incapsula.com/the-incapsula-blog/item/225-what-google-doesnt-show-you-31-of-website-traffic-can-harm-your-business) and in light of this data, I have to agree with @Leonard Challis - you simply cannot handle bot protection manually.

Having said that, there are bot protection solutions, even free ones (ours included), that can help you with bad bots.

BTW - As you already mentioned, one byproduct of unwanted bot visits is wasted bandwidth. We recently found out how surprisingly HUGE bot-related bandwidth usage is. This is an interesting topic in itself. We believe that by keeping bad bot traffic out, hosting providers could significantly increase their efficiency (and hopefully use the savings to reduce prices or improve services). Once you understand the social and business consequences of this, you can see the real scope of the bad bot problem, which goes beyond the immediate damage.

+3

Unfortunately, robots.txt is sometimes ignored by these "bad bots", although if the problem is more with genuine search engine crawlers that you don't want visiting, those should honor it. I believe cPanel gives you access to the web server (Apache) logs? There you can look for two things: IP and User-Agent. You can find the culprits there and add them to your robots.txt and .htaccess. Please note that .htaccess rules denying IP addresses are much better than relying on robots.txt alone, because you take the choice out of the bot creator's hands.
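As a rough illustration of the log check described above, here is a minimal PHP sketch that totals hits and bytes per IP / User-Agent pair. It assumes your logs use Apache's "combined" format; the function name `summarizeAccessLog` is made up for this example, and lines that do not match the pattern (including requests containing escaped quotes) are simply skipped.

```php
<?php
// Sketch only: assumes Apache "combined" log lines of the form
// ip ident user [time] "request" status bytes "referer" "user-agent"

/**
 * Totals hits and bytes per (IP, User-Agent) pair.
 *
 * @param string[] $lines Raw access-log lines.
 * @return array[] Rows with keys ip, agent, hits, bytes; largest byte total first.
 */
function summarizeAccessLog(array $lines)
{
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-) "[^"]*" "([^"]*)"/';
    $totals = array();

    foreach ($lines as $line) {
        if (!preg_match($pattern, $line, $m)) {
            continue; // not a combined-format line; ignore it
        }
        $key = $m[1] . '|' . $m[3];
        if (!isset($totals[$key])) {
            $totals[$key] = array('ip' => $m[1], 'agent' => $m[3], 'hits' => 0, 'bytes' => 0);
        }
        $totals[$key]['hits']++;
        // Apache writes "-" when no response body was sent; count it as 0.
        $totals[$key]['bytes'] += ($m[2] === '-') ? 0 : (int) $m[2];
    }

    $rows = array_values($totals);
    usort($rows, function ($a, $b) {
        return $b['bytes'] - $a['bytes']; // descending by bytes
    });
    return $rows;
}
```

You could feed it a log file with something like `summarizeAccessLog(file('/path/to/access_log', FILE_IGNORE_NEW_LINES))` (the path is a placeholder) and look at the first few rows for candidates to block.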

If you know which specific bots are doing this, you should be able to find their IP addresses and user agents on forums, but if it is a more general problem, I'm afraid it means more manual work.

There are other methods that can be used with varying effectiveness, for example mod_security (http://www.askapache.com/htaccess/modsecurity-htaccess-tricks.html), but this means you will need access to your web server configuration.

Finally, you can check the links pointing to your website (using Google's link: operator). Sometimes, if you are linked from spam forums or the like, this can increase the likelihood that bots will come to you. You may be able to look at the referrer URL in the Apache logs, but this all rests on a lot of assumptions, and you would be lucky if it had a big effect.

+1

I block "bad bots" with PHP. I filter on the IP address first and then on the User-Agent. I force the bad bot to wait up to 999 seconds and then return a very small web page. Usually (in fact always) the connection times out and zero (0) bytes are sent. Best of all, I delay them for several minutes before they move on to the next victim. http://gelm.net/How-to-block-Baidu-with-PHP.htm
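A minimal sketch of that idea (IP check first, then User-Agent, then stall) might look like the following. This is not the code from the linked page: the function names `isBadBot` and `tarpit`, the lists you pass in, and the fixed 999-second default are all assumptions for illustration.

```php
<?php
// Sketch only: names, lists and delay are illustrative assumptions.

/**
 * Returns true when the IP starts with a banned prefix, or the User-Agent
 * contains a banned keyword (case-insensitive). The IP is checked first.
 */
function isBadBot($ip, $userAgent, array $bannedIpPrefixes, array $bannedAgents)
{
    foreach ($bannedIpPrefixes as $prefix) {
        if (strpos($ip, $prefix) === 0) { // prefix match, e.g. '192.0.2.'
            return true;
        }
    }
    foreach ($bannedAgents as $keyword) {
        if (stripos($userAgent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

/**
 * Stalls the request, then sends a one-byte page and stops the script.
 */
function tarpit($seconds = 999)
{
    set_time_limit(0);                 // lift PHP's execution time limit
    sleep($seconds);                   // keep the bot waiting
    header('Content-Type: text/plain');
    echo 'x';                          // the "very small web page"
    exit;
}

// Hypothetical wiring at the top of an entry script:
// $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// if (isBadBot($_SERVER['REMOTE_ADDR'], $ua, array('192.0.2.'), array('Baiduspider'))) {
//     tarpit();
// }
```

Keep in mind that every stalled request occupies a PHP worker for the whole delay, so a long wait can cost your own server capacity if many bots arrive at once. End IP prefixes with a period, otherwise `'4'` would also match `41.x.x.x`.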

+1

Block unwanted spiders / robots / visitors via PHP

Instructions:

Put the following PHP code at the top of your index.php file.

The idea here is to place the code on the main home page of the PHP site, the main entry point to the site.

If you have other PHP files that are accessed directly through a URL (not counting PHP include or support files), put the code at the beginning of those files as well. For most PHP sites and PHP-based CMS sites, the root index.php file is the main entry point to the site.

Keep in mind that your site's statistics, e.g. AWStats, will still register hits under an unknown robot (identified by "bot" followed by a space or one of the following characters: _ + : , . ; / \ -), but these bots will be blocked from accessing your site.

<?php
// ---------------------------------------------------------------------------
// Banned IP Addresses and Bots - Redirects banned visitors who make it past
// the .htaccess and/or robots.txt files to a URL.
// The $banned_ip_addresses array can contain both full and partial IP
// addresses, i.e. Full = 123.456.789.101, Partial = 123.456.789. or 123.456. or 123.
// Use partial IP addresses to include all IP addresses that begin with that
// partial address. Partial IP addresses must end with a period.
// The $banned_bots, $banned_unknown_bots, and $good_bots arrays should contain
// keyword strings found within the User Agent string.
// The $banned_unknown_bots array is used to identify unknown robots (identified
// by 'bot' followed by a space or one of the following characters _+:,.;/\-).
// The $good_bots array contains keyword strings used as exemptions when checking
// for $banned_unknown_bots. If you do not want to use the $good_bots array, i.e.
// $good_bots = array(), then you must remove the keyword strings
// 'bot.','bot/','bot-' from the $banned_unknown_bots array or else the good
// bots will also be banned.
// ---------------------------------------------------------------------------
$banned_ip_addresses = array('41.','64.79.100.23','5.254.97.75','148.251.236.167','88.180.102.124','62.210.172.77','45.','195.206.253.146');
$banned_bots = array('.ru','AhrefsBot','crawl','crawler','DotBot','linkdex','majestic','meanpath','PageAnalyzer','robot','rogerbot','semalt','SeznamBot','spider');
$banned_unknown_bots = array('bot ','bot_','bot+','bot:','bot,','bot;','bot\\','bot.','bot/','bot-');
$good_bots = array('Google','MSN','bing','Slurp','Yahoo','DuckDuck');
$banned_redirect_url = 'http://english-1329329990.spampoison.com';

// Visitor IP address and Browser (User Agent)
$ip_address = $_SERVER['REMOTE_ADDR'];
// Some clients send no User-Agent header; fall back to an empty string.
$browser = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Declared Temporary Variables
$ipfound = $piece = $botfound = $gbotfound = $ubotfound = '';

// Checks for Banned IP Addresses and Bots
if($banned_redirect_url != ''){
    // Checks for Banned IP Address (exact match first, then partial prefixes)
    if(!empty($banned_ip_addresses)){
        if(in_array($ip_address, $banned_ip_addresses)){$ipfound = 'found';}
        if($ipfound != 'found'){
            $ip_pieces = explode('.', $ip_address);
            foreach ($ip_pieces as $value){
                $piece = $piece.$value.'.';
                if(in_array($piece, $banned_ip_addresses)){$ipfound = 'found'; break;}
            }
        }
        if($ipfound == 'found'){header("Location: $banned_redirect_url"); exit();}
    }

    // Checks for Banned Bots
    if(!empty($banned_bots)){
        foreach ($banned_bots as $bbvalue){
            $pos1 = stripos($browser, $bbvalue);
            if($pos1 !== false){$botfound = 'found'; break;}
        }
        if($botfound == 'found'){header("Location: $banned_redirect_url"); exit();}
    }

    // Checks for Banned Unknown Bots (skipped when a good-bot keyword matches)
    if(!empty($good_bots)){
        foreach ($good_bots as $gbvalue){
            $pos2 = stripos($browser, $gbvalue);
            if($pos2 !== false){$gbotfound = 'found'; break;}
        }
    }
    if($gbotfound != 'found'){
        if(!empty($banned_unknown_bots)){
            foreach ($banned_unknown_bots as $bubvalue){
                $pos3 = stripos($browser, $bubvalue);
                if($pos3 !== false){$ubotfound = 'found'; break;}
            }
            if($ubotfound == 'found'){header("Location: $banned_redirect_url"); exit();}
        }
    }
}
// ---------------------------------------------------------------------------
?>
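Before deploying lists like these, it can help to see how a given User-Agent string would fall through the checks. The sketch below mirrors the order used in the script above (banned keywords first, then the good-bot exemption, then the generic "bot" markers) but only returns a label instead of redirecting. The function name `classifyUserAgent` and its return labels are invented for this example.

```php
<?php
// Sketch only: a dry-run classifier that follows the same check order as
// the script above. Name and labels are illustrative assumptions.

function classifyUserAgent($userAgent, array $bannedBots, array $unknownBotMarkers, array $goodBots)
{
    // Case-insensitive "contains any of these keywords" helper.
    $containsAny = function ($haystack, array $needles) {
        foreach ($needles as $needle) {
            if (stripos($haystack, $needle) !== false) {
                return true;
            }
        }
        return false;
    };

    if ($containsAny($userAgent, $bannedBots)) {
        return 'banned';        // explicit keyword match wins, even for good bots
    }
    if ($containsAny($userAgent, $goodBots)) {
        return 'allowed';       // exempt from the generic "bot" rule only
    }
    if ($containsAny($userAgent, $unknownBotMarkers)) {
        return 'unknown-bot';   // e.g. "SomethingBot/1.0"
    }
    return 'allowed';
}
```

For example, with the arrays from the script, a Googlebot User-Agent comes back as `allowed`, an AhrefsBot one as `banned`, and an unlisted `ExampleBot/1.0` as `unknown-bot`. Note that the banned keywords are checked before the good-bot exemption, so a broad keyword such as 'spider' or 'crawl' will still block a good bot whose User-Agent contains it.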
+1

Source: https://habr.com/ru/post/912046/

