This is a hard problem, and no solution I know of is 100% effective from both the bot-protection and usability standpoints. If an attacker is really determined to run a bot against your site, they will probably manage it. You can lock things down so far that it becomes impractical for any computer program to do anything on your site, but at that point hardly any humans will want to use it either. Still, you can strike a good balance.
My perspective on this comes partly from being a web developer, but mostly from the other side: I have written many web crawler/bot programs for clients around the world. Not all bots have malicious intentions; they can be used to automate tedious tasks such as submitting forms to populate a database of doctors' addresses, or analyzing stock market data. If your site is well designed from a usability standpoint, there should rarely be a need for a bot to "simplify" things for the user, but there are special needs you cannot plan for.
Of course, there are also people with malicious intentions, and you definitely want to protect your site from them as much as possible. There is practically no site that cannot somehow be automated; most sites are simply not that complicated. But here are some ideas, drawn from my own point of view, from other answers and comments on this page, and from my experience writing (non-malicious) bots.
Types of bots
First I should mention that I would put bots into two different categories:
- General-purpose crawlers, indexers, or bots
- Purpose-built bots designed specifically for your site to perform certain tasks
Typically, a general-purpose bot will be something like a search engine indexer, or perhaps a hacker's script that looks for any form to submit, uses a dictionary attack to find vulnerable URLs, and so on. They may also target sites built on common engines, such as WordPress blogs. If your site is properly secured with good passwords and the like, these usually won't pose much of a risk to you (unless you do run WordPress, in which case you have to keep up with the latest versions and security updates).
Purpose-built bots are the kind I used to write. A bot designed specifically for your site can be made to act very much like a human user, including inserting time delays between form submissions, setting cookies, and so on, which makes it hard to detect. For the most part, that is the kind of bot I am talking about in the rest of this answer.
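To illustrate the point, here is a minimal sketch of such a bot in Python, with hypothetical URLs and form fields. The session object persists cookies like a real browser, and the random delays make the timing look like a person typing:

```python
# A minimal sketch of a human-mimicking bot (hypothetical URLs and
# form fields). The session persists cookies like a browser, and
# random delays make the timing look like a person typing.
import random
import time

import requests

session = requests.Session()
session.post("https://example.com/login",
             data={"user": "alice", "password": "hunter2"})

for record in [{"name": "Dr. Smith"}, {"name": "Dr. Jones"}]:
    session.post("https://example.com/submit", data=record)
    time.sleep(random.uniform(5, 20))  # pause 5-20 s between submissions
```

Nothing here looks unusual in a server log except, perhaps, the total volume over time.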
CAPTCHAs
Captchas are probably the most common approach to making sure a user is human, and they are generally difficult to get around automatically. However, if you only require a captcha as a one-time step, for example when the user creates an account, it is easy for a human to get past it once and then hand their shiny new account credentials over to a bot that automates everything else.
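If you do use captchas, verify them server-side. As a sketch, here is what the server half might look like against Google's reCAPTCHA "siteverify" endpoint; the secret key is a placeholder and the surrounding plumbing is up to your framework:

```python
# Sketch of server-side CAPTCHA verification against Google's
# reCAPTCHA "siteverify" endpoint. RECAPTCHA_SECRET is a placeholder;
# response_token is the value the user's browser posted with the form.
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder

def captcha_passed(response_token: str, remote_ip: str) -> bool:
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET,
              "response": response_token,
              "remoteip": remote_ip},
        timeout=5,
    )
    return resp.json().get("success", False)
```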
I remember reading a few years ago about a rather clever system for "cracking" captchas on a popular gaming site: a separate site was set up that downloaded captchas from the gaming site and presented them to its own users, essentially crowd-sourcing the solving. Users of the second site received some kind of reward for each captcha they solved correctly, and the owners were able to automate tasks on the gaming site using the crowd-sourced answers.
As a rule, a good captcha system can guarantee exactly one thing: somewhere, at some point, a human typed in the text from the image. What happens before and after that depends on how often you require the check, and on how determined the bot writer is.
Cell phone / credit card verification
If you do not want to use captchas, this kind of verification is likely to be quite effective against all but the most determined bot writers. As with captchas, it will not prevent an already-verified user from creating and then using a bot, but it does let you confirm that a human created the account, and if the account is abused, you can block that phone or credit card number from being used to create another one.
Sites like Facebook and Craigslist have started using cell phone verification to prevent bot spam. To create applications on Facebook, for example, you must have a phone number on record, confirmed via text message or an automated phone call. Unless your attacker has access to a lot of active phone numbers, this can be an effective way to verify that a human created the account, and that each person can only create a limited number of accounts (one, for most people).
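A minimal sketch of what the verification flow might look like, with an in-memory code store and a stubbed-out send_sms helper standing in for a real SMS gateway (a real system would persist codes with expiry times):

```python
# Sketch of SMS account verification. send_sms is a stub standing in
# for a real SMS gateway; codes live in memory here, with no expiry.
import secrets

pending_codes = {}  # phone number -> one-time code

def send_sms(number: str, text: str) -> None:
    print(f"(stub) SMS to {number}: {text}")  # replace with a gateway call

def start_verification(phone: str) -> None:
    code = f"{secrets.randbelow(10**6):06d}"  # random six-digit code
    pending_codes[phone] = code
    send_sms(phone, f"Your verification code is {code}")

def check_verification(phone: str, submitted: str) -> bool:
    return pending_codes.pop(phone, None) == submitted  # single use
```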
Credit cards can similarly be used to confirm that a human is performing an action and to limit the number of accounts a single person can create.
Other [less effective] solutions
Log analysis
Analyzing your request logs will often reveal bots repeating the same actions over and over, or using dictionary attacks to probe for holes in your site's configuration. So your logs can tell you, after the fact, whether a request was made by a bot or a human. That may or may not be useful to you, but if the requests came from an account verified by cell phone or credit card, you can lock the account associated with the abusive requests to prevent further abuse.
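As a sketch, assuming Apache-style access logs, a few lines of Python can surface both patterns: IPs hammering a single URL, and IPs generating lots of 404s (a common signature of a dictionary attack against URL paths). The thresholds here are arbitrary and would need tuning for your traffic:

```python
# Sketch: flag suspicious IPs in an Apache-style access log.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)[^"]*" (\d{3})')

hits = Counter()       # (ip, path) -> request count
not_found = Counter()  # ip -> 404 count

with open("access.log") as f:
    for line in f:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, path, status = m.groups()
        hits[(ip, path)] += 1
        if status == "404":
            not_found[ip] += 1

for (ip, path), n in hits.most_common(10):
    if n > 1000:  # arbitrary threshold: one URL hammered repeatedly
        print(f"{ip} requested {path} {n} times - possible bot")

for ip, n in not_found.most_common(10):
    if n > 100:   # arbitrary threshold: lots of 404s from one source
        print(f"{ip} caused {n} 404s - possible dictionary attack")
```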
Math problems / other questions
Math problems and other simple questions can be answered with a quick Google or Wolfram Alpha query, which a bot can easily automate. Some questions are harder than others, but the big search companies are working against you here, making their engines ever better at understanding exactly this kind of question, which in turn makes this a less and less viable way to verify that a user is human.
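To see how little protection this offers, here is a sketch of a bot forwarding the challenge straight to Wolfram|Alpha's Short Answers API ("DEMO-APPID" is a placeholder for a real AppID):

```python
# Sketch: a bot forwarding a challenge question to Wolfram|Alpha's
# Short Answers API. "DEMO-APPID" is a placeholder for a real AppID.
import requests

def solve_challenge(question: str) -> str:
    resp = requests.get("https://api.wolframalpha.com/v1/result",
                        params={"appid": "DEMO-APPID", "i": question},
                        timeout=5)
    return resp.text

print(solve_challenge("What is 7 plus 5?"))  # -> "12"
```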
Hidden form fields
Some sites use a mechanism in which parameters such as the coordinates of the mouse when the user clicks the submit button are added to the form via JavaScript. These are mostly very easy to fake, but if you see a whole bunch of requests in your logs all using identical coordinates, they are probably from a bot (though a smart bot could easily send different coordinates with each request).
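The server-side check might look something like this sketch, assuming hypothetical click_x/click_y fields set by JavaScript on submit:

```python
# Sketch: flag submissions whose (hypothetical) click_x/click_y fields
# never vary. A human's clicks drift by a few pixels; a naive bot's don't.
from collections import Counter

recent_coords = Counter()

def looks_like_bot(form: dict) -> bool:
    coords = (form.get("click_x"), form.get("click_y"))
    if None in coords:
        return True  # the fields were never set: no JavaScript ran
    recent_coords[coords] += 1
    return recent_coords[coords] > 100  # arbitrary threshold
```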
JavaScript cookies
Since most bots do not download or run JavaScript, cookies set via JavaScript rather than through the HTTP Set-Cookie header will make life a little harder for most would-be bot writers. But it is not that hard for a bot to set the cookie manually once its developer figures out how to generate the same value that the JavaScript produces.
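One way to structure the server side of such a check (the scheme and names here are illustrative, not a standard): render a per-session token into the page, have inline JavaScript copy it into a cookie, and verify that the cookie comes back matching:

```python
# Sketch of the server side of a JavaScript-set cookie check. The page
# embeds expected_token(session_id); inline JavaScript (not shown)
# writes it to a "js_token" cookie. Clients that never run the script
# never send the cookie back. All names here are illustrative.
import hashlib
import hmac

SERVER_SECRET = b"change-me"  # placeholder

def expected_token(session_id: str) -> str:
    # Tie the token to the session so it can't be replayed elsewhere.
    return hmac.new(SERVER_SECRET, session_id.encode(),
                    hashlib.sha256).hexdigest()

def ran_javascript(session_id: str, cookies: dict) -> bool:
    return hmac.compare_digest(cookies.get("js_token", ""),
                               expected_token(session_id))
```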
IP address
An IP address alone will not tell you whether a user is human. Some sites use IP addresses to try to detect bots, and it is true that a naive bot may show up as a pile of requests from a single IP address. But IP addresses are cheap: with Amazon EC2 or similar cloud services you can spin up a server and use it as a proxy. Or spin up 10 or 100 of them and rotate through them as proxies.
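For example, a bot built on the Python requests library can rotate through a proxy pool in a couple of lines (the addresses are placeholders), so no single IP accumulates the traffic:

```python
# Sketch: a bot rotating requests through a pool of proxies
# (placeholder addresses), so no single IP accumulates the traffic.
import itertools

import requests

proxy_pool = itertools.cycle([
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
])

for _ in range(30):
    proxy = next(proxy_pool)
    requests.get("https://example.com/some-page",
                 proxies={"http": proxy, "https": proxy},
                 timeout=10)
```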
UserAgent String
The user-agent string is so easy to set in a crawler that you cannot count on it to flag a bot that is trying to avoid detection. It is trivial to set the UserAgent to the exact string one of the major browsers sends, and a bot can even rotate between several of them.
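For example, with the Python requests library it is one header (the strings below are abbreviated examples of real browser user agents):

```python
# Sketch: sending (and rotating) a mainstream-browser UserAgent from a
# Python crawler. The strings are abbreviated examples.
import random

import requests

BROWSER_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

resp = requests.get("https://example.com",
                    headers={"User-Agent": random.choice(BROWSER_UAS)})
```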
Complex markup
The most difficult site I ever wrote a bot for consisted of frames inside frames inside frames... about 10 layers deep on each page, where every frame's src pointed to the same database controller, just with different parameters determining which actions to perform. The order of the actions mattered, so it was hard to keep track of everything that was going on, but eventually (after a week or so) my bot worked. So while complex markup may deter some bot writers, it will not stop everyone, and it will probably make your own site harder to maintain.
Disclaimer and conclusion
Not all bots are bad. Most of the crawlers/bots I wrote were for users who wanted to automate certain processes on a site, such as data entry that was too tedious to do by hand. So make tedious tasks easy! Or provide an API for your users. Offering an API is probably one of the easiest ways to discourage people from writing bots against your site: if an API exists, it is much less likely anyone will bother building a scraper for it, and you can use API keys to control how heavily each consumer uses it.
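Rate-limiting by API key does not need to be fancy; here is a sketch with an in-memory per-key daily quota (a real service would persist the counters, and the limit here is illustrative):

```python
# Sketch: per-key daily quota, kept in memory (a real service would
# persist counters). DAILY_LIMIT and key handling are illustrative.
import time
from collections import defaultdict

DAILY_LIMIT = 10_000
usage = defaultdict(lambda: {"day": None, "count": 0})

def allow_request(api_key: str, valid_keys: set) -> bool:
    if api_key not in valid_keys:
        return False
    today = time.strftime("%Y-%m-%d")
    record = usage[api_key]
    if record["day"] != today:  # new day: reset the counter
        record["day"], record["count"] = today, 0
    record["count"] += 1
    return record["count"] <= DAILY_LIMIT
```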
For stopping spammers, the most effective approach is probably a combination of captchas plus account verification via cell phone numbers or credit cards. Add some log analysis to identify and disable any malicious purpose-built bots, and you should be in pretty good shape.