Other answers and comments say that programming is not the best solution to this problem. I agree with them. This question should probably be migrated to Moderators - Stack Exchange or Webmasters - Stack Exchange.
Since this is Stack Overflow, my answer will be based on computer programming.
If you want to use str_replace, do something like this. For the sake of this post, since some people are offended by real swear words, let's pretend that these are bad words: 'fug', 'schnitt', 'dam'.
    $text = str_ireplace(" fug ", "(Offensive words detected & removed!)", $text);
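Please note: this uses str_ireplace, not str_replace; the i stands for case-insensitive. The spaces around the word matter: without them the search would erroneously match "fuggedaboudit", for example, and with them it misses the word at the start or end of the text or next to punctuation.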
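To make the trade-off concrete, here is a small sketch comparing the unpadded and the space-padded search. The sample sentence and the `***` replacement are made up for illustration.

```php
<?php
// Hypothetical sample text, using the pretend bad word 'fug'.
$sample = "Fug that. I said fug off, fuggedaboudit.";

// No padding: every occurrence is replaced, even inside longer words.
$unpadded = str_ireplace("fug", "***", $sample);

// Space padding: only occurrences with a space on both sides are replaced.
$padded = str_ireplace(" fug ", " *** ", $sample);

echo $unpadded, "\n"; // *** that. I said *** off, ***gedaboudit.
echo $padded, "\n";   // Fug that. I said *** off, fuggedaboudit.
```

Neither variant is right: the first mangles an innocent word, the second misses the hit at the very start of the text.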
If you want to do a more reliable job, you need to use a regular expression.
    $bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";
    $hit_words = array("fug", "schnitt", "dam"); // these words are 'hits' that we need to replace. hit words...

    // this prepares the regex, requires PHP 5.3+ I think.
    array_walk($hit_words, function (&$value, $key) {
        // \b means word boundary, like space, line-break, period, dash, and many others.
        // Prevents "refudgee" from being matched when searching for "fudge"
        $value = '~\b' . preg_quote($value, '~') . '\b~i';
    });
    /*print_r($hit_words);*/

    $good_words = array("fudge", "shoot", "dang");
    $good_text = preg_replace($hit_words, $good_words, $bad_text); // does all search/replace actions at once
    echo '<br />' . $good_text . '<br />';
This will do all your searches / replacements at once. The two arrays must contain the same number of elements, with each search term lined up with its replacement. It will not match parts of words, only whole words. And, of course, determined users will still find ways to get swearing onto your site. But it will stop the lazy ones.
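As a rough sketch (not part of the code above), the two parallel arrays can be folded into one associative array, so each hit word sits next to its replacement and the counts cannot drift apart. The function name `censor` and the `$map` parameter are made up for illustration, and the type declarations assume PHP 7+.

```php
<?php
// Hypothetical helper: $map pairs each hit word with its replacement.
function censor(string $text, array $map): string
{
    $patterns = array();
    foreach (array_keys($map) as $word) {
        // Cast to string because PHP turns numeric-looking keys into ints.
        // \b restricts the match to whole words; the i flag ignores case.
        $patterns[] = '~\b' . preg_quote((string) $word, '~') . '\b~i';
    }
    // preg_replace returns null on a regex error; fall back to the input.
    return preg_replace($patterns, array_values($map), $text) ?? $text;
}

$map = array('fug' => 'fudge', 'schnitt' => 'shoot', 'dam' => 'dang');
echo censor('Fug dis schnitt, a schnitter never dam wins', $map), "\n";
// fudge dis shoot, a schnitter never dang wins
```

Note that the replacement does not preserve the original capitalization, and "schnitter" is left alone because the word boundary fails after "schnitt".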
I decided to add some links to sites that evidently use programming as a first pass at eliminating profanity. I will add more as I stumble upon them. Besides Yahoo:
1.) Dell.com replaces matching words with <profanity deleted>. http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx
2.) Watson, the supercomputer, apparently developed a swearing problem. How do you tell the difference between swearing and slang? It seems to be so complicated that the researchers simply decided to wipe it all out. But they could have just used a list of swear words (exact matching is a subset of regex, I would say) and banned their use. That is how it works in real life, anyway. Watson Develops Profanity
3.) The "Content compliance" section of the Gmail user settings in Apps for Business:
- Add expressions that describe the content you want to search for in each message
The "Expressions" used can be of several types, including "Advanced content match", which, among other things, lets you select "Match type" options very similar to what you would have in an Excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is empty, all of which presumably use regex. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. Thus, the mighty Google implements regular expression filtering options for its business users. Why do this when regex is supposedly so inefficient? Because it is really quite effective. It is a simple, fast, programming solution that will fail only when people go out of their way to get around it.
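To illustrate why those match types are plausibly regex underneath, here is a sketch that translates a few of the labels into PCRE patterns. This is not Google's implementation; the function name `matchTypePattern` and its parameters are invented for the example, which assumes PHP 7+.

```php
<?php
// Hypothetical translation of "Match type" labels into PCRE patterns.
function matchTypePattern(string $type, array $terms): string
{
    // Escape each term so it is matched literally.
    $quoted = array_map(function ($t) {
        return preg_quote($t, '~');
    }, $terms);
    $first = $quoted[0] ?? '';

    switch ($type) {
        case 'Starts with':
            return '~\A' . $first . '~i';
        case 'Ends with':
            return '~' . $first . '\z~i';
        case 'Contains':
            return '~' . $first . '~i';
        case 'Equals':
            return '~\A' . $first . '\z~i';
        case 'Is empty':
            return '~\A\z~';
        case 'Matches any word':
            return '~\b(?:' . implode('|', $quoted) . ')\b~i';
        case 'Matches all words':
            // One lookahead per word; every one must succeed.
            $lookaheads = '';
            foreach ($quoted as $q) {
                $lookaheads .= '(?=.*\b' . $q . '\b)';
            }
            return '~\A' . $lookaheads . '~is';
        default:
            throw new InvalidArgumentException("Unknown match type: $type");
    }
}

var_dump(preg_match(matchTypePattern('Starts with', array('buy')), 'Buy now!') === 1);
var_dump(preg_match(matchTypePattern('Matches all words', array('lowest', 'prices')), 'Prices: the lowest in town') === 1);
```

"Not contains" and "Not matches regex" would simply negate the result of `preg_match` for the corresponding positive pattern.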
Beyond this list, I wonder if anyone else has noticed the similarity between profanity filtering and spam filtering. Obviously, regex is used in both arenas, but nitpickers who learned that "all regex is bad" will always downvote any answer to any question if regex is even mentioned. Try googling "how spam filters work." You will get results like this one, which covers SpamAssassin: http://www.seas.upenn.edu/cets/answers/spamblock-filter.html
Another example where I'm sure regex is used is communicating through Amazon.com's Amazon Marketplace. You receive emails at your regular email address. So, naturally, when you reply to the seller, your email program will include all kinds of sender information, such as email addresses and anything you enter into the body. But Amazon.com strips these out "for your protection." Can I find a way around this regex? Probably, but it would take more trouble than it is worth, so it is effective to a degree. They also store the emails for 2 years, presumably so that a human can go through them in case of any fraud allegations.
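As a guess at what such stripping could look like (this is not Amazon's code), a single regex can redact anything shaped like an email address. The pattern is a deliberately simple approximation, and the function name `redactEmails` and the placeholder text are made up.

```php
<?php
// Hypothetical redaction of email-like strings; the pattern is a rough
// approximation and does not cover every valid address.
function redactEmails(string $body): string
{
    $pattern = '~[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}~i';
    // preg_replace returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '[email removed]', $body) ?? $body;
}

echo redactEmails('Reach me at jane.doe@example.com or call.'), "\n";
// Reach me at [email removed] or call.
```

Writing the address as "jane dot doe at example dot com" slips straight past it, which is the "more trouble than it is worth" workaround in practice.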
SpamAssassin also examines the subject and body of the message for the same things a person notices when a message "looks like spam." It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.
Regex is not mentioned, but I'm sure it is being used.
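For a feel of how string checks like those could add up to a verdict, here is a toy scorer. It only illustrates the idea: the rules, weights and the function name `spamScore` are all invented, not taken from SpamAssassin.

```php
<?php
// Hypothetical rules (pattern => points). The weights are made up.
function spamScore(string $message): float
{
    $rules = array(
        '~\bviagra\b~i'           => 3.0,
        '~\bbuy\s+now\b~i'        => 1.5,
        '~\blowest\s+prices?\b~i' => 1.5,
        '~\bclick\s+here\b~i'     => 1.0,
        '~<\s*blink\b~i'          => 2.0, // blinking text in HTML
    );

    $score = 0.0;
    foreach ($rules as $pattern => $points) {
        // Each rule counts at most once, however often it matches.
        if (preg_match($pattern, $message) === 1) {
            $score += $points;
        }
    }
    return $score;
}

echo spamScore('Click here to BUY NOW at the lowest prices!'), "\n";
echo spamScore('See you at lunch'), "\n";
```

A message would then be flagged once its total crosses some threshold, which is the same "first pass" idea as the profanity filter.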