How to remove offensive words from a PHP message?

I use the following code to replace offensive words -

$text = str_replace("f***","(Offensive words detected & removed!)",$text); 

This code replaces the offensive word with "(Offensive words detected & removed!)".

But the problem is the case: if someone types the word in upper case (FUCK), my code will not detect it. How can I solve this?

+4
6 answers

No matter what you do, users will find ways to get around your filters. They will use unicode characters ( ss , for example, uses the Cyrillic alphabet and will not be captured by any of the regex solutions). They will use spaces, dollar signs, asterisks, everything that you still have not managed to catch.

If family friendliness is important to your application, ask someone to watch the content before it goes live. Otherwise, add a flag function so that other people can mark offensive content. Better yet, use some kind of machine learning or a Bayesian filter to automatically flag potentially abusive messages and let people check them manually. People read human languages ​​better than computers.
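As a rough sketch of the "flag it and let a person check" idea, the snippet below collapses a few common disguises (look-alike letters, symbols, separators) and reports whether a message deserves a human look. The function names, the look-alike map, and the stand-in words 'fug' and 'schnitt' are all assumptions made for illustration. Because word boundaries are deliberately thrown away, false positives are expected, which is why this only flags and never replaces anything.

```php
<?php
// Hypothetical helper: collapse common disguises before checking a watchlist.
// The look-alike map is an illustrative assumption and far from complete.
function normalize_for_review(string $text): string
{
    $lookalikes = [
        'а' => 'a', 'А' => 'a', // Cyrillic a
        'е' => 'e', 'Е' => 'e', // Cyrillic e
        'о' => 'o', 'О' => 'o', // Cyrillic o
        '$' => 's',
        '@' => 'a',
        '0' => 'o',
    ];
    // Map look-alikes to plain Latin letters first, then lower-case the rest.
    $text = strtolower(strtr($text, $lookalikes));
    // Drop everything that is not a-z, so "f u g" and "f*u*g" both become "fug".
    return preg_replace('/[^a-z]+/', '', $text);
}

// Hypothetical helper: true means "show this message to a moderator".
function needs_review(string $text, array $watchlist): bool
{
    $collapsed = normalize_for_review($text);
    foreach ($watchlist as $word) {
        if (strpos($collapsed, $word) !== false) {
            return true;
        }
    }
    return false;
}

$watchlist = ['fug', 'schnitt']; // stand-ins for real offensive words
var_dump(needs_review('F u g this', $watchlist));      // bool(true)
var_dump(needs_review('what a $chnitt', $watchlist));  // bool(true)
var_dump(needs_review('Have a nice day', $watchlist)); // bool(false)
```

Anything this returns true for would go into a moderation queue rather than being rewritten automatically.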

+10

The problem with whitelists / blacklists, as other users point out, is that your users will make it a priority to find ways around your filter for their own amusement, rather than using your site for whatever it was intended for.

One approach would be to use Google's undocumented profanity API, created for its "What Do You Love?" web site. If you get a true response, just tell the user that their message could not be sent because profanity was detected.

You can approach this as follows:

    <?php
    if (isset($_POST['submit'])) {
        $url = sprintf('http://www.wdyl.com/profanity?q=%s', urlencode($_POST['comments']));
        $result = json_decode(file_get_contents($url));
        if ($result->response == true) {
            // profanity detected
        } else {
            // save comments to database as normal
        }
    }

+4

Other answers and comments say that programming is not the best solution to this problem. I agree with them. This discussion probably belongs on Moderators - Stack Exchange or Webmasters - Stack Exchange.

Since this is Stack Overflow, my answer will be based on computer programming.

If you want to use str_replace, do something like this. For the sake of this post, since some people are offended by actual curse words, let's pretend that these are the bad words: 'fug', 'schnitt', 'dam'.

 $text = str_ireplace("fug","(Offensive words detected & removed!)",$text); 

Please note: this is str_ireplace, not str_replace. The i stands for case-insensitive. But this will erroneously match "fuggedaboudit", for example.

If you want to do a more reliable job, you need to use a regular expression.

    $bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";

    // These words are the 'hits' that we need to replace.
    $hit_words = array("fug", "schnitt", "dam");

    // Turn each hit word into a regex (anonymous functions require PHP 5.3+).
    array_walk($hit_words, function (&$value, $key) {
        // \b means word boundary: space, line break, period, dash, and many others.
        // It prevents "refudgee" from being matched when searching for "fudge".
        $value = '~\b' . preg_quote($value, '~') . '\b~i';
    });
    /* print_r($hit_words); */

    $good_words = array("fudge", "shoot", "dang");

    // Does all search/replace actions at once.
    $good_text = preg_replace($hit_words, $good_words, $bad_text);
    echo '<br />' . $good_text . '<br />';

This will do all your searches / replacements at once. The two arrays must contain the same number of elements, pairing each search term with its replacement. It will not match parts of words, only whole words. And, of course, certain determined people will still find ways to swear on your site. But it will stop the lazy ones.
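One way to keep each search term paired with its replacement is to store them in a single associative array and build one combined pattern from its keys. The following is only a sketch under that assumption; the function name replace_whole_words and the word map are hypothetical, and the stand-in words are the same 'fug', 'schnitt' and 'dam' as above.

```php
<?php
// Hypothetical helper: replace whole words using a word => replacement map.
function replace_whole_words(string $text, array $replacements): string
{
    if ($replacements === []) {
        return $text;
    }
    // Lower-case the keys so the callback can look up any capitalisation of a match.
    $replacements = array_change_key_case($replacements, CASE_LOWER);

    // Escape each word and join them into one alternation: \b(?:fug|schnitt|dam)\b
    $quoted = array_map(
        function ($word) {
            return preg_quote($word, '~');
        },
        array_keys($replacements)
    );
    $pattern = '~\b(?:' . implode('|', $quoted) . ')\b~i';

    // A single pass over the text; each match is swapped for its mapped replacement.
    return preg_replace_callback(
        $pattern,
        function (array $match) use ($replacements) {
            return $replacements[strtolower($match[0])];
        },
        $text
    );
}

$map = ['fug' => 'fudge', 'schnitt' => 'shoot', 'dam' => 'dang'];
echo replace_whole_words('Fug dis schnitt, a schnitter never dam wins', $map);
// prints: fudge dis shoot, a schnitter never dang wins
```

Because the text is scanned only once, a replacement that happens to contain another listed word is not matched again, which can happen when preg_replace is given arrays and applies the patterns one after another.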

I decided to add some links to sites that obviously use programming as a first pass at eliminating profanity. I will add more as I come across them. Besides Yahoo:

1.) Dell.com - replaces matching words with <profanity deleted> . http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx

2.) Watson, the supercomputer, apparently developed a swearing problem. How do you tell the difference between profanity and slang? It seems to be so hard that the researchers simply decided to wipe it all out. But they could have just used a list of curse words (an exact match is a subset of regular expressions, I would say) and banned them. That is how it works in real life, anyway. Watson Develops Profanity

3.) The Content compliance section of the Gmail user settings in Google Apps for Business:

  1. Add expressions that describe the content you want to search for in each message

The "Expressions" used can be of several types, including "Advanced content match", which, among other things, lets you select a "Match type" option, very similar to what you would have in an Excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is empty, all of which presumably use regex behind the scenes. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. Thus, the mighty Google implements regular expression filtering options for its business users. Why do this when regex is supposedly so ineffective? Because it is actually quite effective. It is a simple, fast, programmatic solution that fails only when people go out of their way to get around it.

Beyond this list, I wonder if anyone else has noticed the similarity between profanity filtering and spam filtering. Obviously, regex is used in both arenas, but nitpickers who have learned that "all regex is bad" will always downvote any answer to any question if regex is so much as mentioned. Try googling "how spam filters work." You will get results like this one, which covers SpamAssassin: http://www.seas.upenn.edu/cets/answers/spamblock-filter.html

Another example where I'm sure regex is used is communicating with a seller through the Amazon.com Marketplace. You receive emails at your regular email address. So, naturally, when you reply to the seller, your email program will include all kinds of sender information, such as your email address, other email addresses, and anything you enter into the body. But Amazon.com strips these out "for your protection." Could I find a way around this regex? Probably, but it would be more trouble than it is worth, and so it is effective to a certain extent. They also store emails for 2 years, apparently so that a person can go through them in case of any allegations of fraud.

SpamAssassin also examines the subject and body of the message for the same things that a person notices when a message "looks like spam." It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.

Regex is not mentioned, but I'm sure it is being used.

+1

Use str_ireplace, which is the case-insensitive version of str_replace().

 $text = str_ireplace("flip","(Offensive words detected & removed!)", $text); 
0

You should use regular expression replacement and add the i flag at the end of your pattern so that it matches your text regardless of case:

 $text = preg_replace("/fuck/i","(Offensive words detected & removed!)", $text); 

str_ireplace can also be used if you do not need complex regex rules.

 $text = str_ireplace("fuck","(Offensive words detected & removed!)", $text); 

In fact, the latter is the preferred method, as it is faster than regular expression manipulation. From the PHP docs:

If you don't need fancy replacing rules, you should generally use this function instead of preg_replace() with the i modifier.

BUT, as a commenter noted, simple string / regular expression replacements can mangle your text if the substring you are replacing appears as part of another, inoffensive word. To avoid this, you can use word boundaries in your regular expressions, or replace only words that cannot be part of other words (for example, the word fuck).
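To illustrate the difference, here is a small comparison using 'fug' as a stand-in for a real offensive word; the sample text and the replacement notice are assumptions made for this sketch.

```php
<?php
// 'fug' is a stand-in for a real offensive word.
$text = 'Fug that, fuggedaboudit';
$notice = '(removed)';

// Plain case-insensitive replacement also hits the start of the longer word.
echo str_ireplace('fug', $notice, $text), "\n";
// prints: (removed) that, (removed)gedaboudit

// \b anchors the match to word boundaries, so only the standalone word is replaced.
echo preg_replace('/\bfug\b/i', $notice, $text), "\n";
// prints: (removed) that, fuggedaboudit
```

The word-boundary version leaves longer words alone, at the cost of also letting inflected forms of the offensive word through.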

0

Use 'str_ireplace' to replace strings regardless of case. This will probably help you.

    $text = 'contains offensive_word .... so on';
    $array = array(
        'offensive_word'  => '****',
        'offensive_word2' => '****',
        'offensive_word3' => '****',
        //.....
    );
    $text = str_ireplace(array_keys($array), array_values($array), $text);
    echo $text;

0

Source: https://habr.com/ru/post/1498141/

