How to handle user input of invalid UTF-8 characters?

I am looking for a general strategy or recommendation on how to handle invalid UTF-8 input from users.

Although my web app uses UTF-8, some users somehow enter invalid characters. This causes errors in PHP's json_encode(), and overall it seems like a bad idea to keep such data around.

The W3C I18N FAQ "Multilingual Forms" says: "If non-UTF-8 data is received, an error message should be sent back."

  • How exactly should this be done in practice, across an entire site with dozens of different places where data can be entered?
  • How do you present the error in a way that is helpful to the user?
  • How do you temporarily store and display bad form data so that the user does not lose all of their text? Strip the bad characters? Use a replacement character, and if so, how?
  • For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as it is in the database but do something (what?) before json_encode()?

EDIT: I am very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP". I would like advice from people with experience in real-world situations about how they have handled this.

EDIT2: As part of the solution, I would really like to see a fast way to convert invalid characters to U+FFFD.

+31
php encoding utf-8
Sep 15
8 answers

The accept-charset="UTF-8" attribute is only a guideline for browsers; they are not forced to submit data that way. Crappy form-submission bots are a good example.

What I usually do is ignore bad characters, either through iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.

Here is an example using iconv() :

    $str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
    $str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);

If you want to display an error message to your users, I would probably do it globally rather than per received value. Something like this would probably do just fine:

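    function utf8_clean($str)
    {
        // Drop every byte sequence that is not valid UTF-8.
        return iconv('UTF-8', 'UTF-8//IGNORE', $str);
    }

    // Note: array_map() only handles flat arrays of strings here.
    $clean_GET = array_map('utf8_clean', $_GET);

    // If cleaning changed anything, the input contained invalid UTF-8.
    if (serialize($_GET) != serialize($clean_GET)) {
        $_GET = $clean_GET;
        $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
    }

    // $_GET is clean!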

You can also normalize newlines and strip (in)visible control characters, for example:

    function Clean($string, $control = true)
    {
        $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

        if ($control === true) {
            return preg_replace('~\p{C}+~u', '', $string);
        }

        return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
    }



Code to convert from UTF-8 to Unicode code points:

    function Codepoint($char)
    {
        $result = null;
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint)) {
            $result = sprintf('U+%04X', $codepoint[1]);
        }

        return $result;
    }

    echo Codepoint('à'); // U+00E0
    echo Codepoint('ひ'); // U+3072

This is probably faster than any other alternative, though I have not tested it extensively.




Example:

    // U+FFFE and U+FFFD are written as byte escapes so they stay visible here.
    $string = "\xEF\xBF\xBEhello world\xEF\xBF\xBD";

    // U+FFFEhello worldU+FFFD
    echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

    function Bad_Codepoint($string)
    {
        $result = array();

        // $string is the array of matches passed in by preg_replace_callback().
        foreach ((array) $string as $char) {
            $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

            if (is_array($codepoint) && array_key_exists(1, $codepoint)) {
                $result[] = sprintf('U+%04X', $codepoint[1]);
            }
        }

        return implode('', $result);
    }

Is this what you were looking for?

+54
Sep 18 '10 at 18:16

Receiving invalid characters in your web application may be related to the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

 <form action="..." accept-charset="UTF-8"> 

You can also take a look at similar questions on Stack Overflow for pointers on how to handle invalid characters, e.g. the ones listed in the column on the right. However, I think that signaling an error to the user is better than trying to clean up the invalid characters, which can cause an unexpected loss of significant data or an unexpected change to your users' input.
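One way to signal the error rather than silently clean the data is to find out which fields failed validation and ask the user to re-enter only those. The following is a minimal sketch, not a definitive implementation: the function name find_invalid_utf8_fields() and the sample field names are hypothetical, and it assumes the mbstring extension is available.

```php
<?php
/**
 * Return the names of all fields whose value is not valid UTF-8.
 * Nested arrays (e.g. name="tags[]") are walked recursively.
 * Hypothetical helper, assuming the mbstring extension is loaded.
 */
function find_invalid_utf8_fields(array $input, $prefix = '')
{
    $bad = array();

    foreach ($input as $key => $value) {
        // Build a readable field name such as "tags[1]" for nested values.
        $name = $prefix === '' ? (string) $key : $prefix . '[' . $key . ']';

        if (is_array($value)) {
            $bad = array_merge($bad, find_invalid_utf8_fields($value, $name));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $name;
        }
    }

    return $bad;
}

// Hypothetical request data: two values contain bytes that are not valid UTF-8.
$input = array(
    'name'    => 'Jose',
    'comment' => "caf\xE9",            // Latin-1 "e acute", invalid as UTF-8
    'tags'    => array('ok', "\xFF"),  // 0xFF never occurs in UTF-8
);

$badFields = find_invalid_utf8_fields($input);

if ($badFields) {
    echo 'These fields contain invalid characters: ' . implode(', ', $badFields) . "\n";
}
```

The valid fields can then be redisplayed unchanged, so the user only has to fix the fields named in the message.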

+4
Sep 15 '10 at 6:56

I put together a fairly simple class that checks whether input is valid UTF-8 and runs it through utf8_encode() if necessary:

    class utf8
    {
        /**
         * Recursively convert every value that is not valid UTF-8.
         * Note: utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2.
         *
         * @param array $data
         * @return array
         */
        public static function encode(array $data)
        {
            foreach ($data as $key => $val) {
                if (is_array($val)) {
                    $data[$key] = self::encode($val);
                } elseif (!self::check($val)) {
                    $data[$key] = utf8_encode($val);
                }
            }

            return $data;
        }

        /**
         * Regular expression to test a string is UTF8 encoded
         *
         * RFC3629
         *
         * @param string $string The string to be tested
         * @return bool
         *
         * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
         */
        public static function check($string)
        {
            // preg_match() returns 1 on match, 0 on no match, false on error.
            return 1 === preg_match('%^(?:
                  [\x09\x0A\x0D\x20-\x7E]            # ASCII
                | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
                | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
                | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
                | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
                | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
                | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
                | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
            )*$%xs', $string);
        }
    }

    // For example
    $data = utf8::encode($_POST);
+2
Sep 21 '10 at 16:03

There is a multibyte string extension for PHP (mbstring); check it out: http://www.php.net/manual/en/book.mbstring.php

You should try mb_check_encoding().

Good luck
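For the U+FFFD conversion asked about in the question, mbstring can also do the replacement itself. The sketch below assumes the mbstring extension is loaded; utf8_scrub() is a hypothetical name. It converts the string from UTF-8 to UTF-8 with the substitute character set to U+FFFD, so invalid bytes are replaced and valid text passes through. How many replacement characters are emitted for one malformed sequence may differ between PHP versions.

```php
<?php
/**
 * Replace invalid UTF-8 byte sequences with U+FFFD.
 * Hypothetical helper, assuming the mbstring extension is loaded.
 */
function utf8_scrub($s)
{
    // Fast path: valid strings are returned untouched.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    // Temporarily switch the substitute character to U+FFFD.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);

    // Converting UTF-8 to UTF-8 replaces bytes that cannot be decoded.
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');

    // Restore the previous setting so other code is not affected.
    mb_substitute_character($previous);

    return $clean;
}

$dirty = "abc\xFFdef";          // 0xFF is never valid in UTF-8
$clean = utf8_scrub($dirty);

var_dump(mb_check_encoding($clean, 'UTF-8'));
```

The cleaned string can then be stored or passed to json_encode() without the encoding error.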

+1
Sep 15 '10 at 6:50

I recommend simply not allowing garbage to get in. Do not rely on custom functions that can bog your system down. Just walk the submitted data against an alphabet that you design: create a string of acceptable characters and walk the submitted data byte by byte, as if it were an array. Push acceptable characters onto a new string and omit unacceptable ones. The data you store in your database is then data triggered by the user, but not the data actually supplied by the user.

EDIT #4: Replacing a bad character with the entity &#65533;

EDIT #3: Updated September 22, 2010, 13:32. Reason: the returned string is now UTF-8, and I used the test file you provided as proof.

    <?php
    // build alphabet
    // optionally you can remove characters from this array

    $alpha = array();
    $alpha[]= chr(0); // null
    $alpha[]= chr(9); // tab
    $alpha[]= chr(10); // new line
    $alpha[]= chr(11); // vertical tab
    $alpha[]= chr(13); // carriage return

    for ($i = 32; $i <= 126; $i++) {
        $alpha[]= chr($i);
    }

    /* remove comment to check ascii ordinals */

    // /*
    // foreach ($alpha as $key=>$val){
    //     print ord($val);
    //     print '<br/>';
    // }
    // print '<hr/>';
    //*/
    //
    // //test case #1
    //
    // $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv '.chr(160).chr(127).chr(126);
    //
    // $string = teststr($alpha,$str);
    // print $string;
    // print '<hr/>';
    //
    // //test case #2
    //
    // $str = ''.'©?™???';
    // $string = teststr($alpha,$str);
    // print $string;
    // print '<hr/>';
    //
    // $str = '©';
    // $string = teststr($alpha,$str);
    // print $string;
    // print '<hr/>';

    $file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
    $testfile = implode(chr(10),file($file));

    $string = teststr($alpha,$testfile);
    print $string;
    print '<hr/>';

    function teststr(&$alpha, &$str){
        $strlen = strlen($str);
        $newstr = chr(0); //null
        $x = 0;
        if($strlen >= 2){
            // walk the input byte by byte
            for ($i = 0; $i < $strlen; $i++) {
                $x++;
                if(in_array($str[$i],$alpha)){
                    // passed
                    $newstr .= $str[$i];
                }else{
                    // failed
                    print 'Found out of scope character. (ASCII: '.ord($str[$i]).')';
                    print '<br/>';
                    $newstr .= '&#65533;';
                }
            }
        }elseif($strlen <= 0){
            // failed to qualify for test
            print 'Non-existent.';
        }elseif($strlen === 1){
            $x++;
            if(in_array($str,$alpha)){
                // passed
                $newstr = $str;
            }else{
                // failed
                print 'Total character failed to qualify.';
                $newstr = '&#65533;';
            }
        }else{
            print 'Non-existent (scope).';
        }

        if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8"){
            // skip
        }else{
            $newstr = utf8_encode($newstr);
        }

        // test encoding:
        if(mb_detect_encoding($newstr, "UTF-8")=="UTF-8"){
            print 'UTF-8 :D<br/>';
        }else{
            print 'ENCODED: '.mb_detect_encoding($newstr, "UTF-8").'<br/>';
        }

        return $newstr.' (scope: '.$x.', '.$strlen.')';
    }
+1
Sep 20 '10 at 13:49

For completeness (not necessarily the best answer)...

    function as_utf8($s)
    {
        return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
    }
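Also for completeness: if the only goal is to keep json_encode() from failing on data that is already stored, newer PHP versions can substitute the bad bytes at encoding time. This sketch assumes PHP 7.2 or later, where the JSON_INVALID_UTF8_SUBSTITUTE flag exists (JSON_INVALID_UTF8_IGNORE drops the bytes instead); the sample row is hypothetical.

```php
<?php
// Hypothetical database row containing a Latin-1 byte (invalid as UTF-8).
$row = array('title' => "caf\xE9");

// Without a flag, json_encode() fails on malformed UTF-8 and returns false.
$strict = json_encode($row);
$error  = json_last_error_msg();

// With the flag (PHP >= 7.2), invalid bytes are encoded as U+FFFD instead.
$lenient = json_encode($row, JSON_INVALID_UTF8_SUBSTITUTE);

var_dump($strict);   // bool(false)
echo $error, "\n";
echo $lenient, "\n";
```

This leaves the stored data untouched, so it does not replace validating input on the way in.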
+1
Sep 25 '10 at 1:24
source share

How about stripping all characters outside your given subset? At least in some parts of my application, for example usernames, I do not allow characters outside the [a-zA-Z] and [0-9] sets. You can build a filter function that either silently strips all characters outside this range, or returns an error when it detects them and pushes the decision back to the user.
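Both variants of such a filter can be sketched in a few lines. The function names below are hypothetical, and the allowed set (ASCII letters and digits) is only the username example from above; the patterns are deliberately used without the /u modifier so that they work byte by byte and do not fail on invalid UTF-8.

```php
<?php
/**
 * Strict variant: accept the value only if every byte is an ASCII letter or digit.
 * Hypothetical helper; the D modifier stops "$" from matching before a trailing newline.
 */
function is_valid_username($name)
{
    return preg_match('/^[A-Za-z0-9]+$/D', $name) === 1;
}

/**
 * Lenient variant: silently drop every byte outside the allowed set.
 */
function strip_username($name)
{
    return preg_replace('/[^A-Za-z0-9]/', '', $name);
}

$input = "ali\xFFce 42";   // contains an invalid byte and a space

if (!is_valid_username($input)) {
    echo "Usernames may only contain letters and digits.\n";
}

echo strip_username($input), "\n";
```

The strict variant suits fields where the user should decide how to fix the value; the lenient one changes the input without telling them.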

0
Sep 15 '10 at 7:07

Try doing what Rails does to force all browsers to always post UTF-8 data:

    <form accept-charset="UTF-8" action="#{action}" method="post">
      <div style="margin:0;padding:0;display:inline">
        <input name="utf8" type="hidden" value="&#x2713;" />
      </div>
      <!-- form fields -->
    </form>

See railssnowman.info or the initial patch for an explanation.

  • For the browser to send form submission data in UTF-8, simply serve the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
  • For the browser to send form submission data in UTF-8 even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
  • For the browser to send form submission data in UTF-8 even if the user fiddles with the page encoding, and even if the browser is IE and the user switches the page encoding to Korean and enters Korean characters in the form fields, add a hidden form input with a value such as &#x2713; that can only come from the Unicode character set (and, in this example, not from the Korean character set).
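On the PHP side, the hidden field can also serve as a cheap check that the submission really arrived as UTF-8. This is a sketch under assumptions: the field name utf8 matches the form above, posted_as_utf8() is a hypothetical helper, and the expected bytes are the UTF-8 encoding of U+2713 (E2 9C 93).

```php
<?php
/**
 * Return true if the hidden "utf8" marker arrived as the UTF-8 bytes of U+2713.
 * Hypothetical helper; $post would normally be $_POST.
 */
function posted_as_utf8(array $post)
{
    return isset($post['utf8']) && $post['utf8'] === "\xE2\x9C\x93";
}

// Simulated submissions, for illustration only.
$good = array('utf8' => "\xE2\x9C\x93", 'comment' => 'hello');
$bad  = array('utf8' => '&#10003;',     'comment' => 'hello'); // marker not sent as UTF-8

var_dump(posted_as_utf8($good)); // bool(true)
var_dump(posted_as_utf8($bad));  // bool(false)
```

If the marker does not match, the other fields were probably sent in a different encoding too, so it is safer to reject the submission or show an error than to store it.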
0
Sep 15