The accept-charset="UTF-8" attribute is just a guideline for browsers that they donβt obey, so they are not forced to report that in this way, crappy bot submission forms are a good example ...
What I usually do is ignore bad characters, either through iconv() or with less reliable utf8_encode() / utf8_decode() , if you use iconv , you also have the option to transliterate bad characters.
Here is an example using iconv() :
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str); $str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users, I will probably do it globally, instead of getting the resulting value, something like this will probably be very good:
function utf8_clean($str) { return iconv('UTF-8', 'UTF-8//IGNORE', $str); } $clean_GET = array_map('utf8_clean', $_GET); if (serialize($_GET) != serialize($clean_GET)) { $_GET = $clean_GET; $error_msg = 'Your data is not valid UTF-8 and has been stripped.'; }
You can also normalize newlines and stripes with (un) visible control characters, for example:
function Clean($string, $control = true) { $string = iconv('UTF-8', 'UTF-8//IGNORE', $string); if ($control === true) { return preg_replace('~\p{C}+~u', '', $string); } return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string); }
Code to convert from UTF-8 to Unicode:
function Codepoint($char) { $result = null; $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char)); if (is_array($codepoint) && array_key_exists(1, $codepoint)) { $result = sprintf('U+%04X', $codepoint[1]); } return $result; } echo Codepoint('Γ ');
Probably faster than any other alternative, have not tested it extensively, though.
Example:
$string = 'hello world '; // U+FFFEhello worldU+FFFD echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string); function Bad_Codepoint($string) { $result = array(); foreach ((array) $string as $char) { $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char)); if (is_array($codepoint) && array_key_exists(1, $codepoint)) { $result[] = sprintf('U+%04X', $codepoint[1]); } } return implode('', $result); }
Is this what you were looking for?