Do I support UTF-8 correctly in my PHP applications?

I would like to make sure that everything I know about UTF-8 is correct. I have been trying to use UTF-8 for a while now, but I keep stumbling across more and more bugs and other weird things that make it seem almost impossible to have a 100% UTF-8 site. There is always a gotcha somewhere that I seem to miss. Perhaps someone here can correct my list or OK it, so I don't miss anything important.

Database

Every site has to store its data somewhere. No matter what your PHP settings are, you must also configure the database. If you can't access the config files, make sure you run SET NAMES 'utf8' as soon as you connect. Also make sure you are using utf8_unicode_ci on all of your tables. This assumes MySQL for the database; you will have to change this for others.

Regex

I do a LOT of regex work that is more complex than your average search-and-replace. I have to remember to use the "/u" modifier so that PCRE doesn't corrupt my strings. Yet even then there are apparently still problems.
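A small sketch of what the /u modifier changes, assuming PHP's bundled PCRE. The sample string is written with byte escapes so the result does not depend on how the script file is saved; the variable names are only for illustration.

```php
<?php
// "é" is the two bytes C3 A9 in UTF-8.
$e = "\xC3\xA9";

// Without /u the dot matches a single byte, so one "é" looks like two characters.
$bytewise = preg_match('/^.$/', $e);

// With /u the subject is read as UTF-8 and the dot matches the whole character.
$unicode = preg_match('/^.$/u', $e);

// Splitting on the empty pattern with /u yields characters, not bytes.
$chars = preg_split('//u', "h" . $e . "llo", -1, PREG_SPLIT_NO_EMPTY);
```

Here $bytewise is 0, $unicode is 1, and $chars holds five elements.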

String functions

All of the default string functions (strlen(), strpos(), etc.) should be replaced with the Multibyte String functions, which look at the character instead of the byte.
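To make the byte/character difference concrete, here is a minimal sketch that assumes the mbstring extension is loaded. The encoding is passed explicitly so the result does not depend on mbstring.internal_encoding.

```php
<?php
$word = "h\xC3\xA9llo";                          // "héllo" encoded as UTF-8

$byteLength = strlen($word);                      // counts bytes
$charLength = mb_strlen($word, 'UTF-8');          // counts characters

$bytePos = strpos($word, 'l');                    // byte offset of the first "l"
$charPos = mb_strpos($word, 'l', 0, 'UTF-8');     // character offset of the first "l"

$firstTwoBytes = substr($word, 0, 2);             // cuts "é" in half
$firstTwoChars = mb_substr($word, 0, 2, 'UTF-8'); // keeps "é" whole
```

The byte-oriented calls report a length of 6 and an offset of 3, while the mb_* calls report 5 and 2.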

Headers

You must make sure that your server returns the correct header, so that the browser knows which encoding you are trying to use (just like you must tell MySQL).

header('Content-Type: text/html; charset=utf-8');

It's also a good idea to put the matching <meta> tag in the page head, although the actual header will override it if the two differ.

<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> 

Questions

Do I need to convert everything that I receive from the user agent (HTML forms and URIs) to UTF-8 when the page loads, or can I just leave the strings/values as they are and still run them through these functions without a problem?

If I do need to convert everything to UTF-8, what steps should I take? mb_detect_encoding seems to be built for this, but I keep seeing people complain that it doesn't always work. mb_check_encoding also seems to have a problem telling a good UTF-8 string from a malformed one.

Does PHP store strings in memory differently depending on what encoding it is using (like file types), or are they still stored as regular strings with some of the characters interpreted differently (like &amp; vs & in HTML)? chazomaticus answers this question:

In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of.

If I give a non-UTF-8 string to an mb_* function, will it ever cause a problem?

If a UTF string is improperly encoded, will something go wrong (like a parsing error in a regex?), or will it just mark an entity as bad (HTML)? Is there ever a chance that an improperly encoded string will make a function return FALSE because the string is bad?

I have heard that you should also mark your forms as UTF-8 (accept-charset="UTF-8"), but I'm not sure what the benefit is..?

Was UTF-16 written to address a limit in UTF-8? As in, did UTF-8 run out of space for characters? (Y2(UTF)K?)

Functions

Here are a couple of custom PHP functions that I found, but I have no way to verify that they actually work. Perhaps someone has an example I can use. First is convertToUTF8(), and then seems_utf8 from WordPress.

    function seems_utf8($str) {
        $length = strlen($str);
        for ($i = 0; $i < $length; $i++) {
            $c = ord($str[$i]);
            if ($c < 0x80) $n = 0;                 # 0bbbbbbb
            elseif (($c & 0xE0) == 0xC0) $n = 1;   # 110bbbbb
            elseif (($c & 0xF0) == 0xE0) $n = 2;   # 1110bbbb
            elseif (($c & 0xF8) == 0xF0) $n = 3;   # 11110bbb
            elseif (($c & 0xFC) == 0xF8) $n = 4;   # 111110bb
            elseif (($c & 0xFE) == 0xFC) $n = 5;   # 1111110b
            else return false;                     # Does not match any model
            for ($j = 0; $j < $n; $j++) {          # n bytes matching 10bbbbbb follow ?
                if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                    return false;
            }
        }
        return true;
    }

    function is_utf8($str) {
        $c = 0;
        $b = 0;
        $bits = 0;
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $c = ord($str[$i]);
            if ($c >= 128) {                       # was "> 128", which let a lone 0x80 byte pass
                if (($c >= 254)) return false;
                elseif ($c >= 252) $bits = 6;
                elseif ($c >= 248) $bits = 5;
                elseif ($c >= 240) $bits = 4;
                elseif ($c >= 224) $bits = 3;
                elseif ($c >= 192) $bits = 2;
                else return false;                 # continuation byte where a lead byte is expected
                if (($i + $bits) > $len) return false;
                while ($bits > 1) {                # the remaining bytes must be 10bbbbbb
                    $i++;
                    $b = ord($str[$i]);
                    if ($b < 128 || $b > 191) return false;
                    $bits--;
                }
            }
        }
        return true;
    }

If anyone is interested, I found a great example page to use when testing UTF-8.

+40
php unicode utf-8
Aug 22 '09 at 10:01
5 answers

Do I need to convert everything that I get from a user agent (HTML form and URI) to UTF-8 when the page loads

No. The user agent should be submitting data in UTF-8 format; if it isn't, you are losing the benefit of Unicode.

The way to ensure that a user agent submits in UTF-8 is to serve the page containing the form in UTF-8 encoding. Use the Content-Type header (and meta http-equiv too, if you intend the form to be saved and to work standalone).
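As a rough sketch of that flow, the page can announce UTF-8 and then check that what comes back really is well-formed UTF-8 instead of converting it. The helper name allFieldsAreUtf8 is made up for this example, and rejecting the request on failure is just one possible policy.

```php
<?php
// Hypothetical helper: true only if every submitted value is well-formed UTF-8.
function allFieldsAreUtf8(array $fields): bool
{
    foreach ($fields as $value) {
        if (is_array($value)) {
            // Form fields such as name="tags[]" arrive as nested arrays.
            if (!allFieldsAreUtf8($value)) {
                return false;
            }
        } elseif (!mb_check_encoding((string) $value, 'UTF-8')) {
            return false;
        }
    }
    return true;
}

// Serve the page that contains the form as UTF-8 so the browser submits UTF-8.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Usage sketch: refuse the request rather than guessing at another encoding.
// if (!allFieldsAreUtf8($_POST)) { http_response_code(400); exit; }
```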

I heard that you should also mark your forms as UTF-8 (accept-charset="UTF-8")

Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable charsets, but IE treats it as a list of additional charsets to try, on a per-field basis. So if you have an ISO-8859-1 page and an accept-charset="UTF-8" form, IE will first try to encode a field as ISO-8859-1, and if there is a non-8859-1 character in it, it will fall back to UTF-8.

But since IE does not tell you whether it used ISO-8859-1 or UTF-8, that is of absolutely no use to you. You would have to guess, for each field separately, which encoding was in use! Not useful. Omit the attribute and serve your pages as UTF-8; that's the best you can do at the moment.

If a UTF string is improperly encoded, will something go wrong?

If you let such a sequence get through to the browser, you could be in trouble. There are "overlong sequences", which encode a low-numbered code point in a longer sequence of bytes than necessary. This means that if you filter '<' by looking for that ASCII character in a sequence of bytes, you could miss one and let a script element into what you thought was safe text.

Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their act together: IE would interpret the byte sequence '\xC0\xBC' as a '<' up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily these older browsers are dying out, but it is still worth filtering overlong sequences in case those browsers are still around (or new idiot browsers make the same mistake in the future). You can do this, and fix other bad sequences, with a regex that allows only proper UTF-8 through, such as this one from W3.
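The overlong case is easy to reproduce. The following sketch builds the '\xC0\xBC' sequence mentioned above and shows that a byte-level search for '<' does not see it, while a strict validity check (here mb_check_encoding, assuming a reasonably current mbstring) refuses the string.

```php
<?php
// Overlong two-byte encoding of '<' (U+003C); its shortest form is the single byte 0x3C.
$overlong = "\xC0\xBC" . "script>";

// A byte-level search for '<' finds nothing, so a naive filter lets this through.
$foundLessThan = strpos($overlong, '<') !== false;

// A strict UTF-8 check rejects the string, because 0xC0 can never start a valid sequence.
$isValid = mb_check_encoding($overlong, 'UTF-8');
```

Both $foundLessThan and $isValid come out false, which is why validating the encoding has to happen before any filtering for markup.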

If you use the mb_ functions in PHP, you might be insulated from these issues. I can't say for sure, as mb_* was unusably fragile back when I was still writing PHP.

In any case, this is also a good time to remove control characters, which are a large and generally unappreciated source of bugs. I would remove characters 9 and 13 from submitted strings in addition to the others that the W3 regex takes out; it is also worth removing plain newlines from strings that you know are not supposed to come from multi-line text boxes.
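A possible PHP rendering of that approach is sketched below. The pattern is modelled on the W3C's byte-level check for well-formed UTF-8; the constant and function names are invented for this example, and on very long inputs PCRE may hit its backtracking limits, in which case the check simply reports the string as not clean.

```php
<?php
// Byte-level pattern modelled on the W3C check. There is no /u modifier here:
// the pattern itself describes the allowed byte sequences.
const VALID_UTF8 = '/\A(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/x';

// Hypothetical helper: true if the string is well-formed UTF-8 without stray controls.
function isCleanUtf8(string $s): bool
{
    return preg_match(VALID_UTF8, $s) === 1;
}

// Hypothetical helper: drop C0 controls and DEL (including tab and CR) but keep LF.
// These bytes never occur inside a multibyte UTF-8 sequence, so a byte-level
// replace cannot damage other characters.
function stripControlChars(string $s): string
{
    return preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $s) ?? '';
}
```

For single-line fields, the kept LF (0x0A) could be added to the stripped range as well.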

Was UTF-16 written to address a limit in UTF-8?

No, UTF-16 is a two-byte-per-code-point encoding that is used to make indexing Unicode strings easier in memory (it dates from the days when all of Unicode fit in two bytes; systems like Windows and Java still do it that way). Unlike UTF-8, it is not compatible with ASCII, and it is of little to no use on the Web. But you occasionally meet it in saved files, usually ones saved by Windows users who have been misled by Windows's description of UTF-16LE as "Unicode" in Save-As menus.
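To see the byte layouts side by side, here is a small sketch using mb_convert_encoding (assuming mbstring is available). It also shows that a character outside the Basic Multilingual Plane no longer fits in a single two-byte unit and needs a surrogate pair.

```php
<?php
$ascii = "A";                  // U+0041, one byte in UTF-8
$emoji = "\xF0\x9F\x98\x80";   // U+1F600, four bytes in UTF-8, outside the BMP

$asciiUtf16 = mb_convert_encoding($ascii, 'UTF-16LE', 'UTF-8');
$emojiUtf16 = mb_convert_encoding($emoji, 'UTF-16LE', 'UTF-8');

// Hex dumps of the UTF-16LE bytes.
$asciiHex = bin2hex($asciiUtf16);   // "A" plus a zero byte: not ASCII-compatible
$emojiHex = bin2hex($emojiUtf16);   // surrogate pair D83D DE00, little-endian
```

$asciiHex is "4100" and $emojiHex is "3dd800de", so the second string takes four bytes in UTF-16 as well.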

seems_utf8

This is very inefficient compared to regex!
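One common regex-based alternative, sketched here under the assumption that PCRE was built with UTF-8 support (the default in PHP), is to let preg_match do the validation. The function name isUtf8 is made up for this example.

```php
<?php
// Hypothetical helper: let PCRE validate the string instead of a PHP-level byte loop.
function isUtf8(string $s): bool
{
    // With /u, preg_match() returns false (not 0) when the subject is not valid UTF-8.
    // The empty pattern matches any valid subject, including the empty string.
    return preg_match('//u', $s) === 1;
}
```

Unlike seems_utf8, this check also rejects overlong forms and the obsolete 5- and 6-byte sequences.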

Also, be sure to use utf8_unicode_ci for all of your tables.

In fact, you can more or less get away without this by treating MySQL as a store for nothing but bytes and only interpreting them as UTF-8 in your script. The advantage of using utf8_unicode_ci is that it will collate (sort and do case-insensitive comparisons) with knowledge of non-ASCII characters, so that e.g. "ŕ" and "Ŕ" are the same character. If you use a non-UTF8 collation, you should stick to binary (case-sensitive) matching.

Whichever you choose, do it consistently: use the same character set for your tables as for your connection. What you want to avoid is a lossy character set conversion between your scripts and the database.

+20
Aug 22 '09 at 23:23

Most of what you are doing now should be correct.

Some notes: any utf8_* collation in MySQL will store your data correctly as UTF-8; the only difference between them is the collation (alphabetical order) applied when sorting.

You can tell Apache and PHP to send the correct charset headers by setting AddDefaultCharset utf-8 in httpd.conf/.htaccess and default_charset = "utf-8" in php.ini, respectively.

You can tell the mbstring extension to take care of the string functions. This works for me:

    mbstring.internal_encoding=utf-8
    mbstring.http_output=UTF-8
    mbstring.encoding_translation=On
    mbstring.func_overload=6

(this leaves the mail() function untouched; I found that setting it to 7 played havoc with my mail headers)
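Where php.ini is out of reach, roughly the same defaults can be set at runtime. This is only a sketch: mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions the string functions have to be replaced by explicit mb_* calls rather than overloaded.

```php
<?php
// Runtime equivalents for when php.ini cannot be edited.
ini_set('default_charset', 'UTF-8');   // charset PHP adds to its Content-Type header
mb_internal_encoding('UTF-8');         // default encoding for mb_* calls that omit one

// An explicit mb_* call stands in for what func_overload used to do implicitly.
$length = mb_strlen("h\xC3\xA9llo");   // character count of "héllo"
```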

To convert charset, see https://sourceforge.net/projects/phputf8/ .

PHP does not care at all about what is in a variable; it just stores and retrieves its content blindly.

You will get unexpected results if you declare one mbstring.internal_encoding and then pass strings in a different encoding to the mb_* functions. You can, however, safely pass ASCII to the utf-8 functions.

If you are worried about someone posting incorrectly encoded material, I believe you should consider HTML Purifier to filter the GET/POST data before processing it.

accept-charset has been in the specification forever, but its real support in browsers is more or less nil. The browser will typically use the encoding of the page containing the form.

UTF-16 is not the big brother of UTF-8, it just serves a different purpose.

+11
Aug 22 '09 at 22:58

database/mysql: If you use SET NAMES with, for example, php/mysql, you leave mysql_real_escape_string() in the dark about the change of character encoding. This may lead to wrong results. So if you rely on an escape function like mysql_real_escape_string() (because you are not using prepared statements), SET NAMES is a suboptimal solution. That is why mysql_set_charset() was introduced, and why Gentoo applies a patch that adds the mysql.connect_charset configuration parameter for php/mysql and php/mysqli.
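With the newer database APIs the same idea looks roughly like this: the character set is declared as part of the connection, so the client library knows about it, and prepared statements remove the dependence on escape functions. The helper name utf8Dsn, the host, the database name and the table in the usage comment are placeholders invented for this sketch.

```php
<?php
// Hypothetical helper; host and database names are placeholders.
function utf8Dsn(string $host, string $dbname): string
{
    // The charset part of the DSN sets the connection character set at connect time.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $dbname);
}

$dsn = utf8Dsn('localhost', 'example_db');

// Usage sketch (needs a reachable server and real credentials):
// $pdo  = new PDO($dsn, $user, $password);
// $stmt = $pdo->prepare('INSERT INTO notes (body) VALUES (?)');
// $stmt->execute([$text]);
//
// With mysqli the counterpart is $mysqli->set_charset('utf8').
```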

The client usually does not indicate the encoding of the parameters it sends. If you expect utf-8 encoded data and treat it as such, there may be encoding errors (byte sequences that are not valid in utf-8). So the data may not be displayed as expected, or a parser may abort parsing. But at least the user input cannot "escape" and do more harm, e.g. in an inline SQL statement or in HTML output. For example, take this script (saved as iso-8859-1 or utf-8, it doesn't matter):

    <?php
    $s = 'abcxyz';
    var_dump(htmlspecialchars($s, ENT_QUOTES, 'utf-8'));

    // adding the byte sequence for äöü in iso-8859-1
    $s = 'abc' . chr(0xE4) . chr(0xF6) . chr(0xFC) . 'xyz';
    var_dump(htmlspecialchars($s, ENT_QUOTES, 'utf-8'));

prints

    string(6) "abcxyz"
    string(0) ""

E4F6FC is not a valid utf-8 byte sequence, so htmlspecialchars returns an empty string. Other functions may return ? or another "special" character. But at least they will not mistake a character for a malicious control character, as long as they all stick to the "right" encoding (in this case utf-8).
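If an empty result is too drastic, htmlspecialchars can be asked to substitute instead. The sketch below assumes PHP 5.4 or later, where the ENT_SUBSTITUTE flag exists; it replaces invalid sequences with the Unicode replacement character U+FFFD and keeps the rest of the string.

```php
<?php
$s = 'abc' . chr(0xE4) . 'xyz';   // "ä" as a single ISO-8859-1 byte, invalid as UTF-8

// Without a substitution flag the whole result is discarded.
$strict = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the invalid byte is replaced and the valid parts survive.
$lenient = htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
```

$strict is the empty string, while $lenient is valid UTF-8 that still contains "abc" and "xyz".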

accept-charset does not guarantee that you will only receive data in this encoding. For all you know, the client may not even have "used"/parsed your HTML document containing the form element. It may help, and there is no reason why you shouldn't set the attribute. But it is not "reliable".

+3
Aug 22 '09 at 22:45

UTF-8 is fine and does not have any limitations that UTF-16 solves. PHP does not change the way it stores strings in memory (unlike Python). If the entire data flow uses UTF-8 (web forms accept UTF-8 data, the tables use utf8 encoding, you use SET NAMES utf8, and the data is saved without any charset conversion), you should be fine.

0
Aug 22 '09 at 22:14

For user form input, I add this attribute to the form tags: accept-charset="utf-8". That way, the data you receive should always be encoded in utf-8.

0
Aug 22 '09 at 22:38


