Do I need to convert everything I receive from the user agent (HTML form and URI) to UTF-8 when the page loads?
No. The user agent should already be submitting data in UTF-8; if it is not, you lose the benefit of Unicode.
The way to ensure the user agent submits in UTF-8 is to serve the page containing the form in the UTF-8 encoding. Use the Content-Type header (and a meta http-equiv too, if you intend the form to be saved and to work standalone).
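In PHP that might look like the following sketch; the variable names are mine, and the meta tag is only needed for the saved-to-disk case:

```php
<?php
// Sketch: declare UTF-8 for the page that contains the form, so the
// user agent submits the form data back as UTF-8.
$contentType = 'text/html; charset=utf-8';
header('Content-Type: ' . $contentType); // HTTP header (a no-op under CLI)

// If the page may be saved to disk and reopened without HTTP headers,
// repeat the declaration inside the markup as well:
$metaTag = '<meta http-equiv="Content-Type" content="' . $contentType . '">';
echo $metaTag, "\n";
```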
I heard that you should also mark up your forms as UTF-8 (accept-charset="UTF-8")
Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable encodings, but IE treats it as a list of additional encodings to try, on a per-field basis. So if you have an ISO-8859-1 page and a form with accept-charset="UTF-8", IE will first try to encode a field as ISO-8859-1, and only if it contains a non-8859-1 character will it fall back to UTF-8.
But since IE does not tell you whether it used ISO-8859-1 or UTF-8, that is of absolutely no use to you: you would have to guess, for each field separately, which encoding was used. Not helpful. Omit the attribute and serve your pages as UTF-8; that is the best you can do at the moment.
If a UTF-8 string is incorrectly encoded, will something go wrong?
If you let such a sequence through to the browser, you could be in trouble. There are "overlong sequences", which encode a low-numbered code point in a longer sequence of bytes than necessary. This means that if you filter '<' by looking for that ASCII character in a sequence of bytes, you could miss one and let a script element into what you thought was safe text.
Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their act together: IE would interpret the byte sequence '\xC0\xBC' as '<' up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily those older browsers are dying out, but it is still worth filtering overlong sequences, in case such browsers are still around now (or new careless browsers make the same mistake in the future). You can do this, and fix other bad sequences too, with a regex that lets only proper UTF-8 through, such as the one from the W3C.
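A sketch of that check in PHP, with the pattern adapted from the W3C's published example (the function name is my own):

```php
<?php
// Sketch: accept only well-formed UTF-8, rejecting overlong sequences
// and surrogates. Pattern adapted from the W3C internationalization FAQ.
function is_valid_utf8($bytes) {
    return preg_match('/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII (tab, LF, CR, printable)
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\z/x', $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // proper UTF-8 'café': true
var_dump(is_valid_utf8("\xC0\xBC"));    // overlong encoding of '<': false
```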
If you use the mb_ functions in PHP, you may be insulated from these problems. I cannot say for sure, as mb_* was unusable back when I was still writing PHP.
In any case, this is also a good time to remove control characters, which are a large and generally underestimated source of bugs. I would remove characters 9 and 13 from the submitted string in addition to the others the W3C regex takes out; it is also worth removing plain newlines from strings that you know should not be multi-line text fields.
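That cleanup could be sketched like this (the helper name and parameters are my own):

```php
<?php
// Sketch: strip C0 control characters and DEL from a submitted value,
// keeping the newline (\x0A) only when the field may be multi-line.
// Tab (\x09) and CR (\x0D) are always removed, per the advice above.
function strip_controls($value, $multiline = false) {
    $value = preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $value);
    if (!$multiline) {
        // Single-line fields should not contain newlines either.
        $value = str_replace("\x0A", '', $value);
    }
    return $value;
}

echo strip_controls("one\ttwo\r\nthree", true), "\n"; // "onetwo\nthree"
echo strip_controls("one\ttwo\r\nthree"), "\n";       // "onetwothree"
```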
Was UTF-16 designed to overcome a limitation of UTF-8?
No. UTF-16 is a two-bytes-per-code-point encoding that was devised to make indexing Unicode strings in memory easier, back when all of Unicode fit in two bytes (systems like Windows and Java still work that way). Unlike UTF-8, it is not compatible with ASCII and is practically unused on the Web. But you sometimes see it in saved files, usually ones saved by Windows users who were misled by Windows describing UTF-16LE as "Unicode" in its Save-As menus.
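The ASCII incompatibility is easy to see by comparing byte layouts; this sketch assumes the mbstring extension is available:

```php
<?php
// Sketch: the same text in UTF-8 and UTF-16LE (assumes ext/mbstring).
$utf8    = "A\xE2\x82\xAC"; // "A€": 'A' is one byte, '€' is three
$utf16le = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

echo bin2hex($utf8), "\n";    // 41e282ac - ASCII 'A' stays a single byte
echo bin2hex($utf16le), "\n"; // 4100ac20 - every code point is two bytes,
                              // and 'A' now contains a NUL byte
```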
seems_utf8
This is very inefficient compared to regex!
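If mbstring is available, its built-in validator is another alternative to both seems_utf8() and a hand-rolled regex (sketch; assumes the mbstring extension):

```php
<?php
// mb_check_encoding() returns true only for well-formed UTF-8;
// modern versions reject overlong sequences such as "\xC0\xBC".
var_dump(mb_check_encoding("caf\xC3\xA9", 'UTF-8')); // bool(true)
var_dump(mb_check_encoding("\xC0\xBC", 'UTF-8'));    // bool(false)
```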
Also, be sure to use utf8_unicode_ci for all of your tables.
In fact, you can get away without it, by treating MySQL as a store of bytes only and interpreting them as UTF-8 only in your script. The advantage of using utf8_unicode_ci is that it collates (sorts, and does case-insensitive comparisons) with knowledge of non-ASCII characters, so that e.g. "ŕ" and "Ŕ" are the same character. If you use a non-UTF-8 collation, you should stick with binary (case-sensitive) matching.
Whichever you choose, do it consistently: use the same character set for your tables as for your connection. What you want to avoid is a lossy character-set conversion happening between your scripts and the database.
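As a configuration sketch with mysqli (host, credentials, and database name here are placeholders):

```php
<?php
// Sketch: keep the connection character set consistent with the tables'
// character set, so MySQL never converts in between.
$db = new mysqli('localhost', 'user', 'password', 'mydb'); // placeholders
$db->set_charset('utf8'); // must match the CHARACTER SET of your tables

// Equivalent SQL, if you cannot call set_charset():
//   SET NAMES 'utf8';
```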