Unicode normalization according to W3C in PHP

Question

When checking the HTML code of my website in the W3C validator, I received the following warning:

Line 157, Column 220: Text run is not in Unicode Normalization Form C. …i͈̭̋ͥ̂̿̄̋̆ͣv̜̺̋̽͛̉͐̀͌̚e͖̼̱ͣ̓ͫ͆̍̄̍͘-̩̬̰̮̯͇̯͆̌ͨ́͌ṁ̸͖̹͎̱̙̱͟͡i̷̡͌͂͏̘̭̥̯̟n̏͐͌̑̄̃͘͞…

I am developing it in PHP 5.3.x, so I can use the Normalizer class.

So, to fix this, should I use Normalizer::normalize($output) when displaying any user input (like a comment), or should I use Normalizer::normalize($input) for any user input before saving it to database?

tl;dr: should I apply Unicode normalization before storing user input in the database, or only when displaying it?

php unicode web-standards normalization unicode-normalization

federicot Jan 7 '12 at 1:52

2 answers

Strictly speaking, the rules of the web character model require more than being normalized to NFC: the text must be in NFC as written, and it must still be in NFC after any construct that includes text through another mechanism has been expanded. Examples of such constructs are XML inclusion, character references, and entity references. For example, a followed by the character reference &#x0308; does not conform to the character model: the source text is in NFC, but expanding the character reference turns it into a followed by a combining diaeresis, which is not NFC. Avoiding this is mostly easy in practice, but it is worth noting.
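The character-reference case can be checked with a minimal sketch like the one below. It assumes the intl extension is loaded and uses html_entity_decode() as a stand-in for the expansion an HTML or XML parser would perform; the variable names are illustrative only.

```php
<?php
// Requires the intl extension (Normalizer class).

// Source text as written in the markup: "a" plus a character reference
// for U+0308 COMBINING DIAERESIS. It is pure ASCII, so it is already NFC.
$source = 'a&#x0308;';

// Stand-in for what a parser sees after expanding the reference:
// "a" followed by U+0308.
$expanded = html_entity_decode($source, ENT_QUOTES, 'UTF-8');

$sourceIsNfc   = Normalizer::isNormalized($source, Normalizer::FORM_C);
$expandedIsNfc = Normalizer::isNormalized($expanded, Normalizer::FORM_C);

// Normalizing the expanded text yields the precomposed U+00E4
// (bytes C3 A4 in UTF-8).
$normalized = Normalizer::normalize($expanded, Normalizer::FORM_C);

var_dump($sourceIsNfc, $expandedIsNfc, $normalized === "\xC3\xA4");
```

The point of the sketch is that a check run on the raw markup can pass while the text the reader actually receives is not in NFC.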

There is an interesting case with U+0338 (combining long solidus overlay): > followed by U+0338 normalizes to ≯, and < followed by U+0338 normalizes to ≮. The reasons why U+0338 should not be allowed at the start of an element name, or as the first character of an element's content, should be clear.

As a rule, it is not meaningful for a piece of text to begin with a combining character anyway, but this particular example can make the entire document malformed (even if you do not normalize it yourself, because something else might).
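A rough sketch of the U+0338 problem, assuming the intl extension and PCRE with Unicode support. The helper starts_with_combining_mark() is a hypothetical name introduced here for illustration; it is not part of PHP or of the answer.

```php
<?php
// Requires the intl extension and PCRE with UTF-8 support.

// Hypothetical helper: true if $text begins with a combining mark
// (Unicode general category M).
function starts_with_combining_mark($text)
{
    return preg_match('/^\p{M}/u', $text) === 1;
}

// Element content that begins with U+0338 (bytes CC B8), then "x".
$content  = "\xCC\xB8x";
$document = '<b>' . $content . '</b>';

// Normalizing the whole document composes ">" + U+0338 into U+226F
// (bytes E2 89 AF), so the start tag loses its closing ">".
$normalizedDocument = Normalizer::normalize($document, Normalizer::FORM_C);
$tagIsBroken = ($normalizedDocument === "<b\xE2\x89\xAFx</b>");

var_dump(starts_with_combining_mark($content), $tagIsBroken);
```

One possible use is to reject or strip such input before it is placed into markup; whether that suits a given application is a separate decision.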

If you are only concerned with the text as text (for example, digital signatures are not a concern), then normalizing on input simplifies the rest of what you do, including your internal use of the text (such as searching), so that is probably the way to go.
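Normalizing on input might look roughly like this. It is a sketch, not a definitive implementation: normalize_for_storage() is a hypothetical helper name, the intl extension is assumed, and how to treat input that cannot be normalized is left to the application.

```php
<?php
// Requires the intl extension (Normalizer class).

// Hypothetical helper: normalize user input to NFC before it is stored.
// Returns null when Normalizer::normalize() does not return a string
// (for example, when the input is not valid UTF-8), so the caller can
// decide whether to reject the input.
function normalize_for_storage($input)
{
    $normalized = Normalizer::normalize($input, Normalizer::FORM_C);

    return is_string($normalized) ? $normalized : null;
}

// "u" + U+0308 (decomposed) becomes the precomposed U+00FC.
$comment = normalize_for_storage("u\xCC\x88ber");
var_dump($comment === "\xC3\xBCber");
```

With this approach everything read back from the database is already in NFC, so searching and comparison code does not need to normalize again.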

See http://www.w3.org/TR/charmod-norm/ for more details.

Jon Hanna Jan 12 '12 at 9:10

Jukka K. Korpela · Accepted Answer · 2012-01-07T09:15:43+0000

You should decide, based on the purpose and nature of your application, whether to apply normalization when reading user input, when storing it in a database, when writing it out, or not at all. To summarize the long thread mentioned in the comments on the question, also available in the official list archive at http://validator.w3.org/feedback.html:

The warning message comes from the experimental "HTML5 validator" (which is really a linter, applying subjective rules in addition to some formal tests).
The message is not based on any requirement in the HTML5 drafts, but on opinions about what might cause problems in some software.
Those opinions originally made the "HTML5 validator" issue an error message; it is now a warning.

It is certainly possible, though uncommon, to receive unnormalized data as user input. This does not depend on normalization performed by browsers (they do no such thing, though they conceivably might in the future), but on input methods and habits. For example, input methods for the letter ü (u with umlaut, or u with diaeresis) tend to produce the character in precomposed form, that is, already normalized. People can enter it unnormalized, in decomposed form, as the letter u followed by a combining diaeresis, but they usually have no reason to, and most people would not even know how.

If you compare strings in your software, the comparison may or may not (depending on the comparison routines used) treat, for example, precomposed ü as equal to its decomposed representation. Simple implementations treat them as different, because they definitely differ at the simple character level (Unicode code points).
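In PHP the difference is easy to see, since the === operator compares strings byte by byte. The sketch below assumes the intl extension; nfc_equals() is a hypothetical helper name used only for illustration.

```php
<?php
// Requires the intl extension (Normalizer class).

$precomposed = "\xC3\xBC";   // U+00FC, u with diaeresis as one code point
$decomposed  = "u\xCC\x88";  // "u" followed by U+0308 COMBINING DIAERESIS

// Byte-wise comparison sees two different strings.
$rawEqual = ($precomposed === $decomposed);

// Hypothetical helper: compare two strings after normalizing both to NFC.
// Returns false if either string cannot be normalized.
function nfc_equals($a, $b)
{
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);

    return is_string($na) && is_string($nb) && $na === $nb;
}

$normalizedEqual = nfc_equals($precomposed, $decomposed);

var_dump($rawEqual, $normalizedEqual);
```

If input is already normalized before it is stored, a plain === comparison on stored values gives the same result as nfc_equals().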

One reason to normalize at some point, at the latest when writing output, is that precomposed characters are generally displayed more reliably. To render a normalized ü, a program just has to pick a glyph from a font. To render a decomposed ü, a program must either recognize it as canonically equivalent to the normalized ü, or draw the letter u with a properly placed diacritic mark above it, paying due attention to the graphic properties of the glyph for u, and many programs fail at this.

On the other hand, in the rare cases where unnormalized data is received as user input, the user may well have had a reason for producing it. They may have the idea that normalized ü and unnormalized ü are different and should be treated as such.
