Properties of Combining Diacritics

Are combining diacritics considered letters? As far as I know, they can only be combined with other letters in well-formed Unicode.

ICU's function for determining whether a Unicode code point is a letter takes only a single code point, so for any given code point it cannot know whether it was combined with a diacritic or, if it is itself a diacritic, what it was combined with. I am trying to implement something like a Unicode-aware regex using a construct of the form

while(is_letter(codepoint)) 

However, I am quite concerned about what happens if the code point is actually a diacritic that should be matched together with the previous code point, and likewise for other combining marks.

Is it safe to do this? Or will I have to explicitly find and ignore diacritics and other marks?

Edit: What I really need is to iterate over characters, not code points.

This question is a victim of the XY problem. I need to ask a question about my real problem.

1 answer

I do not quite understand what you are trying to do, so I apologize in advance if this is not the answer you are looking for, but:

Are combining diacritics considered letters?

Broadly speaking, combining diacritics are classified as "marks" rather than "letters". For example, U+0301 COMBINING ACUTE ACCENT, as in <ś>, is a "nonspacing mark", which is one of the three kinds of "mark". However, the "modifier letters", which are classified as "letters", might nonetheless be thought of as diacritics; for example, U+02C0 MODIFIER LETTER GLOTTAL STOP, as in <sˀ>, is a "modifier letter".
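To make the distinction concrete, here is a minimal sketch. It uses Python's standard unicodedata module as a stand-in for ICU, and the helper names is_letter, is_mark and letter_run_length are hypothetical, not part of any library. It looks up the general category of a code point and shows one way a loop like the one in the question could absorb the marks that follow a letter. This is a simplification and not full grapheme-cluster segmentation.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # General categories starting with "L": Lu, Ll, Lt, Lm, Lo.
    return unicodedata.category(ch).startswith("L")


def is_mark(ch: str) -> bool:
    # General categories starting with "M": Mn, Mc, Me.
    return unicodedata.category(ch).startswith("M")


def letter_run_length(text: str, start: int = 0) -> int:
    """Count the code points in a run of letters, keeping marks that follow a letter."""
    i = start
    while i < len(text) and is_letter(text[i]):
        i += 1
        # Absorb any combining marks attached to the letter just consumed.
        while i < len(text) and is_mark(text[i]):
            i += 1
    return i - start


print(unicodedata.category("\u0301"))  # Mn (nonspacing mark)
print(unicodedata.category("\u02C0"))  # Lm (modifier letter)
print(letter_run_length("s\u0301s\u02C0 x"))  # 4
```

With this approach a mark is never tested by is_letter on its own; it is only consumed when it directly follows a letter, so a stray mark at the start of the input ends the run immediately.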

If you look at the main Unicode Character Database file, UnicodeData.txt (warning: this is a 1.3 megabyte text file), you can get a feel for which characters are classified as "modifier letters" (Lm) and which as "nonspacing marks" (Mn), "spacing marks" (Mc), or "enclosing marks" (Me).
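As an alternative to reading the file by hand, the same categories can be sampled programmatically. The sketch below again assumes Python's unicodedata module rather than ICU; the function name sample_by_category and its parameters are hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata
from collections import defaultdict

# The four general categories discussed above.
WANTED = ("Lm", "Mn", "Mc", "Me")


def sample_by_category(limit: int = 0x3000, per_category: int = 3):
    """Collect the first few code points of each wanted category below `limit`."""
    samples = defaultdict(list)
    for cp in range(limit):
        cat = unicodedata.category(chr(cp))
        if cat in WANTED and len(samples[cat]) < per_category:
            samples[cat].append(cp)
    return samples


for cat, cps in sorted(sample_by_category().items()):
    names = ", ".join(f"U+{cp:04X} {unicodedata.name(chr(cp), '?')}" for cp in cps)
    print(cat, names)
```

Raising limit or per_category gives a broader view; the default range stops well below the surrogate block, so every value passed to chr is an ordinary code point.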


Source: https://habr.com/ru/post/1383233/