How to determine if a sequence of code points is a natural symbol?

Good afternoon everyone

I am writing a function that takes a string as input, removes any unnatural sequences of combining diacritical marks from it, and returns the modified string.

An unnatural diacritical sequence is a sequence of Unicode code points that, when combined, produces output that does not belong to any language under the sun (ancient scripts and languages count as natural languages).

For example, when entering a string:

"aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa" //code points 0061 0061 0061 0300 0301 0302 0303 0304 0305 0306 0307 0308 0309 030a 030b 030c 030d 030e 030f 0310 0311 0312 0313 0314 0315 0316 0317 0318 0319 031a 031b 031c 031d 031e 031f 0320 0321 0322 0323 0324 0325 0326 0327 0328 0329 032a 032b 032c 032d 032e 032f 032f 0330 0331 0332 0333 0334 0335 0336 0337 0338 0339 033a 033b 033c 033d 033e 033f 0340 0341 0342 0343 0344 0345 0346 0347 0348 0349 034a 034b 034c 034d 034e 0360 0361 0061 0061 

the function should return aaàaa (code points 0061 0061 0061 0300 0061 0061), since à́ (code points 0061 0300 0301) is not a symbol in any natural language. In other words:

  assert F("aaà̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚͠͡ͅaa").equals("aaàaa"); 

Or, for source code stored in a Latin encoding:

  assert F("\u0061\u0061\u0061\u0300\u0301\u0302\u0303\u0304\u0305\u0306\u0307\u0308\u0309\u030a\u030b\u030c\u030d\u030e\u030f\u0310\u0311\u0312\u0313\u0314\u0315\u0316\u0317\u0318\u0319\u031a\u031b\u031c\u031d\u031e\u031f\u0320\u0321\u0322\u0323\u0324\u0325\u0326\u0327\u0328\u0329\u032a\u032b\u032c\u032d\u032e\u032f\u032f\u0330\u0331\u0332\u0333\u0334\u0335\u0336\u0337\u0338\u0339\u033a\u033b\u033c\u033d\u033e\u033f\u0340\u0341\u0342\u0343\u0344\u0345\u0346\u0347\u0348\u0349\u034a\u034b\u034c\u034d\u034e\u0360\u0361\u0061\u0061").equals("\u0061\u0061\u0061\u0300\u0061\u0061"); 

How can we determine whether a sequence of characters, or a sequence of Unicode code points, is natural?

Or rather, is there a limit on the number of combining diacritical marks that a character belonging to a natural language will use?

+4
3 answers

From the Unicode Standard, version 6.0:

All combining characters can be applied to any base character and can, in principle, be used with any script. As with other characters, the allocation of a combining character to one block or another identifies only its primary usage; it is not intended to define or limit the range of characters to which it may be applied. In the Unicode Standard, all sequences of character codes are permitted.

This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to Han characters or to Devanagari consonants is permitted, it is unlikely to be supported well in rendering or to make sense.

Unicode data is unlikely to have enough information to do this algorithmically.

There are some rules for canonical composition and decomposition that you could use to decide whether a sequence is natural, for example the mapping of U+0065 U+0301 to U+00E9 (é). But this will not work for every case.
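A minimal sketch of that composition check, using the standard `java.text.Normalizer`. The class and method names are hypothetical, and treating "composes to a single code point" as a test for naturalness is only the heuristic described above, not a rule from the Unicode Standard. The Cyrillic example (а with a stress mark, as printed in some Russian dictionaries) is one case where the check fails for a sequence that does occur in real text.

```java
import java.text.Normalizer;

class CompositionCheck {
    // Heuristic: true if the whole sequence collapses to one precomposed
    // code point under canonical composition (NFC).
    static boolean composesToSingleCodePoint(String sequence) {
        String nfc = Normalizer.normalize(sequence, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length()) == 1;
    }

    public static void main(String[] args) {
        // Latin e + combining acute: a precomposed form exists.
        System.out.println(composesToSingleCodePoint("e\u0301"));       // true
        // Latin a + grave + acute: only the grave composes, the acute remains.
        System.out.println(composesToSingleCodePoint("a\u0300\u0301")); // false
        // Cyrillic a + combining acute: no precomposed form exists.
        System.out.println(composesToSingleCodePoint("\u0430\u0301"));  // false
    }
}
```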

Other than that, I'm not sure what you could do without some form of validation table, either compiled by experts or derived from a corpus of language data.

+2

I think you just need Character.isLetter(). I tried it with English, Russian and Hebrew characters, and it returns true for all letters and false for all characters that are not letters.

I do not know whether characters like '.', ',' and so on count as natural, but you can easily list all of these characters if you need them.
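For reference, a small sketch of what `Character.isLetter` and `Character.getType` report for the code points in the question. The class and helper names are hypothetical. Note that these methods classify one code point at a time, so they can tell a base letter from a combining mark but say nothing about whether a particular combination is natural.

```java
class LetterCheck {
    // True if the code point belongs to one of the three "Mark" general categories.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        System.out.println(Character.isLetter(0x0061)); // true:  LATIN SMALL LETTER A
        System.out.println(Character.isLetter(0x00E9)); // true:  precomposed e with acute
        System.out.println(Character.isLetter(0x0300)); // false: COMBINING GRAVE ACCENT
        System.out.println(isCombiningMark(0x0300));    // true
        System.out.println(isCombiningMark(0x0061));    // false
    }
}
```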

+1

An unnatural diacritical sequence is a sequence of Unicode code points that, when combined, produces output that does not belong to any language under the sun

I am afraid that you will not be able to satisfy this requirement without knowing all the languages under the sun.

The closest you can get using only the standard Unicode data is to normalize to NFKC and check whether any characters with a combining class are left uncomposed. This tells you nothing about natural languages; it relies on the heuristic that a precomposed character is likely to have been defined for combinations that are widely used. That holds for the most common simple alphabets, which may be enough for you.
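A rough sketch of that heuristic with `java.text.Normalizer`: normalize to NFKC, then drop every mark that did not get absorbed into a precomposed character. The class and method names are hypothetical, and the regex uses the general category `\p{M}` as a stand-in for "has a combining class". Two caveats: NFKC also rewrites compatibility characters such as ligatures, and the result comes back in composed form, so compare after normalizing both sides to the same form.

```java
import java.text.Normalizer;

class MarkStripper {
    // Heuristic only: keeps marks that NFKC could fold into a precomposed
    // character and removes every mark that remains afterwards.
    static String stripUncomposedMarks(String input) {
        String nfkc = Normalizer.normalize(input, Normalizer.Form.NFKC);
        return nfkc.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String cleaned = stripUncomposedMarks("aaa\u0300\u0301\u0302aa");
        // The grave composes with the third a; the acute and circumflex are dropped.
        System.out.println(cleaned.equals("aa\u00E0aa")); // true
        // Decompose again to compare with the form used in the question.
        String decomposed = Normalizer.normalize(cleaned, Normalizer.Form.NFD);
        System.out.println(decomposed.equals("aaa\u0300aa")); // true
    }
}
```

Because it removes every mark without a precomposed form, this sketch would also strip legitimate sequences such as stacked Tibetan consonants, so it only suits scripts whose accented letters are all precomposed.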

Is there a limit on the number of combining diacritical marks that a character belonging to a natural language will use?

No. UAX #15 does define a practical limit: text in the "Stream-Safe Text Format" must not contain a run of more than 30 consecutive combining characters (non-starters). This suggests that the Unicode Standard will generally avoid encoding characters in a way that forces real-world text to use long runs of combining marks.
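A sketch of how that limit could be used as a sanity check: measure the longest run of consecutive marks and compare it with 30. The names are hypothetical. This is an approximation, since UAX #15 counts non-starters (canonical combining class other than 0) in the NFKD form, while `java.lang.Character` does not expose combining classes, so the code counts code points in the Mark general categories instead.

```java
class CombiningRun {
    // Limit on consecutive non-starters from the Stream-Safe Text Format (UAX #15).
    static final int STREAM_SAFE_LIMIT = 30;

    // Approximation: general category Mn, Mc or Me instead of combining class != 0.
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Length, in code points, of the longest run of consecutive marks.
    static int longestMarkRun(String text) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            current = isMark(codePoint) ? current + 1 : 0;
            longest = Math.max(longest, current);
            i += Character.charCount(codePoint);
        }
        return longest;
    }

    static boolean exceedsStreamSafeLimit(String text) {
        return longestMarkRun(text) > STREAM_SAFE_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(longestMarkRun("aaa\u0300aa"));         // 1
        System.out.println(exceedsStreamSafeLimit("aaa\u0300aa")); // false
    }
}
```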

The longest natural grapheme cluster that I know of is:

 ཧྐྵྨླྺྼྻྂ 

(one base character followed by eight combining characters.)

+1

Source: https://habr.com/ru/post/1393534/

