UTF-8 Special Character Count

I am looking for a way to count a special character that is made up of more than one character, but I have not found a solution online.

For example, I want to count the characters in the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it is reported as 9 characters when we use the usual way to find the length. I am wondering whether Tamil is the only script that causes this problem, and whether there is a solution to it. I'm currently trying to find a solution in C#.

Thanks in advance =)

2 answers

Use StringInfo.LengthInTextElements:

    using System;
    using System.Globalization;

    var text = "வாழைப்பழம";
    Console.WriteLine(text.Length);                               // 9
    Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6

An explanation of this behavior can be found in the String.Length documentation:

The Length property returns the number of Char objects in this instance, and not the number of Unicode characters. The reason is that a Unicode character can be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
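To see which Char values are grouped together, the text elements can also be listed one by one. The following is only a minimal sketch: the class name `TextElements` and the helper `Split` are hypothetical names chosen for illustration, while `StringInfo.GetTextElementEnumerator` is the framework method doing the actual work.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElements
{
    // Hypothetical helper: returns each text element of the input as its own string.
    public static List<string> Split(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // Ask the console to emit UTF-8 so the Tamil text is not replaced by '?'.
        Console.OutputEncoding = Encoding.UTF8;

        foreach (string element in Split("வாழைப்பழம"))
        {
            // element.Length is the number of Char values inside this text element.
            Console.WriteLine($"{element}: {element.Length} Char");
        }
    }
}
```

For the string from the question this should list six elements, three of which (வா, ழை and ப்) consist of two Char values each, which accounts for the difference between 9 and 6.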


A little nitpick: strings in .NET use UTF-16, not UTF-8.


When you talk about the length of a string, you can mean several different things:

  1. Length in bytes. This is usually the old C way of looking at things.
  2. Length in Unicode code points. This gets you closer to modern times and is how string lengths should be treated, except that it usually isn't.
  3. Length in UTF-8 / UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
  4. Count of visible "characters" (graphemes). This is usually what people mean when they say "characters" or the length of a string.
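The different lengths listed above can be compared side by side in C#. This is a rough sketch, not a definitive implementation: the class `StringLengths` and its method names are hypothetical, UTF-8 is assumed for the byte count, and the grapheme count relies on `StringInfo`, whose exact segmentation rules can differ between .NET versions for some scripts.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringLengths
{
    // Length in bytes, assuming the string is encoded as UTF-8.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // Length in UTF-16 code units: this is what String.Length reports.
    public static int Utf16CodeUnits(string s) => s.Length;

    // Length in Unicode code points: a surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // Skip the low surrogate when two Char values form one code point.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    // Count of text elements (graphemes) as determined by StringInfo.
    public static int Graphemes(string s) => new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        string tamil = "வாழைப்பழம";
        Console.WriteLine(Utf8Bytes(tamil));      // 27
        Console.WriteLine(Utf16CodeUnits(tamil)); // 9
        Console.WriteLine(CodePoints(tamil));     // 9
        Console.WriteLine(Graphemes(tamil));      // 6

        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
        // so it needs two UTF-16 code units for a single code point.
        string clef = "\U0001D11E";
        Console.WriteLine(Utf16CodeUnits(clef));  // 2
        Console.WriteLine(CodePoints(clef));      // 1
    }
}
```

The G clef is included only to show a case where code units and code points disagree; for the Tamil string both give 9.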

In your case, the confusion stems from the difference between 4 and 3: 3 is what C# uses, 4 is what you expect. Complex scripts such as Tamil use ligatures and diacritics. A ligature is a contraction of two or more adjacent characters into a single glyph; in your case ழை is a ligature of ழ and ை, the latter of which changes the appearance of the former, and வா is another such ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.

Both cases I mentioned result in a single grapheme (what you perceive as one character), yet each of them needs two actual characters. Thus, you end up with three more code points in the string than graphemes: 9 instead of 6.

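One note: in your case the difference between 2 and 3 does not matter, because every character in the string lies in the Basic Multilingual Plane and therefore takes exactly one UTF-16 code unit. In general, though, you should keep the distinction in mind.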


Source: https://habr.com/ru/post/918208/

