Parsing Arabic / RTL text from left to right

Let's say I have a string containing RTL text, for example Arabic with some English mixed in:

string s = "Test:لطيفة;اليوم;a;b";

Notice that there are semicolons in the string. When I use Split, for example string[] spl = s.Split(';');, some of the pieces appear to be stored in reverse order. Here's what happens:

spl[0] = "Test:لطيفة"
spl[1] = "اليوم"
spl[2] = "a"
spl[3] = "b"

The above is out of order compared to the string as it is displayed. Instead, I expected to get the following:

spl[0] = "Test:اليوم"
spl[1] = "لطيفة"
spl[2] = "a"
spl[3] = "b"

I am prepared to write my own split function. However, the characters in the string are also indexed in reverse order, so I'm back to square one. I just want to walk through the characters in the order they are shown on screen.

+5
4 answers

As your string currently stands, the word لطيفة is stored before the word اليوم; the fact that اليوم is displayed "first" (that is, to the left) is just the (correct) result of the Unicode Bidirectional Algorithm when the text is displayed.

That is: the string you start with ("Test:لطيفة;اليوم;a;b") is the result of the user typing "Test:", then لطيفة, then ";", then اليوم, and then ";a;b". So the way C# splits it does in fact mirror the way the string was created. The creation order is simply not reflected in the way the string is displayed, because the two consecutive Arabic words are treated as a single unit when they are rendered.
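To see the storage order directly, one can dump the string one UTF-16 code unit at a time. The following is only an illustrative sketch; the names LogicalOrder and Describe are made up for this example and do not come from the answer above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class LogicalOrder
{
    // Lists every UTF-16 code unit with its index, in storage (logical) order.
    // Display direction plays no part here: index 0 is whatever was typed first.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            sb.AppendFormat(CultureInfo.InvariantCulture, "{0}:U+{1:X4} ", i, (int)s[i]);
        }
        return sb.ToString().TrimEnd();
    }
}
```

For the string in the question, s.IndexOf("لطيفة", StringComparison.Ordinal) is 5 and s.IndexOf("اليوم", StringComparison.Ordinal) is 11, so لطيفة really does come first in memory even though it is drawn further to the right.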

If you want the string to display the Arabic words in left-to-right order with semicolons between them, and also to store the words in that same order, then you should place a Left-to-Right Mark (U+200E) after the semicolon. This effectively sections off each Arabic word as its own unit, and the Bidirectional Algorithm will then treat each word separately.

For example, the following code starts with a string that displays identically to the one you are using (with a single Left-to-Right Mark added), yet it splits the way you expect, i.e. spl[0] = "Test:اليوم" and spl[1] = "لطيفة". Note that spl[1] then begins with the invisible U+200E character:

static void Main(string[] args)
{
    // \u200E is the Left-to-Right Mark placed after the first semicolon.
    string s = "Test:اليوم;\u200Eلطيفة;a;b";
    string[] spl = s.Split(';');
}
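Because the mark stays inside the split pieces, it may be convenient to add and strip it in code. The sketch below assumes ';' is the only separator of interest; the class LrmSplit and its method names are hypothetical helpers, not anything the framework provides.

```csharp
using System;

public static class LrmSplit
{
    // U+200E LEFT-TO-RIGHT MARK: invisible, but strongly left-to-right.
    public const char Lrm = '\u200E';

    // Puts a Left-to-Right Mark after every semicolon so that each Arabic
    // word is laid out as its own unit when the string is displayed.
    public static string IsolateFields(string s)
    {
        return s.Replace(";", ";" + Lrm);
    }

    // Splits on the separator and removes marks from the ends of each piece,
    // so comparisons against plain text still succeed.
    public static string[] SplitAndStrip(string s, char separator)
    {
        string[] parts = s.Split(separator);
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Trim(Lrm);
        }
        return parts;
    }
}
```

With the string from the code above, LrmSplit.SplitAndStrip(s, ';')[1] has length 5 and equals "لطيفة", whereas plain s.Split(';')[1] has length 6 because of the leading mark.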
+11

You can also use Microsoft's Uniscribe library. The ScriptItemize function breaks the string into items (runs of characters) and reports, for each one, its starting index in the source string and whether it is RTL. Using this information, you can find consecutive runs that contain only Arabic. Splitting those on ';' and reversing their order will give you what you need.
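The same idea can be roughly approximated in managed code without calling Uniscribe. The sketch below swaps ScriptItemize for a regular expression and is far cruder than the real Bidirectional Algorithm: it assumes the RTL text lies entirely in the Arabic block (U+0600 to U+06FF), that ';' is the only thing separating adjacent Arabic words, and it ignores digits, spaces and other neutrals. The class name VisualOrder is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VisualOrder
{
    // Two or more Arabic-block words joined only by semicolons. On screen such
    // a run is laid out right to left as one unit, so its words appear reversed.
    private static readonly Regex ArabicRun =
        new Regex(@"\p{IsArabic}+(?:;\p{IsArabic}+)+");

    // Splits on ';' and returns the pieces in the order they appear on screen
    // (left to right), under the simplifying assumptions stated above.
    public static string[] Split(string s)
    {
        string reordered = ArabicRun.Replace(s, match =>
        {
            string[] words = match.Value.Split(';');
            Array.Reverse(words); // storage order -> left-to-right visual order
            return string.Join(";", words);
        });
        return reordered.Split(';');
    }
}
```

For the string in the question, VisualOrder.Split(s) returns "Test:اليوم", "لطيفة", "a", "b", which is the order the question asks for. Strings with no such Arabic run come back exactly as s.Split(';') would return them.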

+2

The strings are not reversed; they are actually split in the correct order. RTL languages are RTL when displayed, but they are stored in memory "left to right", just like English. I will try to demonstrate, which is a little complicated since I do not have an Arabic keyboard installed.

Say your string is s = "Arbi/Arbi, Alarbia" (transliterated). s[0] is A (the Arabic 'Ayn), s[1] is R, and so on; s[4] is / and s[9] is the comma. Therefore, when you split on the comma, you get s[0..8] in the first part and s[10..] in the second.

This is the correct way to handle RTL strings. If you want the opposite, you need to modify the array yourself.

Keep in mind that switching between RTL and LTR is one of the most annoying tasks. You have no idea how much time you will spend figuring out what to do with numbers or English words inside RTL strings. The best you can do is to avoid the problem altogether and just try to get Excel to show the rows as RTL.

+1

It seems (according to Reflector) that Split internally uses Substring, which in turn relies on an internal function that simply copies characters in storage order, from left to right, without any cultural considerations. Because of this, I see no way to just change the array returned by Split.

0

Source: https://habr.com/ru/post/892121/