How to make casefold() work on some Arabic Unicode strings

I'm having trouble detecting "equality" between some Arabic word pairs in Python 2.7:

  • أكثر vs اكثر
  • قائمة vs قائمه
  • إنشاء vs انشاء

The elements of each pair are not actually identical; as far as I can tell, they are written with different "case". A useful analogy for me (I don't know a single word of Arabic) is Word vs. word. They are not identical, but if I lowercase them both, I get word vs. word, which are identical. This is what I want to get from these three pairs of Arabic words.

I will show what I tried using the first pair (1. أكثر vs اكثر). By the way, the meaning of both Arabic words in the first pair is "more", but they have different case (as a parallel: More vs. more). I don't know the Arabic language or its rules at all, so if someone who knows Arabic can confirm that these words are "identical", that would be great.

str1 = u'أكثر'
str2 = u'اكثر'

So I'm trying to bring str1 and str2 to the same form (if possible); that is, I want a function that produces the same output for both strings:

transform(str1) == transform(str2)

In English this can be achieved easily:

a = u'More'
b = u'more'

def transform(text):
    return text.lower()

>>> transform(a) == transform(b)
True

But of course, this does not work in Arabic, as there are no such things as lowercase or uppercase.

>>> str1
u'\u0623\u0643\u062b\u0631'

>>> str2
u'\u0627\u0643\u062b\u0631'

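As you can see, the Unicode code points of the two strings differ in the first character.

So I tried Unicode normalization: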

import unicodedata

>>> n_str1 = unicodedata.normalize('NFKD', str1)
>>> n_str2 = unicodedata.normalize('NFKD', str2)

>>> n_str1
u'\u0627\u0654\u0643\u062b\u0631'

>>> n_str2
u'\u0627\u0643\u062b\u0631'

But the normalized strings are still not equal:

>>> n_str1 == n_str2
False

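I also looked at unicode.casefold(), but it does not exist in Python 2. There is the py2casefold package, but it did not help. In Python 3, where casefold() is available, I get: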

>>> str1.casefold() == str2.casefold()
False

>>> n_str1.casefold() == n_str2.casefold()
False

I would prefer a solution for Python 2, but one for Python 3 is fine too.


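The strings u'أكثر' and u'اكثر' are different. The first one begins with this letter:

Alif with Hamza (أ)

while the second one begins with a plain Alif (ا), without the Hamza:

Alif (ا)

These are two distinct characters with different code points: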

>>> u'أكثر'; u'اكثر'
u'\u0623\u0643\u062b\u0631'
u'\u0627\u0643\u062b\u0631'
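
To see exactly which characters differ, the standard unicodedata module can report each character's name and decomposition. This is a small sketch; the helper name describe is made up for illustration, and the strings are written with escapes so the snippet runs on both Python 2 and Python 3:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, decomposition) for each character."""
    return [
        (u'U+%04X' % ord(ch),
         unicodedata.name(ch, u'?'),
         unicodedata.decomposition(ch))
        for ch in text
    ]

first = describe(u'\u0623\u0643\u062b\u0631')   # first word of pair 1
second = describe(u'\u0627\u0643\u062b\u0631')  # second word of pair 1

# Print the two words side by side, one character per line.
for a, b in zip(first, second):
    print(u'%s %s | %s %s' % (a[0], a[1], b[0], b[1]))
```

Only the first line of the output differs: ARABIC LETTER ALEF WITH HAMZA ABOVE on one side and ARABIC LETTER ALEF on the other. The decomposition of U+0623 is "0627 0654", which is why NFKD turns it into Alif followed by a combining Hamza Above (U+0654).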

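So this is not a Word vs word situation, because Arabic has no upper or lower case. The second word in each pair is simply a different (informal or incorrect) spelling of the first. A rough English analogy for the three pairs would be:

1- more, moore

2- menu, manu

3- establish, estblish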

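As for the first pair (1. أكثر vs اكثر): both are read as "more", but the difference between them is spelling, not case.

The correct spelling is أكثر; اكثر is the same word written without the Hamza.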
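
If the goal is to treat such spelling variants as equal anyway, one possible approach is to normalize with NFKD, drop the combining marks, and map Ta Marbuta (U+0629) to Ha (U+0647). This is only a sketch under assumptions: the name transform is taken from the question, the table EXTRA_FOLDS and its single rule are my own guess at what should count as "equal", and the folding is lossy, so it can also merge words that are genuinely different.

```python
import unicodedata

# Assumed extra folding rule: Ta Marbuta (U+0629) -> Ha (U+0647).
EXTRA_FOLDS = {0x0629: u'\u0647'}

def transform(text):
    """Fold some common Arabic spelling variants to one form (lossy)."""
    # NFKD splits e.g. U+0623 into Alif (U+0627) + combining Hamza Above (U+0654).
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop every combining mark (hamza above/below, short vowels, shadda, ...).
    stripped = u''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Apply the remaining letter-level folds.
    return stripped.translate(EXTRA_FOLDS)

# The three pairs from the question, written as escapes.
pairs = [
    (u'\u0623\u0643\u062b\u0631', u'\u0627\u0643\u062b\u0631'),
    (u'\u0642\u0627\u0626\u0645\u0629', u'\u0642\u0627\u0626\u0645\u0647'),
    (u'\u0625\u0646\u0634\u0627\u0621', u'\u0627\u0646\u0634\u0627\u0621'),
]
results = [transform(a) == transform(b) for a, b in pairs]
print(results)  # [True, True, True]
```

Because every combining mark is removed, this also strips vowel marks and turns ئ into ي, which may or may not be acceptable depending on the application.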


Source: https://habr.com/ru/post/1683970/

