How to get only Arabic text from a string using regex?

Question

I have a string that contains both Arabic and English sentences. I only want to extract the Arabic sentences.

my_string=""" What is the reason ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ behind this? ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ """

This link shows that the Unicode range for Arabic letters is 0600-06FF.

So, a very simple attempt occurred to me:

 import re
 print re.findall(r'[\u0600-\u06FF]+',my_string)

But this fails, as it returns the following list.

 ['What', 'is', 'the', 'reason', 'behind', 'this?']

As you can see, this is the exact opposite of what I want. What am I missing here?

NB

I know that I can match Arabic letters using reverse matching, as shown below:

 print re.findall(r'[^a-zA-Z\s0-9]+',my_string)

But I do not want this.

+5

python string python-2.7 regex unicode

Ahsanul haque Apr 16 '16 at 8:16

3 answers

Your original code was almost correct. You just need to decode my_string using the correct encoding, "utf-8", and add the u prefix to your regex pattern, since you are working with Python 2:

 >>> for x in re.findall(ur'[\u0600-\u06FF]+', my_string.decode('utf-8')):
 ...     print x
 ...
 ذَلِكَ
 الْكِتَابُ
 لَا
 رَيْبَ
 فِيهِ
 هُدًى
 لِلْمُتَّقِينَ
 ذَلِكَ
 الْكِتَابُ
 لَا
 رَيْبَ
 فِيهِ
 هُدًى
 لِلْمُتَّقِينَ

This gives you a list of matching unicode strings (one per word) rather than single characters, so you do not need to join them with ''.join.

If you were on Python 3, you would not need to decode anything, since str is already Unicode text and the default source encoding is "utf-8":

 >>> for x in re.findall(r'[\u0600-\u06FF]+', my_string):
 ...     print(x)
 ...
 ذَلِكَ
 الْكِتَابُ
 لَا
 رَيْبَ
 فِيهِ
 هُدًى
 لِلْمُتَّقِينَ
 ذَلِكَ
 الْكِتَابُ
 لَا
 رَيْبَ
 فِيهِ
 هُدًى
 لِلْمُتَّقِينَ
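For readers on Python 3, here is a minimal self-contained sketch of the same idea. The helper name arabic_words is hypothetical (it is not part of the answer), and the sample string uses unvocalized text for brevity. It also illustrates that input arriving as bytes still has to be decoded before matching, here assuming UTF-8.

```python
import re

# Arabic block U+0600..U+06FF, the same range used in the answer.
ARABIC_RUN = re.compile(r'[\u0600-\u06FF]+')

def arabic_words(text):
    """Return every maximal run of characters from the Arabic block.

    Accepts str, or bytes that are assumed to be UTF-8 encoded.
    """
    if isinstance(text, bytes):
        # Bytes carry no code points, so decode before matching.
        text = text.decode('utf-8')
    return ARABIC_RUN.findall(text)

sample = "What is the reason ذلك الكتاب behind this?"
words = arabic_words(sample)
words_from_bytes = arabic_words(sample.encode('utf-8'))
print(words)
```

Because the space character lies outside the range, each word comes back as a separate list item.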

+2

Iron fist Apr 16 '16 at 9:26

Your code:

 print re.findall(r'[\u0600-\u06FF]+',my_string)

When matching byte sequences, there is no such thing as Unicode code points. Therefore, the \u escape sequences in the regex have no special meaning. They are not interpreted as you expected; each \u simply means the letter u.

Therefore, when parsing a regular expression for bytes, this is equivalent to:

 print re.findall(r'[u0600-u06FF]+',my_string)

This character class is interpreted as "one of u060, or a byte in the range 0-u, or one of 06FF". This, in turn, is equivalent to [0-u], since all the other bytes are already included in that range.

 print re.findall(r'[0-u]+', my_string)

Demonstration:

 my_string = "What is thizz?"
 print re.findall(r'[\u0600-\u06FF]+',my_string)
 ['What', 'is', 'thi', '?']

Note that zz does not match, because z comes after u in the ASCII character set and so falls outside the range 0-u.
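The equivalence claimed above can be checked with a short sketch. It is written for Python 3, which, as far as I know, rejects \u inside a bytes pattern instead of silently treating it as u, so the character class is spelled out by hand here. The variable names are illustrative only.

```python
import re

# The class as the Python 2 byte-mode parser effectively saw it.
original_class = re.compile(r'[u0600-u06FF]')
# The simplified class it is claimed to be equivalent to.
simplified_class = re.compile(r'[0-u]')

# Compare the two classes over every ASCII character.
ascii_chars = [chr(code) for code in range(128)]
matched_by_original = [c for c in ascii_chars if original_class.fullmatch(c)]
matched_by_simplified = [c for c in ascii_chars if simplified_class.fullmatch(c)]
same_set = matched_by_original == matched_by_simplified

# Reproduce the demonstration string from the answer.
demo = re.findall(r'[u0600-u06FF]+', "What is thizz?")
print(same_set, demo)
```

Both classes accept exactly the characters from 0 through u, which covers the digits, the uppercase letters, and the lowercase letters up to u, but not v through z.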

+2

Rolling illig Apr 16 '16 at 9:33

styvane · Accepted Answer · 2016-04-16T08:26:12+0000

You can use re.sub to replace the ASCII letters with an empty string.

 >>> my_string="""
 ... What is the reason
 ... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
 ... behind this?
 ... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
 ... """
 >>> print(re.sub(r'[a-zA-Z?]', '', my_string).strip())
 ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
 
 ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ

Your regex does not work because you are using Python 2, where my_string is a byte str; you need to convert my_string to unicode for the pattern to work. Your original code works as-is on Python 3.x.

 >>> print "".join(re.findall(ur'[\u0600-\u06FF]', unicode(my_string, "utf-8"), re.UNICODE))
 ذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَ
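Note that joining single-character matches, as in the output above, drops the spaces between words. If whole phrases are wanted instead, one possible variation (not part of this answer, and written for Python 3) is to let the pattern accept spaces or tabs between runs of Arabic characters. The name ARABIC_PHRASE and the sample text are assumptions made for this sketch.

```python
import re

# One Arabic-block run, optionally followed by more runs that are
# separated from it by spaces or tabs. The group is non-capturing,
# so findall returns the whole match.
ARABIC_PHRASE = re.compile(r'[\u0600-\u06FF]+(?:[ \t]+[\u0600-\u06FF]+)*')

text = "What is the reason ذلك الكتاب لا ريب فيه behind this? هدى للمتقين"
phrases = ARABIC_PHRASE.findall(text)
print(phrases)
```

Each list item is then a contiguous Arabic phrase with its internal spacing preserved, while the surrounding English text is skipped.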
