Identifying if a character is a digit or a Unicode character inside a word in Python

Question

I want to find out whether a word contains both digits and symbols and, if so, separate the digit part from the symbol part. I want to check Tamil words, for example ரூ.100 or ரூ100. I want to split these into ரூ. and 100, and into ரூ and 100. How can I do this in Python? I tried this:

    for word in f.read().strip().split():
        for word1, word2, word3 in zip(word, word[1:], word[2:]):
            if word1 == "ர" and word2 == "ூ " and word3.isdigit():
                print word1
                print word2
            if word1.decode('utf-8') == unichr(0xbb0) and word2.decode('utf-8') == unichr(0xbc2):
                print word1
                print word2

+4

python regex unicode-string tamil

charvi Mar 30 '14 at 7:16

2 answers

Use Unicode properties:

\pL matches any letter.
\pN matches any number.

The regex would be:

(\pL+\.?)(\pN+)
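Python's built-in re module does not understand \pL or \pN, so here is a rough standard-library sketch of the same idea in Python 3, assuming it is enough to treat every character whose Unicode general category starts with "N" as the number part and everything else as the symbol part. The helper names is_number_char and split_runs are made up for this illustration. Testing for "not a number" instead of "is a letter" is deliberate: the vowel sign ூ (U+0BC2) is a combining mark rather than a letter, so a letter-only test would cut ரூ in two.

```python
import unicodedata
from itertools import groupby


def is_number_char(ch):
    # General categories Nd, Nl and No all start with "N";
    # this is roughly the set that \pN describes.
    return unicodedata.category(ch).startswith("N")


def split_runs(word):
    # Group consecutive characters by "is a number" / "is not a number"
    # and join each group back into a string.
    return ["".join(group) for _, group in groupby(word, key=is_number_char)]


print(split_runs("ரூ.100"))   # ['ரூ.', '100']
print(split_runs("ரூ100"))    # ['ரூ', '100']
print(split_runs("100ஆம்"))   # ['100', 'ஆம்']
```

A word with no digits comes back as a single-element list, so checking len(split_runs(word)) > 1 tells you whether the word mixes both kinds of characters.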

+1

Toto 30 . '14 11:06

alecxe · Accepted Answer · 2014-03-30T07:25:20+0000

You can use the regular expression (.*?)(\d+)(.*), which captures three groups: everything before the digits, the digits themselves, and everything after them:

>>> import re
>>> pattern = ur'(.*?)(\d+)(.*)'
>>> s = u"ரூ.100"
>>> match = re.match(pattern, s, re.UNICODE)
>>> print match.group(1)
ரூ.
>>> print match.group(2)
100

Or you can unpack the matched groups into variables, for example:

>>> s = u"100ஆம்"
>>> match = re.match(pattern, s, re.UNICODE)
>>> before, digits, after = match.groups()
>>> print before

>>> print digits
100
>>> print after
ஆம்
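The sessions above are Python 2 (ur'' literals and print statements). For Python 3, a minimal sketch of the same pattern might look like the following; str patterns are Unicode-aware by default there, so the re.UNICODE flag is not needed. The function name split_word is hypothetical, and returning None for words without digits is just one possible choice.

```python
import re

# Same pattern: a lazy prefix, a run of digits, and the remainder.
PATTERN = re.compile(r"(.*?)(\d+)(.*)")


def split_word(word):
    """Return (before, digits, after), or None if the word has no digits."""
    match = PATTERN.match(word)
    return match.groups() if match else None


for word in ["ரூ.100", "ரூ100", "100ஆம்", "ரூ"]:
    print(word, split_word(word))
```

The None check matters when looping over a whole file, because re.match returns None for any word that contains no digits and calling .groups() on it would raise an AttributeError.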

Hope this helps.
