Python regex module not working with UTF-8 (Devanagari)

I am using Python 2.7 for Bodo NLP (Bodo is written in the Devanagari script).

While removing stop words, I made a list of stop words in a file, one per line (separated by "\n"). I used the codecs module to read this file and turn it into a list:

    import codecs

    raw_txt = codecs.open('stopwords.txt', 'r', 'utf-8')
    stopWords = []
    while 1:
        line = raw_txt.readline()
        if not line:
            break
        line = u''.join(line.strip())
        stopWords.append(line)

Now I compiled a regex to search for matching words:

    import regex

    def addWordBoundary(word):
        return u''.join(r"\b" + word + r"\b")

    reg = regex.compile(r"(%s)" % "|".join(map(addWordBoundary, stopWords)),
                        regex.UNICODE)

I read the corpus (a text file) into a string with the codecs module, run regex.sub() on it, and then write the result to a file, again using codecs. But the substitution missed a few words, and I could not understand why.

    fl = codecs.open('corpus.txt', 'r', 'utf-8')
    rawFile = fl.read()
    cleanText = reg.sub('', rawFile, regex.U)
    wr = codecs.open('output.txt', 'w', 'utf-8')
    wr.write(cleanText)
    wr.close()

For testing purposes, use the following content for both stopwords.txt and corpus.txt:

माब्लानिफ्रायथो
फारसेनिफ्रायबो
रावनिफ्रायबो
माब्लानिफ्राय
जेब्लानिफ्राय
अब्लानिफ्राय
इफोरनिफ्राय
न ों निफ्रायबो
फारसेनिफ्राय
न ों निफ्रायनो

The output.txt file should be an empty file, but it contains:

रावनिफ्रायबो
इफोरनिफ्राय

This code works well for English (ASCII) text, so I am probably doing something wrong with the UTF-8 processing. Please advise.

1 answer

Some of the stop words in the file you provided begin or end with characters that are not categorized as alphanumeric:

    import unicodedata as ud

    for w in stopWords:
        for c in w[0], w[-1]:
            print repr(c), ud.category(c),
        print

This prints:

    u'\u092e' Lo u'\u094b' Mc
    u'\u092b' Lo u'\u094b' Mc
    u'\ufeff' Cf u'\u094b' Mc
    u'\u092e' Lo u'\u092f' Lo
    u'\u091c' Lo u'\u092f' Lo
    u'\u0905' Lo u'\u092f' Lo
    u'\ufeff' Cf u'\u092f' Lo
    u'\u0928' Lo u'\u094b' Mc
    u'\u092b' Lo u'\u092f' Lo
    u'\u0928' Lo u'\u094b' Mc

In particular, the two lines that you see in output.txt begin with u'\ufeff':

    >>> ud.name(u'\ufeff')
    'ZERO WIDTH NO-BREAK SPACE'

This character is also known as the byte order mark (BOM) and is sometimes placed at the beginning of a file to identify its encoding. Here it was probably included in the file accidentally while editing it. Python seems to drop the character when it occurs at the very beginning of the file, but not when it appears elsewhere, and strip() is not enough to remove it because U+FEFF does not count as whitespace. These characters have to be removed from the input file.
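Alternatively, the stray characters can be removed while reading. A minimal sketch (in Python 3 syntax; the 'utf-8-sig' codec and the u'\ufeff' literal work the same way in Python 2) that decodes with 'utf-8-sig' to drop a leading BOM and then strips U+FEFF from every line, since the default strip() only removes whitespace:

```python
import codecs

BOM = u'\ufeff'

def load_stopwords(path):
    # 'utf-8-sig' removes a BOM at the very start of the file;
    # BOMs buried inside the file must still be stripped per line.
    with codecs.open(path, 'r', 'utf-8-sig') as f:
        words = []
        for line in f:
            word = line.strip().strip(BOM)
            if word:
                words.append(word)
    return words
```

This also skips blank lines, so trailing newlines in the stop-word file cannot produce an empty alternative in the regex.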

On my machine, the words ending in u'\u094b' (DEVANAGARI VOWEL SIGN O) also survive in the output, so my build of Python apparently does not treat that character as alphanumeric either.
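If \b cannot be trusted around combining marks, one workaround (a sketch, not part of the original answer) is to spell the boundary out with lookarounds that accept only whole whitespace-delimited tokens:

```python
import re  # the question uses the third-party regex module; re behaves the same here

def compile_stopwords(words):
    # (?<!\S) ... (?!\S): the match must not touch a non-space character
    # on either side, i.e. it must be an entire whitespace-separated token.
    alternation = u'|'.join(re.escape(w) for w in words)
    return re.compile(u'(?<!\\S)(?:%s)(?!\\S)' % alternation, re.UNICODE)

# u'\u0930\u093e\u0935' is a Devanagari token; it must not match inside a longer word.
pattern = compile_stopwords([u'\u0930\u093e\u0935'])
```

Unlike \b, this never depends on how the regex engine classifies a combining vowel sign, at the cost of not matching tokens that sit next to punctuation.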

In addition, as a general rule, when you want to match exact strings with a regular expression, you should pass each string through re.escape() before inserting it into the pattern, in case it contains characters that would otherwise be treated as regex metacharacters.
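For instance (a made-up stop word, chosen only to show the effect of metacharacters; the regex module offers the equivalent regex.escape):

```python
import re

stop = u'cost(net)'  # hypothetical stop word containing regex metacharacters

unsafe = re.compile(stop)           # '(' and ')' form a capture group: matches 'costnet'
safe = re.compile(re.escape(stop))  # matches the literal text 'cost(net)'
```

The unescaped pattern silently matches the wrong text instead of raising an error, which makes this kind of bug easy to miss.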


Source: https://habr.com/ru/post/1482797/
