Python regex module not working with UTF-8 (Devanagari)

I am using Python 2.7 for Bodo NLP (Bodo is written in the Devanagari script).

While removing stop words, I made a list of stop words in a file, one per line (separated by "\n"). I used the codecs module to read this file and turn it into a list:

    import codecs

    raw_txt = codecs.open('stopwords.txt', 'r', 'utf-8')
    stopWords = []
    while 1:
        line = raw_txt.readline()
        if not line:
            break
        line = u''.join(line.strip())
        stopWords.append(line)

Now I compiled a regex to search for matching words:

    import regex

    def addWordBoundary(word):
        return u''.join(r"\b" + word + r"\b")

    reg = regex.compile(r"(%s)" % "|".join(map(addWordBoundary, stopWords)),
                        regex.UNICODE)

I read the corpus (a text file) into a string with the codecs module, run regex.sub() on it, and then write the result to a file, again using codecs. But the substitution missed a few words, and I could not understand why.

    fl = codecs.open('corpus.txt', 'r', 'utf-8')
    rawFile = fl.read()
    cleanText = reg.sub('', rawFile, regex.U)
    wr = codecs.open('output.txt', 'w', 'utf-8')
    wr.write(cleanText)
    wr.close()

For testing purposes, use the following content for both stopwords.txt and corpus.txt:

माब्लानिफ्रायथो
फारसेनिफ्रायबो
रावनिफ्रायबो
माब्लानिफ्राय
जेब्लानिफ्राय
अब्लानिफ्राय
इफोरनिफ्राय
न ों निफ्रायबो
फारसेनिफ्राय
न ों निफ्रायनो

The output.txt file should be an empty file, but it contains:

रावनिफ्रायबो
इफोरनिफ्राय

This code works well for English (ASCII) text, so I am probably doing something wrong with the UTF-8 processing. Please advise.

1 answer

Some of the stop words in the file you provided begin or end with characters that are not categorized as alphanumeric:

    import unicodedata as ud

    for w in stopWords:
        for c in w[0], w[-1]:
            print repr(c), ud.category(c),
        print

This prints:

    u'\u092e' Lo u'\u094b' Mc
    u'\u092b' Lo u'\u094b' Mc
    u'\ufeff' Cf u'\u094b' Mc
    u'\u092e' Lo u'\u092f' Lo
    u'\u091c' Lo u'\u092f' Lo
    u'\u0905' Lo u'\u092f' Lo
    u'\ufeff' Cf u'\u092f' Lo
    u'\u0928' Lo u'\u094b' Mc
    u'\u092b' Lo u'\u092f' Lo
    u'\u0928' Lo u'\u094b' Mc

In particular, the two lines that you see in output.txt begin with u'\ufeff':

    >>> ud.name(u'\ufeff')
    'ZERO WIDTH NO-BREAK SPACE'

This character is also known as the byte order mark (BOM) and is sometimes placed at the beginning of a file to identify its encoding. Here it was probably included in the file accidentally while editing it. Python seems to drop the character when it occurs at the very beginning of the file, but not when it appears elsewhere, and strip() is not enough to remove it because U+FEFF does not count as whitespace. These characters have to be removed from the input file.
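Alternatively, the stray characters can be removed while reading. A minimal sketch (in Python 3 syntax; the 'utf-8-sig' codec and the u'\ufeff' literal work the same way in Python 2) that decodes with 'utf-8-sig' to drop a leading BOM and then strips U+FEFF from every line, since the default strip() only removes whitespace:

```python
import codecs

BOM = u'\ufeff'

def load_stopwords(path):
    # 'utf-8-sig' removes a BOM at the very start of the file;
    # BOMs buried inside the file must still be stripped per line.
    with codecs.open(path, 'r', 'utf-8-sig') as f:
        words = []
        for line in f:
            word = line.strip().strip(BOM)
            if word:
                words.append(word)
    return words
```

This also skips blank lines, so trailing newlines in the stop-word file cannot produce an empty alternative in the regex.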

On my machine, the words ending in u'\u094b' (DEVANAGARI VOWEL SIGN O) also survive in the output, so my build of Python apparently does not treat that character as alphanumeric either.
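If \b cannot be trusted around combining marks, one workaround (a sketch, not part of the original answer) is to spell the boundary out with lookarounds that accept only whole whitespace-delimited tokens:

```python
import re  # the question uses the third-party regex module; re behaves the same here

def compile_stopwords(words):
    # (?<!\S) ... (?!\S): the match must not touch a non-space character
    # on either side, i.e. it must be an entire whitespace-separated token.
    alternation = u'|'.join(re.escape(w) for w in words)
    return re.compile(u'(?<!\\S)(?:%s)(?!\\S)' % alternation, re.UNICODE)

# u'\u0930\u093e\u0935' is a Devanagari token; it must not match inside a longer word.
pattern = compile_stopwords([u'\u0930\u093e\u0935'])
```

Unlike \b, this never depends on how the regex engine classifies a combining vowel sign, at the cost of not matching tokens that sit next to punctuation.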

In addition, as a general rule, when you want to match exact strings with a regular expression, you should pass each string through re.escape() before inserting it into the pattern, in case it contains characters that would otherwise be treated as regex metacharacters.
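For instance (a made-up stop word, chosen only to show the effect of metacharacters; the regex module offers the equivalent regex.escape):

```python
import re

stop = u'cost(net)'  # hypothetical stop word containing regex metacharacters

unsafe = re.compile(stop)           # '(' and ')' form a capture group: matches 'costnet'
safe = re.compile(re.escape(stop))  # matches the literal text 'cost(net)'
```

The unescaped pattern silently matches the wrong text instead of raising an error, which makes this kind of bug easy to miss.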


Source: https://habr.com/ru/post/1482797/
