I am using Python 2.7 for Bodo NLP (Bodo uses the Devanagari script).
While removing stop words, I made a file of stop words separated by newlines ("\n"). I used the codecs module to read this file and convert it to a list:
```python
import codecs

raw_txt = codecs.open('stopwords.txt', 'r', 'utf-8')
stopWords = []
while 1:
    line = raw_txt.readline()
    if not line:
        break
    line = u''.join(line.strip())
    stopWords.append(line)
```
Then I compiled a regex that matches any of the stop words:
```python
import regex

def addWordBoundary(word):
    return u''.join(r"\b" + word + r"\b")

reg = regex.compile(r"(%s)" % "|".join(map(addWordBoundary, stopWords)),
                    regex.UNICODE)
```
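The boundary-wrapping idea itself behaves as I expect on ASCII samples; here is a minimal sketch using the stdlib `re` module as a stand-in for `regex` (the sample words are invented):

```python
import re

stop_words = [u'abc', u'de']  # hypothetical sample words
pattern = re.compile(u"(%s)" % u"|".join(r"\b" + w + r"\b" for w in stop_words),
                     re.UNICODE)

# Removes whole-word matches only; 'abcd' would be left untouched.
print(pattern.sub(u'', u'abc x de'))  # -> ' x '
```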
I read the corpus (a text file) into a string with the codecs module, removed the stop words with regex.sub(), and then wrote the result to a file, again using codecs. But it missed a few words, and I cannot understand why:
```python
fl = codecs.open('corpus.txt', 'r', 'utf-8')
rawFile = fl.read()
cleanText = reg.sub('', rawFile, regex.U)
wr = codecs.open('output.txt', 'w', 'utf-8')
wr.write(cleanText)
wr.close()
```
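A quick sanity check (file path and helper name are mine) suggests the UTF-8 round trip itself is not where words get lost; `io.open` stands in for `codecs.open`:

```python
import io
import os
import tempfile

def roundtrip_utf8(text):
    # Write text as UTF-8, read it back, and return it;
    # mirrors the codecs.open() write/read pattern above.
    path = os.path.join(tempfile.gettempdir(), 'roundtrip_check.txt')
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with io.open(path, 'r', encoding='utf-8') as f:
        data = f.read()
    os.remove(path)
    return data

# A Devanagari sample survives the round trip unchanged.
assert roundtrip_utf8(u'\u092e\u093e\u092c\u094d\u0932\u093e') == u'\u092e\u093e\u092c\u094d\u0932\u093e'
```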
For testing purposes, use the following content as both stopwords.txt and corpus.txt:
माब्लानिफ्रायथो
फारसेनिफ्रायबो
रावनिफ्रायबो
माब्लानिफ्राय
जेब्लानिफ्राय
अब्लानिफ्राय
इफोरनिफ्राय
नों निफ्रायबो
फारसेनिफ्राय
नों निफ्रायनो
The output.txt file should be empty, but it contains:
रावनिफ्रायबो
इफोरनिफ्राय
This code works fine for English (ASCII) text, so maybe I am doing something wrong with the UTF-8 handling. Please suggest.