NLTK / Python stop words

I have code that preprocesses a data set for later use. The code I use for removing stop words seems to be fine, but I think the problem lies in the rest of my code, because only some of the stop words are actually removed.

    import re
    import nltk

    # Quran subset
    filename = 'subsetQuran.txt'

    # create list of lower case words
    word_list = re.split('\s+', file(filename).read().lower())
    print 'Words in text:', len(word_list)

    word_list2 = [w for w in word_list if not w in nltk.corpus.stopwords.words('english')]

    # create dictionary of word:frequency pairs
    freq_dic = {}
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    for word in word_list2:
        # remove punctuation marks
        word = punctuation.sub("", word)
        # form dictionary
        try:
            freq_dic[word] += 1
        except:
            freq_dic[word] = 1

    print '-'*30
    print "sorted by highest frequency first:"
    # create list of (val, key) tuple pairs
    freq_list2 = [(val, key) for key, val in freq_dic.items()]
    # sort by val or frequency
    freq_list2.sort(reverse=True)
    freq_list3 = list(freq_list2)
    # display result
    for freq, word in freq_list2:
        print word, freq

    f = open("wordfreq.txt", "w")
    f.write( str(freq_list3) )
    f.close()

The result is as follows:

 [(71, 'allah'), (65, 'ye'), (46, 'day'), (21, 'lord'), (20, 'truth'), (20, 'say'), (20, 'and') 

This is just a small sample; there are other stop words that should have been removed. Any help is appreciated.

1 answer

Try stripping whitespace from each word when you build word_list2:

 word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')] 
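
A possible reason `and` still shows up: in the question's code the stop-word check runs before punctuation is removed, so a token such as `and,` passes the filter and only afterwards becomes `and`. Below is a minimal sketch in Python 3 (the question uses Python 2) that cleans each token first and then tests it. `count_words` is a hypothetical helper, and the three-word stop list is a stand-in for illustration only; with NLTK you would pass `nltk.corpus.stopwords.words('english')` instead.

```python
import re
from collections import Counter

# Same character class as in the question: punctuation and digits to drop.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def count_words(text, stop_words):
    """Count non-stop words, cleaning each token before the stop-word test.

    count_words is a hypothetical helper, not part of NLTK.
    """
    stop_words = set(stop_words)  # build once; set lookup beats a list scan
    counts = Counter()
    for token in text.lower().split():
        word = PUNCTUATION.sub("", token)
        # Skip tokens that were only punctuation or digits, and stop words.
        if word and word not in stop_words:
            counts[word] += 1
    return counts


# Stand-in stop-word list for illustration only (an assumption);
# with NLTK: demo_stop_words = nltk.corpus.stopwords.words('english')
demo_stop_words = ["and", "the", "of"]
demo = count_words("And the Lord said: truth, and the day of truth.", demo_stop_words)
for word, freq in demo.most_common():
    print(word, freq)
```

Building the set once also avoids calling `nltk.corpus.stopwords.words('english')` again for every word, which the list comprehension above does.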

Source: https://habr.com/ru/post/1346038/

