I am new to Python and naively wrote a Python script for the following task:
I want to create a bag-of-words representation of several objects. Each object is basically a pair: a movie title and a bag of words representing its synopsis. Each object is thus converted into a final document.
Here is the script:
```python
import re
import math
import itertools
from nltk.corpus import stopwords
from nltk import PorterStemmer
from collections import defaultdict
from collections import Counter
from itertools import dropwhile
import sys, getopt

inp = "inp_6000.txt"  # input file name
out = "bowfilter10"   # output file name

with open(inp, 'r') as plot_data:
    main_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    dictionary = defaultdict(Counter)
    doc_count = defaultdict(Counter)
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        # split into words
        words = re.findall(r'\w+', movie_plot, flags=re.UNICODE | re.LOCALE)
        # remove stop words, using NLTK
        elemStopW = filter(lambda x: x not in stopwords.words('english'), words)
        for word in elemStopW:
            # use the NLTK stemmer class to do the stemming
            word = PorterStemmer().stem_word(word)
            # increment the word count of this word in this movie's synopsis
            dictionary[movie_name][word] += 1
            # increment the count of this word in the main dictionary,
            # which stores frequencies across all documents
            main_dict[word] += 1
            # For tf-idf: note only the first occurrence of the word in
            # this synopsis and neglect all others.
            if doc_count[word]['this_mov'] == 0:
                doc_count[word].update(count=1, this_mov=1)
        # reset the per-movie flag before the next synopsis
        for word in doc_count:
            doc_count[word].update(this_mov=-1)

# Remove all words with frequency less than 5 in the whole set of movies
for key, count in dropwhile(lambda key_count: key_count[1] >= 5,
                            main_dict.most_common()):
    del main_dict[key]

# calculate the bag-of-words vector and write it
bow_vec = open(out, 'w')
m = len(dictionary)
for movie_name in dictionary.keys():
    vector = []
    for word in list(main_dict):
        x = dictionary[movie_name][word] * math.log(m / doc_count[word]['count'], 2)
        vector.append(x)
    # write to file
    bow_vec.write("%s" % movie_name)
    for item in vector:
        bow_vec.write("%s," % item)
    bow_vec.write("\n")
```
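One correctness detail worth flagging before optimizing: under Python 2 (which `itertools.izip` implies), `m / doc_count[word]['count']` is integer division, so the ratio is truncated before the log. For reference, here is a small self-contained sketch of the same tf-idf weighting, standard library only and no NLTK; `tfidf_vectors` and `min_freq` are illustrative names I made up, not part of the script:

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_freq=1):
    """docs: {name: list of tokens}. Returns (vocab, {name: vector}).

    Each vector component is tf(word, doc) * log2(N / df(word)),
    the same weighting the script computes.
    """
    tf = {name: Counter(tokens) for name, tokens in docs.items()}
    df = Counter()     # document frequency: how many docs contain the word
    total = Counter()  # corpus-wide frequency, for the min_freq cut
    for counts in tf.values():
        df.update(counts.keys())
        total.update(counts)
    vocab = sorted(w for w, c in total.items() if c >= min_freq)
    m = len(docs)
    vectors = {name: [counts[w] * math.log(m / df[w], 2) for w in vocab]
               for name, counts in tf.items()}
    return vocab, vectors
```

A word that occurs in every document gets idf `log2(m/m) = 0` and so drops out of every vector, which is the intended behavior of the weighting.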
Data file format: the data file has the following structure (it can be assumed that each synopsis is about 150 words):

```
<Movie Title>
<Empty Line>
<Movie Synopsis>
<Empty Line>
```
Note: the <*> markers are for presentation only.
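Since the file is a strict four-line record format, the title/synopsis pairs can also be read with a plain generator instead of `tee` plus two `islice` views over the same file. This is only a sketch under that assumption; `records` is a hypothetical helper, not part of the script:

```python
import itertools

def records(lines):
    """Yield (title, synopsis) pairs from the four-line record format:
    title, blank line, synopsis, blank line."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, 4))  # one record per 4 lines
        if not chunk:
            return
        title = chunk[0].strip()
        synopsis = chunk[2].strip() if len(chunk) > 2 else ""
        yield title, synopsis
```

Because it consumes the file lazily in one pass, it keeps only one record in memory at a time, which matters for a 200 MB input.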
Input file size: about 200 MB.
This script currently takes about 10 to 12 hours on a 3 GHz Intel processor.
Note: I am looking for improvements to the serial code. I know that parallelization would improve it, but I want to study that later; for now I want to make the serial code as efficient as possible.
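For the serial code, the usual first step is hoisting loop invariants out of the hot path: `stopwords.words('english')` rebuilds a list on every line (and membership tests on a list are linear), and `PorterStemmer()` is constructed once per word. A hedged sketch of that hoisting, with the NLTK stopword list and stemmer replaced by toy stand-ins since NLTK availability is an assumption here:

```python
import re

WORD_RE = re.compile(r'\w+', re.UNICODE)  # compile the pattern once

# Assumption: stand-ins for NLTK. With NLTK installed these would be
#   STOPWORDS = frozenset(stopwords.words('english'))
#   stemmer = PorterStemmer()   # constructed once, reused for every word
STOPWORDS = frozenset(["the", "a", "an", "and", "is"])

def stem(word):
    # toy stemmer stub, standing in for the real Porter stemmer
    return word[:-1] if word.endswith("s") else word

_stem_cache = {}

def cached_stem(word):
    # memoize: each distinct surface form is stemmed only once,
    # which pays off when the same words recur across 6000 synopses
    s = _stem_cache.get(word)
    if s is None:
        s = _stem_cache[word] = stem(word)
    return s

def tokens(line):
    """Lowercase, split, drop stopwords, stem -- all invariants hoisted."""
    return [cached_stem(w) for w in WORD_RE.findall(line.lower())
            if w not in STOPWORDS]
```

The same idea applies to the output stage: computing `math.log(m / doc_count[word]['count'], 2)` once per vocabulary word, instead of once per (movie, word) pair, removes a log call and two dict lookups from the inner loop.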
Any help is appreciated.