Optimizing a Python script that extracts and processes a large data file

I am new to Python and naively wrote a script for the following task:

I want to create a bag-of-words representation for a number of objects. Each object is basically a pair: a movie title and the bag of words from its synopsis. Each object is then turned into a final document vector.

Here is the script:

    import re
    import math
    import itertools
    import sys, getopt
    from nltk.corpus import stopwords
    from nltk import PorterStemmer
    from collections import defaultdict
    from collections import Counter
    from itertools import dropwhile

    inp = "inp_6000.txt"   # input file name
    out = "bowfilter10"    # output file name

    with open(inp, 'r') as plot_data:
        main_dict = Counter()
        file1, file2 = itertools.tee(plot_data, 2)
        line_one = itertools.islice(file1, 0, None, 4)
        line_two = itertools.islice(file2, 2, None, 4)
        dictionary = defaultdict(Counter)
        doc_count = defaultdict(Counter)
        for movie_name, movie_plot in itertools.izip(line_one, line_two):
            movie_plot = movie_plot.lower()
            # split into words
            words = re.findall(r'\w+', movie_plot, flags=re.UNICODE | re.LOCALE)
            # remove stop words with NLTK
            elemStopW = filter(lambda x: x not in stopwords.words('english'), words)
            for word in elemStopW:
                # stem with the NLTK Porter stemmer
                word = PorterStemmer().stem_word(word)
                # increment the word count for this movie's synopsis
                dictionary[movie_name][word] += 1
                # increment the count of this word in the main dictionary,
                # which stores frequencies across all documents
                main_dict[word] += 1
                # for TF-IDF: count only the first occurrence of the word
                # in this synopsis and ignore the rest
                if doc_count[word]['this_mov'] == 0:
                    doc_count[word].update(count=1, this_mov=1)
            for word in doc_count:
                doc_count[word].update(this_mov=-1)

    # remove all words with frequency less than 5 across the whole set of movies
    for key, count in dropwhile(lambda key_count: key_count[1] >= 5, main_dict.most_common()):
        del main_dict[key]

    # calculate the bag-of-words vectors and write them to file
    bow_vec = open(out, 'w')
    m = len(dictionary)
    for movie_name in dictionary.keys():
        vector = []
        for word in list(main_dict):
            x = dictionary[movie_name][word] * math.log(m / doc_count[word]['count'], 2)
            vector.append(x)
        bow_vec.write("%s" % movie_name)
        for item in vector:
            bow_vec.write("%s," % item)
        bow_vec.write("\n")

Data file format and additional data information: The data file has the following format:

    <Movie title>
    <Empty line>
    <Movie synopsis (it can be assumed to be about 150 words)>
    <Empty line>

Note: the <...> brackets are only for presentation; they do not appear in the file.
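For illustration, here is a minimal sketch (not part of the script above; the sample lines are invented) of how the stride-4 slicing in the code maps onto this record layout:

    # Sketch: every 4th line starting at 0 is a title, every 4th line starting at 2 is a synopsis.
    import itertools

    sample_lines = [
        "Movie title A\n",
        "\n",
        "Synopsis of movie A ...\n",
        "\n",
        "Movie title B\n",
        "\n",
        "Synopsis of movie B ...\n",
        "\n",
    ]

    titles = itertools.islice(iter(sample_lines), 0, None, 4)    # lines 0, 4, 8, ...
    synopses = itertools.islice(iter(sample_lines), 2, None, 4)  # lines 2, 6, 10, ...
    pairs = list(zip(titles, synopses))
    # pairs == [('Movie title A\n', 'Synopsis of movie A ...\n'),
    #           ('Movie title B\n', 'Synopsis of movie B ...\n')]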

Input file size: about 200 MB.

This script currently takes about 10-12 hours on an Intel 3 GHz processor.

Note: I am looking for improvements to the serial code. I know parallelization would improve it, but I want to look into that later. For now I want to make the serial code as efficient as possible.

Any help is appreciated.

3 answers

First of all - try to drop the regular expressions; they are heavy. (My initial advice here was poor - it would not have worked.) The following will probably be more efficient:

    import string

    # Python 2 str API: map every punctuation character to a space; calling .lower()
    # on the 256-char table also makes translate() lowercase the text in the same pass
    trans_table = string.maketrans(string.punctuation, ' ' * len(string.punctuation)).lower()
    words = movie_plot.translate(trans_table).split()

(Further thought) I cannot verify this, but I think that if you save the result of this call in a variable

 stops = stopwords.words('english') 

or, better yet, convert it to a set first (if the function does not already return one)

 stops = set(stopwords.words('english')) 

you will also get some improvement
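In the original loop, hoisting those lookups out might look like the sketch below. It is adapted from the question's Python 2 code and reuses its variables (line_one, line_two, dictionary, main_dict); the names stops and stem are mine.

    # Sketch: compute the stop-word set and bind the stemmer once, outside the loop,
    # instead of calling stopwords.words('english') for every single word.
    stops = set(stopwords.words('english'))
    stem = PorterStemmer().stem_word          # NLTK 2.x API used in the question

    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        words = re.findall(r'\w+', movie_plot.lower(), flags=re.UNICODE)
        for word in (w for w in words if w not in stops):
            word = stem(word)
            dictionary[movie_name][word] += 1
            main_dict[word] += 1
            # (doc_count bookkeeping from the question omitted for brevity)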

(To answer your question in a comment.) Every function call costs time, and if the call returns a large block of data that does not change from one use to the next, recomputing it every time is a huge waste. As for set vs. list, compare these results:

    In [49]: my_list = range(100)
    In [50]: %timeit 10 in my_list
    1000000 loops, best of 3: 193 ns per loop
    In [51]: %timeit 101 in my_list
    1000000 loops, best of 3: 1.49 us per loop
    In [52]: my_set = set(my_list)
    In [53]: %timeit 101 in my_set
    10000000 loops, best of 3: 45.2 ns per loop
    In [54]: %timeit 10 in my_set
    10000000 loops, best of 3: 47.2 ns per loop

While we are down in the gory details, here are the measurements for split() vs. a regular expression:

    In [30]: %timeit words = 'This is a long; and meaningless - sentence'.split(split_let)
    1000000 loops, best of 3: 271 ns per loop
    In [31]: %timeit words = re.findall(r'\w+', 'This is a long; and meaningless - sentence', flags = re.UNICODE | re.LOCALE)
    100000 loops, best of 3: 3.08 us per loop

Profile your code

From the Python wiki page:

The first step to speeding up your program is to learn where the bottlenecks lie. It hardly makes sense to optimize code that never runs or that already runs fast. Profile your code before you change any of it.

This page should have most of what you need (read the rest too): http://wiki.python.org/moin/PythonSpeed/PerformanceTips#Profiling_Code
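As a concrete starting point (my sketch, not from the wiki page), the standard-library cProfile module will show where the hours actually go; this assumes the script's body has been wrapped in a main() function:

    # Sketch: profile the existing script with cProfile and report the 20 most
    # expensive call sites by cumulative time.
    import cProfile
    import pstats

    cProfile.run('main()', 'bow_profile.stats')     # assumes a main() wrapper exists
    stats = pstats.Stats('bow_profile.stats')
    stats.sort_stats('cumulative').print_stats(20)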

Try a more idiomatic Python style.

Without seeing profiling results, I am fairly sure you will get a speedup by getting rid of the nested for-loops; the talks on Python coding style from PyCon 2013 make the point that list comprehensions and related constructs are faster than explicit loops (a sketch follows the links below). Try this talk:

http://pyvideo.org/video/1758/loop-like-a-native-while-for-iterators-genera

And search for the talk called "Transforming Code into Beautiful, Idiomatic Python."
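For example, the nested vector-building loops at the end of the question's script could be collapsed into a comprehension. This is my own sketch, not something from the talks, and it reuses the question's variables (dictionary, main_dict, doc_count, bow_vec):

    # Sketch: build each movie's TF-IDF vector with a list comprehension
    # instead of an explicit inner loop over main_dict.
    m = len(dictionary)
    for movie_name, word_counts in dictionary.items():
        vector = [word_counts[word] * math.log(float(m) / doc_count[word]['count'], 2)
                  for word in main_dict]          # float() avoids Python 2 integer division
        bow_vec.write(movie_name + "".join("%s," % x for x in vector) + "\n")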

Or use pandas (designed for working with data sets)

You may also be able to speed up the computation with the pandas module. It lets you load the entire input file at once and easily perform sums, counts, and other reduction-style operations over your large data set.
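A minimal sketch of what that could look like (my example, not from the pandas documentation; it assumes the parsing/stemming step has already produced a list of (movie, stemmed word) pairs called records):

    # Sketch: per-movie term frequencies, document frequencies, and the
    # "keep words seen at least 5 times" filter, expressed with pandas.
    import pandas as pd

    df = pd.DataFrame(records, columns=["movie", "word"])

    tf = df.groupby(["movie", "word"]).size()               # term frequency per movie
    doc_freq = df.drop_duplicates().groupby("word").size()  # how many movies contain each word
    totals = df["word"].value_counts()
    frequent_words = totals[totals >= 5].index              # words with overall frequency >= 5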

The book Python for Data Analysis (written by the creator of pandas) walks through basic statistics on several real data sets. You would probably get your money's worth from the examples alone. I bought it and love it.

I wish you good luck.


Another thing that can hurt performance is deleting entries from the dictionary. Rebuilding the dictionary can be much more efficient:

    from itertools import takewhile

    word_dict = {key: count
                 for key, count in takewhile(lambda key_count: key_count[1] >= 5,
                                             main_dict.most_common())}
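An even simpler variant (my sketch, not from the original answer) that does not rely on the ordering of most_common():

    # Sketch: rebuild the counter, keeping only words seen at least 5 times overall.
    main_dict = Counter({word: count for word, count in main_dict.items()
                         if count >= 5})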

All in all, I am a little too lazy to work through every detail, but here are a few hints that may make it more efficient. As far as I can tell, you do not need the doc_count variable at all - it is redundant and inefficient, and re-evaluating it also hurts performance. main_dict.keys() does the same job - it lists every word exactly once.

Here is a sketch of what I mean - I cannot prove that it is more efficient, but it certainly looks more Pythonic:

    with open(inp, 'r') as plot_data:
        word_dict = Counter()
        file1, file2 = itertools.tee(plot_data, 2)
        line_one = itertools.islice(file1, 0, None, 4)
        line_two = itertools.islice(file2, 2, None, 4)
        all_stop_words = stopwords.words('english')
        movie_dict = defaultdict(Counter)
        stemmer_func = PorterStemmer().stem_word
        for movie_name, movie_plot in itertools.izip(line_one, line_two):
            movie_plot = movie_plot.lower()
            # split the plot into words, e.g. with the translate()-based snippet above
            words = movie_plot.translate(trans_table).split()
            all_words = [stemmer_func(word) for word in words if word not in all_stop_words]
            current_word_counter = Counter(all_words)
            movie_dict[movie_name].update(current_word_counter)
            word_dict.update(current_word_counter)

One last thing: dictionary is not a good variable name - it does not tell you what it contains.


Source: https://habr.com/ru/post/1272769/

