I am implementing a statistical program and have run into a performance bottleneck; I was hoping the community could point me toward some optimizations.
I build a set from each line in the file and compare it, pairwise, against the set built from every other line in the same file. I then use the size of each intersection to filter out specific sets from the output. The problem is that this is a nested loop (O(n^2)), and the files the program handles typically contain just over 20,000 lines. I timed the algorithm: at around 500 lines it finishes in about 20 minutes, but on the full-size files it takes about 8 hours.
I have 16 GB of RAM and a reasonably fast quad-core Intel i7. I did not notice any significant difference in memory usage or speed from copying list1 and comparing against the second list instead of reopening the file (maybe because I have an SSD?). I assumed that the open mechanism reads/writes directly from the disk, which would be slower, but saw no difference between the two approaches. In fact, the program rarely uses more than 1 GB of RAM while running.
I am hoping someone knows a more suitable data type, or understands multiprocessing in Python better than I do, and can help me speed this up. I appreciate any help, and I hope my code is not too badly written.
    import ast, sys, os, shutil

    # arg1 and arg2 (path components) are defined elsewhere in the program
    filterValue = 3

    with open(arg2 + arg1 + "/filteredSets" + str(filterValue), "w") as outfile:
        with open(arg2 + arg1 + "/file", "r") as infile:
            # first pass: parse every line into a set
            list1 = [set(ast.literal_eval(row)) for row in infile]
            infile.seek(0)
            pos = 0  # index of the row currently being checked
            for row in infile:
                if len(list1) == 1:
                    outfile.write(row)
                    break
                set1 = list1[pos]
                # keep the row only if it shares fewer than filterValue
                # elements with every other set
                for i in range(len(list1)):
                    if i != pos and len(set1.intersection(list1[i])) >= filterValue:
                        break
                else:
                    outfile.write(row)
                pos += 1
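For comparison, one way to avoid intersecting every pair of sets directly is to build an inverted index (element → rows containing it) and count shared elements only for pairs of rows that actually co-occur somewhere. This is a sketch under my own names (filter_rows, filter_value), not a drop-in replacement, and its benefit depends on the overlaps being sparse:

```python
from collections import defaultdict

def filter_rows(rows, filter_value=3):
    """Keep rows whose set shares fewer than filter_value elements
    with every other row (same criterion as the nested loop)."""
    sets = [set(r) for r in rows]

    # inverted index: element -> indices of rows containing it
    index = defaultdict(list)
    for i, s in enumerate(sets):
        for elem in s:
            index[elem].append(i)

    # count shared elements per pair of rows that co-occur on some element
    overlap = defaultdict(int)
    for members in index.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                overlap[(members[a], members[b])] += 1

    # any pair sharing >= filter_value elements disqualifies both rows
    bad = set()
    for (a, b), n in overlap.items():
        if n >= filter_value:
            bad.update((a, b))
    return [r for i, r in enumerate(rows) if i not in bad]
```

If most elements appear in only a handful of rows, this does work proportional to the number of co-occurring pairs rather than to all n^2 pairs.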
A sample input file has the following format:
[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]
For that input, the output file would be:
[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]
...
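The filtering criterion on this sample can be checked end to end in a few lines. Note that ast.literal_eval cannot parse bare names like userID1, so below the IDs are written as quoted strings; I assume the real file contains quoted strings or numeric IDs:

```python
rows = [
    ["userID1", "userID2", "userID3"],
    ["userID5", "userID3", "userID9"],
    ["userID10", "userID2", "userID3", "userID1"],
    ["userID8", "userID20", "userID11", "userID1"],
]
filterValue = 3
sets = [set(r) for r in rows]

# keep a row only if it shares fewer than filterValue elements
# with every other row
kept = [rows[i] for i, s in enumerate(sets)
        if all(len(s & t) < filterValue
               for j, t in enumerate(sets) if j != i)]

for r in kept:
    print(r)
```

Rows 1 and 3 share three elements (userID1, userID2, userID3), so both are dropped; the other two rows survive, matching the expected output above.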
