I am processing 100,000 documents (the average document length is about 500 terms). For each document I want to find the top-k most similar documents (for example, k = 5) by cosine similarity. How can I do this efficiently in Python?
Here is what I did:
- for each document, tokenize the text, remove stop words, and count term frequencies (tf)
- this gives a tf matrix of about 100,000 documents * 600,000 terms
- compute 1 - pairwise_distances(tf_matrix, metric="cosine")
- for each document, take the k most similar documents (a simplified sketch of this pipeline follows the list)
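For reference, here is a rough sketch of that pipeline (assuming scikit-learn; docs is a placeholder for the list of already-cleaned document strings, not my actual code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances
import numpy as np

docs = ["..."]  # placeholder: cleaned document strings

# steps 1-2: term-frequency matrix (scikit-learn stores it sparse)
tf_matrix = CountVectorizer().fit_transform(docs)

# step 3: cosine similarity; the dense 100,000 x 100,000 result alone
# needs roughly 80 GB as float64, which is a big part of the problem
sim = 1 - pairwise_distances(tf_matrix, metric="cosine")

# step 4: indices of the k most similar documents (column 0 is usually the document itself)
k = 5
top_k = np.argsort(-sim, axis=1)[:, 1:k + 1]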
I have been running my code on an i5 2.5 GHz CPU for 12 hours and it is still running. So I want to know how to optimize my code or my procedure.
Here are my thoughts:
- for each document, do feature selection and keep only the terms with tf > 1
- cluster the documents first, then compute cosine similarity only within each cluster
- since I only need the top-k most similar documents, do I really need to compute all pairwise cosine similarities? (see the sketch after this list)
- should I stick with plain Python, or use concurrent programming?
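For the third point, one option I am looking at (a rough sketch only, using scikit-learn's NearestNeighbors with the cosine metric on a sparse matrix; docs again stands for the cleaned document strings):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["..."]  # placeholder: cleaned document strings
X = TfidfVectorizer().fit_transform(docs)  # sparse, never densified

# brute-force cosine k-NN; recent scikit-learn versions compute the distances
# in chunks, so the full n x n similarity matrix need not be held in memory
nn = NearestNeighbors(n_neighbors=6, metric="cosine", algorithm="brute").fit(X)

# n_neighbors=6 because each document's nearest neighbour is itself;
# indices[:, 1:] gives the top 5 other documents
distances, indices = nn.kneighbors(X)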
So, do you have any good ideas?
Many thanks.
I know there is a similar question, but that is not what I want.
Update 1
Thanks to @orange, after profiling I found that step 2 was the bottleneck! Here is the sample code:
import pandas as pd

# files: iterable of input documents; cleaned_dict: {term: tf} for the current
# file f (both are built elsewhere in the script)
def construct_dt_matrix():
    dt_matrix = pd.DataFrame(columns=['docid'])
    docid = 0
    for f in files:
        # append a new all-zero row for this document
        dt_matrix.loc[docid] = [0] * dt_matrix.shape[1]
        dt_matrix.set_value(docid, 'docid', docid)
        for key, value in cleaned_dict.items():
            # add a new column the first time a term is seen
            if key not in dt_matrix.columns.values:
                dt_matrix[key] = 0
            dt_matrix.set_value(docid, key, value)
        docid += 1
    return dt_matrix
So the bottleneck is adding new rows and columns to the pandas DataFrame. Any ideas?
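One direction I am considering (a rough sketch, not yet tested on the full data set): collect one {term: tf} dict per document first and let scikit-learn's DictVectorizer build a sparse documents x terms matrix in a single pass, instead of growing a DataFrame row by row and column by column. Here build_tf_dict is a hypothetical helper standing in for my existing cleaning/counting code that produces cleaned_dict for one file.

from sklearn.feature_extraction import DictVectorizer

def construct_dt_matrix(files):
    # one {term: tf} dict per document; build_tf_dict is a hypothetical helper
    tf_dicts = [build_tf_dict(f) for f in files]
    # build the whole sparse docs x terms matrix in one pass
    vectorizer = DictVectorizer(sparse=True)
    dt_matrix = vectorizer.fit_transform(tf_dicts)
    return dt_matrix, vectorizer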