I am processing 100,000 documents (the average document length is about 500 terms). For each document I want to find the top-k most similar documents (for example, k = 5) by cosine similarity. How can I do this efficiently in Python?
Here is what I did:
- for each document, tokenize the text, remove stop words, and count term frequencies (tf)
- this gives a tf matrix of about 100,000 documents * 600,000 terms
- compute 1 - pairwise_distances(tf_matrix, metric="cosine")
- for each document, take the k most similar documents (a simplified sketch of this pipeline follows the list)
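For reference, here is a rough sketch of that pipeline (assuming scikit-learn; docs is a placeholder for the list of already-cleaned document strings, not my actual code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances
import numpy as np

docs = ["..."]  # placeholder: cleaned document strings

# steps 1-2: term-frequency matrix (scikit-learn stores it sparse)
tf_matrix = CountVectorizer().fit_transform(docs)

# step 3: cosine similarity; the dense 100,000 x 100,000 result alone
# needs roughly 80 GB as float64, which is a big part of the problem
sim = 1 - pairwise_distances(tf_matrix, metric="cosine")

# step 4: indices of the k most similar documents (column 0 is usually the document itself)
k = 5
top_k = np.argsort(-sim, axis=1)[:, 1:k + 1]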
I have been running my code on an i5 2.5 GHz CPU for 12 hours and it is still running. So I want to know how to optimize my code or my procedure.
Here are my thoughts:
- for each document, do feature selection and keep only the terms with tf > 1
- cluster the documents first, then compute cosine similarity only within each cluster
- since I only need the top-k most similar documents, do I really need to compute all pairwise cosine similarities? (see the sketch after this list)
- should I stick with plain Python, or use concurrent programming?
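For the third point, one option I am looking at (a rough sketch only, using scikit-learn's NearestNeighbors with the cosine metric on a sparse matrix; docs again stands for the cleaned document strings):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["..."]  # placeholder: cleaned document strings
X = TfidfVectorizer().fit_transform(docs)  # sparse, never densified

# brute-force cosine k-NN; recent scikit-learn versions compute the distances
# in chunks, so the full n x n similarity matrix need not be held in memory
nn = NearestNeighbors(n_neighbors=6, metric="cosine", algorithm="brute").fit(X)

# n_neighbors=6 because each document's nearest neighbour is itself;
# indices[:, 1:] gives the top 5 other documents
distances, indices = nn.kneighbors(X)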
So, do you have any good ideas?
Many thanks.
I know there is a similar question, but that is not what I want.
Update 1
Thanks to @orange, after profiling I found that step 2 was the bottleneck! Here is the sample code:
import pandas as pd

# files: iterable of input documents; cleaned_dict: {term: tf} for the current
# file f (both are built elsewhere in the script)
def construct_dt_matrix():
    dt_matrix = pd.DataFrame(columns=['docid'])
    docid = 0
    for f in files:
        # append a new all-zero row for this document
        dt_matrix.loc[docid] = [0] * dt_matrix.shape[1]
        dt_matrix.set_value(docid, 'docid', docid)
        for key, value in cleaned_dict.items():
            # add a new column the first time a term is seen
            if key not in dt_matrix.columns.values:
                dt_matrix[key] = 0
            dt_matrix.set_value(docid, key, value)
        docid += 1
    return dt_matrix
So the bottleneck is adding new rows and columns to the pandas DataFrame. Any ideas?
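One direction I am considering (a rough sketch, not yet tested on the full data set): collect one {term: tf} dict per document first and let scikit-learn's DictVectorizer build a sparse documents x terms matrix in a single pass, instead of growing a DataFrame row by row and column by column. Here build_tf_dict is a hypothetical helper standing in for my existing cleaning/counting code that produces cleaned_dict for one file.

from sklearn.feature_extraction import DictVectorizer

def construct_dt_matrix(files):
    # one {term: tf} dict per document; build_tf_dict is a hypothetical helper
    tf_dicts = [build_tf_dict(f) for f in files]
    # build the whole sparse docs x terms matrix in one pass
    vectorizer = DictVectorizer(sparse=True)
    dt_matrix = vectorizer.fit_transform(tf_dicts)
    return dt_matrix, vectorizer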