How to efficiently retrieve the top-K most similar documents by cosine similarity in Python?

I am processing one hundred thousand (100,000) documents (the average document length is about 500 terms). For each document, I want to get the top k most similar documents (for example, k = 5) by cosine similarity. How can I do this efficiently in Python?

Here is what I did:

  • for each document, perform text segmentation, remove stop words, and count term frequencies (tf)
  • this gives a tf matrix of about 100,000 documents * 600,000 terms
  • compute 1 - pairwise_distances(tf_matrix, metric="cosine")
  • for each document, take the top k most similar documents.

I ran my code on a 2.5 GHz i5; after 12 hours it is still running. So I want to know how to optimize my code or procedure.

Here are my thoughts:

  • for each document, do feature selection and keep only the terms with tf > 1
  • cluster first, then compute cosine similarity within each cluster
  • since I only need the top k most similar documents, do I need to compute all pairwise cosine similarities?
  • parallel or concurrent programming in Python?

Do you have any good ideas?

Many thanks.


I know there is a similar question, but it does not cover what I need.


Update 1

Thanks to @orange, after profiling I found that step 2 (building the tf matrix) is the bottleneck! Here is sample code:

def construct_dt_matrix():
    dt_matrix = pd.DataFrame(columns=['docid'])
    docid = 0
    for f in files:
        # text segmentation for f
        # remove stop words
        # word count store in cleaned_dict = {'word': tf}
        dt_matrix.loc[docid] = [0] * dt_matrix.shape[1] # add one row, init all 0
        dt_matrix.set_value(docid, 'docid', docid)
        for key, value in cleaned_dict.items():
            if key not in dt_matrix.columns.values:
                dt_matrix[key] = 0 # add one column, init all 0
            dt_matrix.set_value(docid, key, value) # bottleneck
        docid += 1

So the bottleneck is adding new rows and columns to the pandas DataFrame. Any ideas?

2 answers

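Pandas DataFrames (and the numpy arrays underneath) are only fast when you assign whole arrays of data at once. set_value needs one call for every cell of your matrix! You can instead do dt_matrix = pd.DataFrame([cleaned_dict]) and get a DataFrame with a single function call (ignoring the calls pandas makes internally).

Try this instead: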

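import pandas as pd

rows = []
for docid, f in enumerate(files):
    # ... build cleaned_dict = {'word': tf} for file f, as in the question ...
    # wrap the dict in a list: its values are scalars, so it describes one row
    dt_matrix_file = pd.DataFrame([cleaned_dict])
    dt_matrix_file['docid'] = docid
    rows.append(dt_matrix_file)

# DataFrame.append was removed in pandas 2.0; concatenate once at the end
dt_matrix = pd.concat(rows, ignore_index=True)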

This should be much faster.

If you need the NaN cells to be 0, you can call dt_matrix.fillna(0) (again, a single call instead of potentially n * m).
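At 100,000 documents * 600,000 terms, a dense DataFrame filled with zeros may not fit in memory at all, so a sparse matrix is worth considering. The following is only a minimal sketch under assumptions that are not in the question: it uses scikit-learn's DictVectorizer and normalize, and the names build_tf_matrix, top_k_cosine and chunk_size are hypothetical. It builds the tf matrix in one call from the per-document {'word': tf} dicts, then computes cosine similarities one block of rows at a time and keeps only the top k per document, so the full n * n similarity matrix is never stored.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize


def build_tf_matrix(term_counts):
    """Turn a list of {'word': tf} dicts into a sparse CSR matrix (docs x terms)."""
    vectorizer = DictVectorizer(sparse=True)
    return vectorizer.fit_transform(term_counts), vectorizer


def top_k_cosine(tf_matrix, k=5, chunk_size=256):
    """Return (indices, scores) of the k most cosine-similar documents per row."""
    # With unit-length rows, a dot product equals the cosine similarity.
    unit = normalize(tf_matrix, norm="l2", axis=1)
    n_docs = unit.shape[0]
    k = min(k, n_docs - 1)
    top_idx = np.empty((n_docs, k), dtype=np.int64)
    top_sim = np.empty((n_docs, k), dtype=np.float64)
    for start in range(0, n_docs, chunk_size):
        stop = min(start + chunk_size, n_docs)
        # Only a (chunk x n_docs) block is ever held as a dense array.
        sims = (unit[start:stop] @ unit.T).toarray()
        # Exclude each document from its own neighbour list.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # argpartition selects the k largest per row without a full sort.
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part_sim = np.take_along_axis(sims, part, axis=1)
        # Sort just those k candidates by descending similarity.
        order = np.argsort(-part_sim, axis=1)
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sim, order, axis=1)
    return top_idx, top_sim
```

With a list term_counts holding one cleaned_dict per file, usage would look like tf_matrix, _ = build_tf_matrix(term_counts) followed by idx, sim = top_k_cosine(tf_matrix, k=5). Each dense block takes chunk_size * n_docs * 8 bytes, so chunk_size should be chosen to fit the available memory. This is still an exact, brute-force search; it only avoids the per-cell DataFrame updates and the full similarity matrix.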


Python. R Python.

Python , K- ?


Source: https://habr.com/ru/post/1621565/

