Efficient way to calculate the distance between combinations of pandas dataframe columns

Task

I have a pandas dataframe where:

  • columns are document names
  • rows are the words in those documents
  • the numbers inside the cells are a measure of how relevant the word is to that document (just the word count, if you want to keep it simple)

I need to compute a new document-to-document similarity matrix, where:

  • rows and columns are document names
  • the cells contain a measure of similarity, 1 - cosine distance, between the two documents

Cosine distance is convenient because scipy.spatial.distance.cosine computes it directly.

I am currently doing this:

  • use itertools to create a list of all 2-combinations (pairs) of document names (the dataframe's column names)
  • loop over them, updating a dictionary {doc1: {doc2: similarity}}
  • after the loop, create a new dataframe with pandas.DataFrame(dict)

Problem

But it takes a lot of time. Below is the current speed on a 13" MacBook Pro with 16 GB of RAM and a 2.9 GHz i5 CPU, running the latest Anaconda Python 3.5 ... plotting the time taken against the number of document combinations.

[plot: calculation time vs. number of document combinations]

You can see that 100,000 combinations take 1,200 seconds. Extrapolating that to my corpus of 7,944 documents, which gives 31,549,596 combinations, it would take 5 days to calculate this similarity matrix!

Any ideas?

  • I previously built the dataframe dynamically with df.ix[doc1, doc2] = similarity .. which was much slower.
  • I've looked at Numba's @jit, but it does not work with pandas data structures.
  • I can't find a built-in function that would do all the work internally (in C?).
  • What I could do tactically is to randomly sample documents to create a much smaller set to work with ... currently, a fraction of 0.02 results in a calculation of about 20 minutes! (A sketch follows this list.)
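A minimal sketch of that sampling step, assuming the relevance_index frame and docs_sample list named in the code further down:

    # randomly keep 2% of the document columns to experiment with
    docs_sample = list(relevance_index.sample(frac=0.02, axis=1).columns)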

Here's the code (github):

    import itertools
    import collections
    import scipy.spatial.distance
    import pandas

    # nested dict to hold the pairwise similarities
    doc_similarity_dict = collections.defaultdict(dict)

    docs_combinations = itertools.combinations(docs_sample, 2)
    for doc1, doc2 in docs_combinations:
        # scipy's cosine function includes normalising the vectors, but it is a
        # distance .. so we need to subtract it from 1.0 to get a similarity
        doc_similarity_dict[doc2].update(
            {doc1: 1.0 - scipy.spatial.distance.cosine(relevance_index[doc1], relevance_index[doc2])})

    # convert dict to pandas dataframe
    doc_similarity_matrix = pandas.DataFrame(doc_similarity_dict)

Simple example

@MaxU asked for an illustrative example.

Relevance matrix (just word counts here, to keep it simple):

             doc1   doc2   doc3
    wheel     2.     3.     0.
    seat      2.     2.     0.
    lights    0.     1.     1.
    cake      0.     0.     5.

The similarity matrix calculated from the 2-combinations (doc1, doc2), (doc2, doc3), (doc1, doc3):

            doc2     doc3
    doc1    0.9449   0.
    doc2    -        0.052

Take the top-left value of 0.9449 .. this is the dot product (2*3 + 2*2 + 0 + 0) = 10, but normalised by the lengths of the vectors ... so we divide by sqrt(8) and sqrt(14), giving 0.9449. You can see that there is no similarity between doc1 and doc3: their dot product is zero.
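As a quick numeric check of that arithmetic, here is a small sketch feeding the three columns of the relevance matrix above straight into scipy:

    import scipy.spatial.distance

    doc1 = [2., 2., 0., 0.]   # wheel, seat, lights, cake
    doc2 = [3., 2., 1., 0.]
    doc3 = [0., 0., 1., 5.]

    # similarity = 1 - cosine distance
    print(1.0 - scipy.spatial.distance.cosine(doc1, doc2))   # 0.9449...
    print(1.0 - scipy.spatial.distance.cosine(doc1, doc3))   # 0.0
    print(1.0 - scipy.spatial.distance.cosine(doc2, doc3))   # 0.0524...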

Scale this up from 3 documents with 4 words ... to 7,944 documents, which create 31,549,596 combinations ...

2 answers

Numba would be a good solution for this. As I think you know, it does not support pandas DataFrames, but it is built around NumPy arrays. That is not a problem - you can easily and quickly convert your DataFrame to a 2D array and pass it to a Numba function (which will be pretty much the code you already have, just decorated with @njit at the top).

Also note that instead of dict-of-dicts for the results, you can use one triangle of a square matrix to store them:

            doc1   doc2   doc3
    doc1    NAN    NAN    NAN
    doc2    ...    NAN    NAN
    doc3    ...    ...    NAN
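A minimal sketch of how those two suggestions could fit together (the function body here is illustrative, not your exact code, and relevance_index is the frame name from the question):

    import numpy as np
    from numba import njit

    @njit
    def similarity_jit(m):
        # m is a 2D array: rows are words, columns are documents
        n = m.shape[1]
        out = np.full((n, n), np.nan)
        for i in range(n):
            for j in range(i + 1, n):
                # cosine similarity between document columns i and j
                num = np.sum(m[:, i] * m[:, j])
                den = np.sqrt(np.sum(m[:, i] * m[:, i])) * np.sqrt(np.sum(m[:, j] * m[:, j]))
                # store it in one triangle of the square result matrix
                out[j, i] = num / den
        return out

    # convert the DataFrame to a plain 2D NumPy array before calling the jitted function
    sim = similarity_jit(relevance_index.values)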

Edit: you have now implemented this with Numba, but only got a 2.5x speedup. I ran a few experiments and found a big win:

    In [66]: x = np.random.random((1000,1000))
    In [67]: y = np.array(x, order='F')
    In [68]: %timeit similarity_jit(x)
    1 loop, best of 3: 13.7 s per loop
    In [69]: %timeit similarity_jit(y)
    1 loop, best of 3: 433 ms per loop

So your algorithm will be much faster if it works on contiguous chunks of data, because of caching. Since the core of your algorithm is numpy.dot(m[:,i], m[:,j]), and m[:,i] is one column, you are better off orienting your data in "Fortran order" (column-major order), so that m[:,i] gives a single contiguous array (because the array is laid out "transposed" in memory).
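A sketch of that reordering step, assuming the relevance_index frame from the question and your similarity_jit function from the timings above:

    import numpy as np

    # the DataFrame's underlying array is C ordered (row-major), so a column
    # slice m[:, i] is strided across memory
    m_c = relevance_index.values

    # copy it into Fortran order (column-major) so each column is one
    # contiguous block - much friendlier to the CPU cache
    m_f = np.asfortranarray(m_c)

    sim = similarity_jit(m_f)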


This is about as efficient as I can make the algorithm without going into multiprocessing (bleh). The function uses NumPy arrays for all the calculations.

    import numpy as np

    def cos_sim(data_frame):
        # create a numpy array from the data frame
        a = data_frame.values
        # get the number of documents
        n = a.shape[-1]
        # create an array of size docs x docs to populate
        out = np.ravel(np.zeros(shape=(n, n)))
        for i in range(n):
            # roll the array one step at a time, calculating the cosine similarity each time
            r = np.roll(a, -i, axis=1)
            cs = np.sum(a[:, :n-i] * r[:, :n-i], axis=0) / (
                    np.sqrt(np.sum(a[:, :n-i] * a[:, :n-i], axis=0))
                    * np.sqrt(np.sum(r[:, :n-i] * r[:, :n-i], axis=0)))
            # push the cosine similarities onto the i-th off-diagonal of the output array
            out[i:n*n - i*n:n+1] = cs
        return out.reshape((n, n))
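A minimal usage sketch, assuming the relevance_index frame from the question, to get the result back as a labelled frame:

    import pandas

    sim = cos_sim(relevance_index)

    # label the rows and columns with the document names again
    doc_similarity_matrix = pandas.DataFrame(sim,
                                             index=relevance_index.columns,
                                             columns=relevance_index.columns)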

Source: https://habr.com/ru/post/1012370/

