Given a sparse matrix (created with scipy.sparse.csr_matrix) of size NxN (N = 900,000), I am trying to find, for each row in the test set, the top k nearest neighbors (sparse row vectors from the input matrix) using a custom distance metric. Basically, each row of the input matrix represents an element, and for each element (row) in the test set I need to find its k nearest neighbors.
Attempts:
I tried sklearn.neighbors.NearestNeighbors. However, it seems that sklearn does not accept a callable metric function when working with sparse matrices:
ValueError: metric '<function <lambda> at 0x7f92ce221938>' not valid for sparse input
We are currently trying facebookresearch/pysparnn (looks very promising!). It has a provision for implementing one's own distance class. However, building the index takes a very long time (it is still running after 24 hours), and as mentioned by the author, it seems that
using distance types from scipy.spatial.distance.cdist (or sklearn distance metrics) is much slower than what is currently in pysparnn.
We are in the process of debugging this sklearn/scipy metric performance issue by writing something custom.
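To make "something custom" concrete, here is a minimal sketch of the kind of brute-force search we have in mind. It is not our actual code: the function name topk_cosine_neighbors, the chunk_size parameter, and the use of cosine distance as a stand-in for our custom metric are all assumptions made for illustration. It only uses sparse matrix products from scipy and numpy.argpartition, so it avoids calling a Python metric function per pair of rows.

```python
import numpy as np
from scipy import sparse


def topk_cosine_neighbors(train, test, k, chunk_size=64):
    """Brute-force top-k neighbors of each test row among the train rows.

    Hypothetical helper: cosine distance stands in for the custom metric.
    Returns (indices, distances), each of shape (n_test, k), ordered from
    nearest to farthest.
    """
    def l2_normalize(m):
        m = sparse.csr_matrix(m, dtype=np.float64)
        norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())
        norms[norms == 0.0] = 1.0  # leave all-zero rows unchanged
        return (sparse.diags(1.0 / norms) @ m).tocsr()

    train_t = l2_normalize(train).T.tocsr()
    test_n = l2_normalize(test)
    n_test = test_n.shape[0]

    indices = np.empty((n_test, k), dtype=np.int64)
    distances = np.empty((n_test, k), dtype=np.float64)
    for start in range(0, n_test, chunk_size):
        stop = min(start + chunk_size, n_test)
        # Dense block of cosine similarities: (stop - start) x n_train
        sims = (test_n[start:stop] @ train_t).toarray()
        # Unordered top-k per row, then sort those k by decreasing similarity
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part_sims = np.take_along_axis(sims, part, axis=1)
        order = np.argsort(-part_sims, axis=1)
        indices[start:stop] = np.take_along_axis(part, order, axis=1)
        distances[start:stop] = 1.0 - np.take_along_axis(part_sims, order, axis=1)
    return indices, distances
```

Each chunk materializes a dense block of chunk_size x n_train floats, so with 900,000 rows every test row in a chunk costs about 7.2 MB; chunk_size has to be chosen to fit in memory. A metric that cannot be expressed through matrix products would need a different inner step.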
, - , ?
( 64 , 12 )
!