I have a two-dimensional numpy array with hundreds of thousands of rows and a thousand columns (say an N x P array with N = 200,000 and P = 1000). The goal is to count the number of identical elements (compared position by position) between each pair of row vectors, ideally using numpy array magic that does not require me to loop over all 199,999 * 100,000 such pairs. Since storing a 200,000 x 200,000 array is probably not practical, the output would most likely be in a three-column sparse coordinate format (i, j, count). For example, if the input is:
5 12 14 200 0 45223
7 12 14 0 200 60000
7 6 23 0 0 45223
5 6 14 200 0 45223
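the resulting (dense) N x N matrix M would be (ignoring the diagonal elements):
0 2 2 5
2 0 2 1
2 2 0 3
5 1 3 0
where Mij is the number of identical elements between row i and row j, with indices starting at 0.
The corresponding sparse coordinate output would be:
0 1 2
0 2 2
0 3 5
1 2 2
1 3 1
2 3 3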
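To make the relationship between the dense matrix and the coordinate format concrete, here is a minimal sketch. The helper name `dense_to_coordinates` is hypothetical, and the broadcasting step that builds the dense matrix is only viable for small inputs like the example above, not for N = 200,000.

```python
import numpy as np

rows = np.array([
    [5, 12, 14, 200, 0, 45223],
    [7, 12, 14, 0, 200, 60000],
    [7, 6, 23, 0, 0, 45223],
    [5, 6, 14, 200, 0, 45223],
])


def dense_to_coordinates(dense):
    """Return an array of (i, j, value) rows for every pair with i < j."""
    n = dense.shape[0]
    # Strict upper triangle, in row-major order: (0,1), (0,2), ..., (n-2,n-1).
    i, j = np.triu_indices(n, k=1)
    return np.column_stack((i, j, dense[i, j]))


# Broadcasting builds the full N x N count matrix in one step (small N only).
dense = (rows[:, None, :] == rows[None, :, :]).sum(axis=2)
coords = dense_to_coordinates(dense)
print(coords)
```

The diagonal of `dense` holds P (every row matches itself everywhere), which is why only the strict upper triangle is kept.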
Currently, I use the following loop-based implementation, which is only practical for small matrices:
import itertools

import numpy as np


def pairwise_identical_elements(small_matrix):
    """Return a list of [i, j, count] for every pair of rows i < j."""
    n = small_matrix.shape[0]
    sparse_coordinate_matrix = []
    # Iterate over index pairs directly instead of keeping two iterators in step.
    for idx1, idx2 in itertools.combinations(range(n), 2):
        # Boolean mask of positions where the two rows hold the same value.
        matches = small_matrix[idx1] == small_matrix[idx2]
        count = int(np.count_nonzero(matches))
        sparse_coordinate_matrix.append([idx1, idx2, count])
    return sparse_coordinate_matrix
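For comparison, here is a sketch of a variant that produces the same triples as the function above but does one vectorized comparison per row rather than one per pair. The name `pairwise_identical_by_row` is hypothetical, and this is only an illustration of reducing the Python-level loop from N(N-1)/2 iterations to N-1, not a full solution.

```python
import numpy as np


def pairwise_identical_by_row(matrix):
    """Hypothetical variant: compare each row against all later rows at once."""
    n = matrix.shape[0]
    blocks = []
    for i in range(n - 1):
        # One broadcasted comparison of row i with rows i+1 .. n-1.
        counts = (matrix[i + 1:] == matrix[i]).sum(axis=1)
        j = np.arange(i + 1, n)
        blocks.append(np.column_stack((np.full(j.size, i), j, counts)))
    if not blocks:
        # Fewer than two rows: there are no pairs to report.
        return np.empty((0, 3), dtype=np.int64)
    return np.vstack(blocks)


example = np.array([
    [5, 12, 14, 200, 0, 45223],
    [7, 12, 14, 0, 200, 60000],
    [7, 6, 23, 0, 0, 45223],
    [5, 6, 14, 200, 0, 45223],
])
result = pairwise_identical_by_row(example)
print(result)
```

The result still has N(N-1)/2 rows, so for N = 200,000 the blocks would have to be written out or filtered as they are produced rather than stacked in memory.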
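I have also looked at the Jaccard similarity implementations in scipy and sklearn, but, as far as I can tell, they assume binary input vectors. I could binarize my entries (for example, encoding a "9" as a one-hot 9-element vector), but that does not seem feasible when the values can be large (such as "45223").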
So, is there a more efficient / pythonic way to do this, for example with numpy or scipy, that avoids looping over every pair?
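Edit: after exploring scipy further, I found something that closely matches what I want, namely scipy.spatial.distance.pdist with the Hamming metric. However, it returns its output in "condensed" form, and since I want to avoid building a full dense array to save memory, the question could become: how do I convert a condensed distance matrix into a sparse one?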
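A minimal sketch of the pdist route mentioned in the edit, assuming the Hamming metric is the one intended (note that `pdist` lives in `scipy.spatial.distance`). The function name `identical_counts_via_pdist` is hypothetical. It relies on the Hamming distance being the fraction of positions that differ, so the number of identical positions is P times one minus that fraction.

```python
import numpy as np
from scipy.spatial.distance import pdist


def identical_counts_via_pdist(matrix):
    """Hypothetical helper: (i, j, count) triples from a condensed pdist result."""
    n, p = matrix.shape
    # 'hamming' returns, for each pair, the fraction of positions that differ.
    mismatch_fraction = pdist(matrix, metric="hamming")
    # Convert the fraction back to an integer count of matching positions.
    counts = np.rint(p * (1.0 - mismatch_fraction)).astype(np.int64)
    # The condensed vector is ordered (0,1), (0,2), ..., (n-2,n-1),
    # which is the same order np.triu_indices produces with k=1.
    i, j = np.triu_indices(n, k=1)
    return np.column_stack((i, j, counts))


example = np.array([
    [5, 12, 14, 200, 0, 45223],
    [7, 12, 14, 0, 200, 60000],
    [7, 6, 23, 0, 0, 45223],
    [5, 6, 14, 200, 0, 45223],
])
triples = identical_counts_via_pdist(example)
print(triples)
```

The condensed vector itself has N(N-1)/2 floating-point entries, which for N = 200,000 is about 2 x 10^10 values, or roughly 160 GB at 8 bytes each. This sketch therefore only illustrates the index mapping and does not solve the memory problem.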