Efficient pairwise calculation of identical elements in a large two-dimensional numpy matrix

I have a two-dimensional numpy array with hundreds of thousands of rows and about a thousand columns (say an N x P array with N = 200,000 and P = 1000). The goal is to count the identical elements between each pair of row vectors, ideally using numpy array magic that avoids an explicit loop over all 199,999 * 100,000 such pairs. Since storing a 200,000 x 200,000 array is probably not practical, the output would likely be in a sparse, three-column coordinate format. For example, if the input is:

5 12 14 200   0 45223
7 12 14   0 200 60000
7  6 23   0   0 45223
5  6 14 200   0 45223

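the resulting (dense) NxN matrix M would be (ignoring the diagonal elements):

0 2 2 5
2 0 2 1
2 2 0 3
5 1 3 0

where Mij is the number of identical elements between row i and row j, with 0-based indexing. The equivalent sparse output would be:

0 1 2
0 2 2
0 3 5
1 2 2
1 3 1
2 3 3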

A naive, very inefficient way to implement this would be:

import itertools
import numpy as np

def pairwise_identical_elements(small_matrix):
    """Return [i, j, count] for every row pair i < j (naive O(N^2 * P) loop)."""
    indexed_rows = enumerate(small_matrix)
    sparse_coordinate_matrix = []
    for (idx1, row1), (idx2, row2) in itertools.combinations(indexed_rows, 2):
        # Count positions where both rows hold the same value
        count = int(np.count_nonzero(row1 == row2))
        sparse_coordinate_matrix.append([idx1, idx2, count])
    return sparse_coordinate_matrix

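I looked at distance-metric implementations such as Jaccard similarity in scipy and sklearn, but they appear to assume binary input vectors. I also tried adding a third dimension to make the entries binary (for example, the entry "9" becoming a vector with a 1 in the 9th position), but that runs into obvious memory problems (consider what the entry "45223" would require).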

Is there a more efficient, scalable and/or pythonic solution, using numpy or scipy magic, that I have missed?

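Edit: after looking further into scipy, I found something that closely matches what I am trying to do, namely scipy.spatial.distance.pdist with the Hamming metric. However, it returns its output in "condensed" form, and since the aim is to avoid converting to a full dense array in order to save memory, the question becomes: how can a condensed distance matrix be converted into a sparse one?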


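As mentioned in the comments, scipy's pdist with "hamming" is the simplest and most efficient way to solve this, in terms of both memory and CPU time.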
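A minimal sketch of that approach, assuming the input fits in memory as a numeric array. The helper name `pairwise_match_counts` is made up for illustration. The Hamming distance returned by pdist is the fraction of positions that differ, so the number of identical elements is (1 - distance) * P:

```python
import numpy as np
from scipy.spatial.distance import pdist

def pairwise_match_counts(matrix):
    """Hypothetical helper: condensed vector of match counts for row pairs i < j."""
    matrix = np.asarray(matrix)
    p = matrix.shape[1]
    # "hamming" yields the fraction of positions in which two rows differ
    mismatch_fraction = pdist(matrix, metric="hamming")
    # Convert the fraction back to a count of identical positions
    return np.rint((1.0 - mismatch_fraction) * p).astype(np.int64)

small_matrix = np.array([
    [5, 12, 14, 200,   0, 45223],
    [7, 12, 14,   0, 200, 60000],
    [7,  6, 23,   0,   0, 45223],
    [5,  6, 14, 200,   0, 45223],
])
counts = pairwise_match_counts(small_matrix)
print(counts)
```

The entries are ordered by pair: (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3).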

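You will not get more memory-efficient than its condensed output. Indeed, writing the result in "sparse" format requires an (N*(N-1)/2, 3) matrix, compared with the vector of length N*(N-1)/2 returned by pdist.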
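If the coordinate form is still needed, one possible sketch rebuilds the (i, j) indices from the condensed vector with `np.triu_indices`, whose row-major ordering of the upper triangle matches the ordering pdist uses. The helper names `condensed_to_coordinates` and `storage_bytes` are hypothetical, and the 8-byte item size is an assumption (float64 or int64 entries):

```python
import numpy as np
from scipy.spatial.distance import pdist

def condensed_to_coordinates(condensed, n):
    """Hypothetical helper: expand a condensed vector into rows of (i, j, value)."""
    # Upper-triangle indices (i < j) in the same order as pdist's output
    rows, cols = np.triu_indices(n, k=1)
    return np.column_stack((rows, cols, condensed))

def storage_bytes(n, itemsize=8):
    """Bytes needed for the condensed vector and for the 3-column form."""
    n_pairs = n * (n - 1) // 2
    return n_pairs * itemsize, 3 * n_pairs * itemsize

small_matrix = np.array([
    [5, 12, 14, 200,   0, 45223],
    [7, 12, 14,   0, 200, 60000],
    [7,  6, 23,   0,   0, 45223],
    [5,  6, 14, 200,   0, 45223],
])
n, p = small_matrix.shape
# Match counts derived from the Hamming distance (fraction of differing positions)
condensed = np.rint((1.0 - pdist(small_matrix, metric="hamming")) * p).astype(np.int64)
coords = condensed_to_coordinates(condensed, n)
print(coords)
print(storage_bytes(200_000))
```

With 8-byte entries and N = 200,000, this arithmetic gives roughly 160 GB for the condensed vector and three times that for the coordinate form, so at that scale either result would probably have to be produced and consumed in blocks of rows.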


Source: https://habr.com/ru/post/1686120/

