Find the most frequent row (mode) of a matrix of vectors - Python / NumPy

I have a NumPy array of shape (?, n) that represents a list of n-dimensional vectors.

I want to find the most frequent row.

It seems the best way is still to simply iterate over all the entries and keep a count, but it seems odd that NumPy or SciPy would not have something built in for this task.
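
For reference, the iterate-and-count baseline described above might look like the following minimal sketch. The helper name mode_rows_counter and the sample array are illustrative only, and it assumes a 2-D array whose rows can be converted to hashable tuples.

```python
from collections import Counter

import numpy as np


def mode_rows_counter(a):
    """Return the most frequent row of a 2-D array by plain counting."""
    # Tuples are hashable, so each row can serve as a Counter key.
    counts = Counter(map(tuple, a))
    row, _ = counts.most_common(1)[0]
    return np.array(row, dtype=a.dtype)


a = np.array([[1, 2], [3, 4], [1, 2], [5, 6], [1, 2], [3, 4]])
print(mode_rows_counter(a))  # [1 2]
```

This runs a Python-level loop over the rows, so it is mainly useful as a correctness check for faster vectorized versions.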

+4
3 answers

Here's an approach using NumPy views that should be pretty efficient -

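import numpy as np

def mode_rows(a):
    # View each row as one opaque void element so np.unique compares whole rows.
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    # ids: index of the first occurrence of each unique row; count: its frequency.
    _, ids, count = np.unique(a.view(void_dt).ravel(),
                              return_index=True, return_counts=True)
    largest_count_id = ids[count.argmax()]
    most_frequent_row = a[largest_count_id]
    return most_frequent_row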

Sample run -

In [45]: # Let's have a random array with three rows (2,4,8) and two rows (1,7)
    ...: # being duplicates. Thus, the most frequent row must be row 2 here.
    ...: a = np.random.randint(0,9,(9,5))
    ...: a[4] = a[8]
    ...: a[2] = a[4]
    ...: 
    ...: a[1] = a[7]
    ...: 

In [46]: a
Out[46]: 
array([[8, 8, 7, 0, 7],
       [7, 8, 2, 6, 1],
       [2, 2, 5, 7, 6],
       [6, 5, 8, 8, 5],
       [2, 2, 5, 7, 6],
       [5, 7, 3, 6, 3],
       [2, 8, 7, 2, 0],
       [7, 8, 2, 6, 1],
       [2, 2, 5, 7, 6]])

In [47]: mode_rows(a)
Out[47]: array([2, 2, 5, 7, 6])
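
As an alternative sketch, NumPy 1.13 and later accept an axis argument in np.unique, which avoids the manual void view for 2-D input. The function name mode_rows_unique and the sample array are hypothetical, and unlike mode_rows above it returns the row taken from the sorted unique array rather than indexing back into a.

```python
import numpy as np


def mode_rows_unique(a):
    # axis=0 makes np.unique treat every row as a single item.
    rows, counts = np.unique(a, axis=0, return_counts=True)
    # argmax returns the first maximum, so ties go to the
    # lexicographically smallest row.
    return rows[counts.argmax()]


a = np.array([[8, 8, 7], [2, 2, 5], [7, 8, 2], [2, 2, 5], [7, 8, 2], [2, 2, 5]])
print(mode_rows_unique(a))  # [2 2 5]
```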
+3

With Pandas, you can group by all columns and count the size of each group:

import numpy as np
import pandas as pd

# generate sample data
ncol = 5
nrow = 20000
matrix = np.random.randint(0,ncol,ncol*nrow).reshape(nrow,ncol)
df = pd.DataFrame(matrix)

df.head()
   0  1  2  3  4
0  3  0  4  4  4
1  4  0  0  2  0
2  3  3  2  0  0
3  0  3  4  3  3
4  1  1  3  3  3

# count duplicated rows
(df.groupby(df.columns.tolist())
   .size()
   .sort_values(ascending=False))

Output:

0  1  2  3  4
4  2  2  1  1    17
2  2  4  2  3    16
3  2  1  2  2    15
   1  2  4  3    15
                 ..
4  1  3  0  1     1
1  2  3  0  4     1

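The index levels are the row values and the final column is the number of occurrences, so the first entry is the most frequent row.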
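
If only the most frequent row is needed, sorting can be skipped. The following is a minimal sketch on a small made-up DataFrame (not the random data above); it relies on idxmax returning the index label of the largest count, which for a multi-column groupby is a tuple of the row values.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample: the row (2, 2, 5) occurs three times.
df = pd.DataFrame([[8, 8, 7], [2, 2, 5], [7, 8, 2], [2, 2, 5], [2, 2, 5]])

# size() counts each distinct row; idxmax() returns the index label
# (a tuple of the row values) that has the largest count.
counts = df.groupby(df.columns.tolist()).size()
most_frequent_row = np.array(counts.idxmax())
print(most_frequent_row)  # [2 2 5]
```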

0

The numpy_indexed package (disclaimer: I am its author) has functionality that does just that, and it works on any number of dimensions:

import numpy_indexed as npi
row = npi.mode(arr)

Under the hood, this is similar to Divakar's solution in terms of algorithm and complexity, with a few more bells and whistles; see the "weights" and "return_indices" kwargs.

0

Source: https://habr.com/ru/post/1675389/

