How to sort strings of a binary array as if they were long binary numbers?

There is an array of 2D numpy sized about 500,000 lines with 512 values ​​of each line:

[
  [1,0,1,...,0,0,1], # 512 1 or 0's
  [0,1,0,...,0,1,1],
  ...
  [0,0,1,...,1,0,1], # row number 500000
]

How to sort lines in ascending order, as if each line was a long 512-bit integer?

[
  [0,0,1,...,1,0,1],
  [0,1,0,...,0,1,1],
  [1,0,1,...,0,0,1],
  ...
]
+4
source share
4 answers

Instead of converting to strings, you can also use a view void(from @Jaime here ) of the data and argsortto that.

def sort_bin(b):
    b_view = np.ascontiguousarray(b).view(np.dtype((np.void, b.dtype.itemsize * b.shape[1])))
    return b[np.argsort(b_view.ravel())] #as per Divakar suggestion

Testing

np.random.seed(0)

b = np.random.randint(0, 2, (10,5))
print(b)
print(sort_bin(b))

[[0 1 1 0 1]
 [1 1 1 1 1]
 [1 0 0 1 0]
 ..., 
 [1 0 1 1 0]
 [0 1 0 1 1]
 [1 1 1 0 1]]
[[0 0 0 0 1]
 [0 1 0 1 1]
 [0 1 1 0 0]
 ..., 
 [1 1 1 0 1]
 [1 1 1 1 0]
 [1 1 1 1 1]]

It should be much faster and less intense in memory, because b_view- it's just a representation inb

t = np.random.randint(0,2,(2000,512))

%timeit sort_bin(t)
100 loops, best of 3: 3.09 ms per loop

%timeit np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, t))])
1 loop, best of 3: 3.29 s per loop

About 1000 times faster

+3
source

stable 512 , .

  • , ( )
  • ... ...
  • ,

: , :

11
01
00

, :

00
11
01

, 0s . , , , 01 00, , , , , :

00
01
11
0

string row, np.sort()

, array:

a = np.array([[1,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,1,1]])

strings row, np.apply_along_axis:

a = np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, a)

a:

array(['1010', '0010', '0011', '0011'], dtype='<U4')

sort strings np.sort():

a = np.sort(a)

a:

array(['0010', '0011', '0011', '1010'], dtype='<U4')

:

a = np.array([[int(i) for i in r] for r in a])

a:

array([[0, 0, 1, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [1, 0, 1, 0]])

:

a = np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, a))])
0

It is slow, but it does the job.

def sort_col(arr, col_num=0):
# if we have sorted over all columns return array
if col_num >= arr.shape[1]:
    return arr

# sort array over given column
arr_sorted = arr[arr[:, col_num].argsort()]

# if the number of 1s in the given column is not equal to the total number
# of rows neither equal to 0, split on 1 and 0, sort and then merge
if len(arr) > np.sum(arr_sorted[:, col_num]) > 0:
    arr_sorted0s = sort_col(arr_sorted[arr_sorted[:, col_num]==0], col_num+1)
    arr_sorted1s = sort_col(arr_sorted[arr_sorted[:, col_num]==1], col_num+1)
    # change order of stacking if you want ascenting order
    return np.vstack((arr_sorted0s, arr_sorted1s))

# if the number of 1s in the given column is equal to the total number
# of rows or equal to 0, just go to the next iteration
return sort_col(arr_sorted, col_num + 1)



np.random.seed(0)
a = np.random.randint(0, 2, (5, 4))
print(a)
print(sort_col(a))

# prints
[[0 1 1 0]
 [1 1 1 1]
 [1 1 1 0]
 [0 1 0 0]
 [0 0 0 1]]
[[0 0 0 1]
 [0 1 0 0]
 [0 1 1 0]
 [1 1 1 0]
 [1 1 1 1]]

Change Or better yet, use the Daniels solution. I did not check for new answers before submitting my code.

0
source

Source: https://habr.com/ru/post/1688206/


All Articles