Create an intersection of two or more 2d numpy arrays based on a common value in one column

I have 3 numpy recarrays with the following structure. The first column is a position (integer), and the second column is a score (float).

Input:

    a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)],
                 dtype=[('position', '<i4'), ('score', '<f4')])
    b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)],
                 dtype=[('position', '<i4'), ('score', '<f4')])
    c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)],
                 dtype=[('position', '<i4'), ('score', '<f4')])

All 3 arrays contain the same number of elements.
I am looking for an efficient way to combine these three 2d arrays into one array based on the position column.

The output array for the above example should look like this:

Output:

    output = np.array([(3, 12.32, 8.41, 7.41)],
                      dtype=[('position', '<i4'), ('score1', '<f4'),
                             ('score2', '<f4'), ('score3', '<f4')])

Only the row with position 3 is in the output array, because this is the only position that occurs in all three input arrays.

Update: My naive approach would follow these steps:

  • create a vector from the first column of each of my 3 input arrays.
  • use intersect1d to get the intersection of these three vectors.
  • somehow retrieve the indices of the intersecting positions in each of the 3 input arrays.
  • create a new array from the filtered rows of the 3 input arrays.

Update 2: Each position value can be in one, two, or all three input arrays. In my output array, I only want to include rows for position values that appear in all 3 input arrays.
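
The naive approach outlined in the update could be sketched roughly as follows. This is only one possible reading of those steps: the helper name `scores_at` is made up for illustration, and the sketch assumes that a position never occurs more than once within a single input array.

```python
import numpy as np
from functools import reduce

dt = [('position', '<i4'), ('score', '<f4')]
a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)], dtype=dt)
b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)], dtype=dt)
c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)], dtype=dt)

# Steps 1 and 2: intersect the position columns.
# np.intersect1d takes two arrays at a time, so chain it with reduce.
common = reduce(np.intersect1d, (a['position'], b['position'], c['position']))

def scores_at(arr, positions):
    """Hypothetical helper: scores of `arr` at the given positions.

    Assumes every requested position exists exactly once in `arr`.
    """
    arr = np.sort(arr, order='position')
    idx = np.searchsorted(arr['position'], positions)
    return arr['score'][idx]

# Steps 3 and 4: look up the rows and assemble the output array.
out_dt = [('position', '<i4'), ('score1', '<f4'),
          ('score2', '<f4'), ('score3', '<f4')]
output = np.empty(len(common), dtype=out_dt)
output['position'] = common
output['score1'] = scores_at(a, common)
output['score2'] = scores_at(b, common)
output['score3'] = scores_at(c, common)
```

For the example input, `output` contains a single row for position 3.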

1 answer

Here is one approach that I think should be fast enough. The first thing you want to do is count the number of occurrences of each position. This function will handle that:

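    import numpy as np

    def count_positions(positions):
        positions = np.sort(positions)
        # diff[i] is True where positions[i] is the last element of a run of equal values
        diff = np.ones(len(positions), 'bool')
        diff[:-1] = positions[1:] != positions[:-1]
        # indices of the run ends; their differences are the run lengths
        count = diff.nonzero()[0]
        count[1:] = count[1:] - count[:-1]
        count[0] += 1
        uniqPositions = positions[diff]
        return uniqPositions, count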
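
As a quick sanity check (not part of the original answer), the function can be run on the concatenated positions from the question and compared against `np.unique(..., return_counts=True)`, which is available in reasonably recent NumPy versions and should give the same result:

```python
import numpy as np

def count_positions(positions):
    positions = np.sort(positions)
    diff = np.ones(len(positions), 'bool')
    diff[:-1] = positions[1:] != positions[:-1]
    count = diff.nonzero()[0]
    count[1:] = count[1:] - count[:-1]
    count[0] += 1
    uniqPositions = positions[diff]
    return uniqPositions, count

# Positions of a, b and c from the question, concatenated.
positions = np.array([1, 2, 3, 3, 6, 4, 3, 7, 1])

uniq, count = count_positions(positions)
# uniq  -> [1, 2, 3, 4, 6, 7]
# count -> [2, 1, 3, 1, 1, 1]

# Built-in equivalent for comparison.
uniq_np, count_np = np.unique(positions, return_counts=True)
```

Position 3 is the only one with a count of 3, which matches the expected output.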

Now, using the function above, you want to keep only the positions that occur three times:

    positions = np.concatenate((a['position'], b['position'], c['position']))
    uinqPos, count = count_positions(positions)
    uinqPos = uinqPos[count == 3]

We will use a sorted search, so we first sort a, b and c by position:

    a.sort(order='position')
    b.sort(order='position')
    c.sort(order='position')

Now we can use searchsorted to find where each of our uinqPos values is located in each array:

    # plain float array: column 0 is the position, columns 1-3 are the scores
    new_array = np.empty((len(uinqPos), 4))
    new_array[:, 0] = uinqPos
    index = a['position'].searchsorted(uinqPos)
    new_array[:, 1] = a['score'][index]
    index = b['position'].searchsorted(uinqPos)
    new_array[:, 2] = b['score'][index]
    index = c['position'].searchsorted(uinqPos)
    new_array[:, 3] = c['score'][index]
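
Note that `new_array` is a plain 2d float array, not the structured array shown in the question. If the structured layout is needed, a conversion along these lines might work. The `new_array` below is a stand-in holding the values from the question's example, and the field names are taken from the desired output in the question:

```python
import numpy as np

# Stand-in for the (n, 4) float array built above (example values from the question).
new_array = np.array([[3, 12.32, 8.41, 7.41]])

out_dtype = [('position', '<i4'), ('score1', '<f4'),
             ('score2', '<f4'), ('score3', '<f4')]
output = np.empty(len(new_array), dtype=out_dtype)

# Copy each column into the field with the matching index;
# values are cast to the field's dtype on assignment.
for col, name in enumerate(output.dtype.names):
    output[name] = new_array[:, col]
```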

There might be a more elegant solution using dictionaries, but this is the one I thought of first, so I will leave that to someone else.
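
For what it's worth, the dictionary-based alternative is not spelled out in this answer. One possible sketch, assuming positions are unique within each input array, might look like this (the names `lookups`, `common` and `rows` are illustrative):

```python
import numpy as np

dt = [('position', '<i4'), ('score', '<f4')]
a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)], dtype=dt)
b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)], dtype=dt)
c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)], dtype=dt)

# One position -> score mapping per input array.
lookups = [dict(zip(arr['position'].tolist(), arr['score'].tolist()))
           for arr in (a, b, c)]

# Positions that are keys in every dictionary.
common = sorted(set(lookups[0]).intersection(*lookups[1:]))

# One output row per common position: (position, score1, score2, score3).
rows = [(pos,) + tuple(d[pos] for d in lookups) for pos in common]

out_dt = [('position', '<i4'), ('score1', '<f4'),
          ('score2', '<f4'), ('score3', '<f4')]
output = np.array(rows, dtype=out_dt)
```

This is likely slower than the sort-based version for large arrays, since it loops in Python, but it extends to any number of input arrays without changes.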


Source: https://habr.com/ru/post/1392501/

