Create an intersection of two or more 2d numpy arrays based on a common value in one column

Question

I have 3 numpy recarrays with the following structure. The first column is a position (integer), and the second column is a score (float).

Input:

    a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)],
                 dtype=[('position', '<i4'), ('score', '<f4')])
    b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)],
                 dtype=[('position', '<i4'), ('score', '<f4')])
    c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)],
                 dtype=[('position', '<i4'), ('score', '<f4')])

All 3 arrays contain the same number of elements.
I am looking for an efficient way to combine these three 2d arrays into one array based on the position column.

The output array for the above example should look like this:

Output:

    output = np.array([(3, 12.32, 8.41, 7.41)],
                      dtype=[('position', '<i4'), ('score1', '<f4'),
                             ('score2', '<f4'), ('score3', '<f4')])

Only the row with position 3 is in the output array, because this is the only position that appears in all three input arrays.

Update: My naive approach would follow these steps:

1. Create vectors from the first columns of my 3 input arrays.
2. Use intersect1d to get the intersection of these three vectors.
3. Somehow retrieve the indices of the intersecting values in each of the 3 input arrays.
4. Create a new array with the filtered rows from the 3 input arrays.
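
The steps above could be sketched roughly as follows. This is only an illustration, not a tested production routine: the helper name `rows_for` is hypothetical, and the sketch assumes each position occurs at most once within a single input array.

```python
from functools import reduce

import numpy as np

dt = [('position', '<i4'), ('score', '<f4')]
a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)], dtype=dt)
b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)], dtype=dt)
c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)], dtype=dt)
arrays = [a, b, c]

# Steps 1-2: intersect the position columns of all inputs (result is sorted).
common = reduce(np.intersect1d, [arr['position'] for arr in arrays])


def rows_for(arr, common):
    """Hypothetical helper: row indices in arr that hold the common positions.

    Assumes every value in common is present in arr exactly once.
    """
    order = np.argsort(arr['position'])
    idx = np.searchsorted(arr['position'], common, sorter=order)
    return order[idx]


# Steps 3-4: look up the matching rows and assemble a structured output.
out_dtype = [('position', '<i4')] + [
    ('score%d' % (i + 1), '<f4') for i in range(len(arrays))
]
output = np.empty(len(common), dtype=out_dtype)
output['position'] = common
for i, arr in enumerate(arrays):
    output['score%d' % (i + 1)] = arr['score'][rows_for(arr, common)]
```

Using `argsort` with the `sorter` argument of `searchsorted` leaves the input arrays in their original order, at the cost of one extra index array per input.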

Update 2: Each position value can occur in one, two, or all three input arrays. In my output array, I only want to include rows for position values that appear in all 3 input arrays.

python arrays set numpy intersection

Ümit Jan 23 '12 at 16:31

1 answer

Bi rico · Accepted Answer · 2012-01-23T20:14:51+0000

Here is one approach; I think it should be fast enough. The first thing you want to do is count the number of occurrences of each position. This function handles that:

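    import numpy as np

    def count_positions(positions):
        positions = np.sort(positions)
        # diff[i] is True where positions[i] is the last element of a run of equal values
        diff = np.ones(len(positions), 'bool')
        diff[:-1] = positions[1:] != positions[:-1]
        # Indices of the run ends; consecutive differences give the run lengths
        count = diff.nonzero()[0]
        count[1:] = count[1:] - count[:-1]
        count[0] += 1
        uniqPositions = positions[diff]
        return uniqPositions, count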
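
As a quick sanity check, the counting function can be compared against `np.unique` with `return_counts=True`, which newer NumPy releases (1.9 and later) provide. The sample positions below are simply the position columns of the question's three arrays concatenated; the comparison itself is an illustration and not part of the original answer.

```python
import numpy as np


def count_positions(positions):
    positions = np.sort(positions)
    # Mark the last element of every run of equal values.
    diff = np.ones(len(positions), 'bool')
    diff[:-1] = positions[1:] != positions[:-1]
    # Run-end indices; consecutive differences are the run lengths.
    count = diff.nonzero()[0]
    count[1:] = count[1:] - count[:-1]
    count[0] += 1
    return positions[diff], count


# Position columns of a, b and c from the question, concatenated.
positions = np.array([1, 2, 3, 3, 6, 4, 3, 7, 1])

uniq, count = count_positions(positions)
uniq_np, count_np = np.unique(positions, return_counts=True)

# Position 3 is the only value that occurs three times.
in_all_three = uniq[count == 3]
```

Note that counting occurrences only identifies positions shared by all arrays if no position is repeated within a single array.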

Now, using the function above, keep only the positions that occur three times:

    positions = np.concatenate((a['position'], b['position'], c['position']))
    uinqPos, count = count_positions(positions)
    uinqPos = uinqPos[count == 3]

We will be using searchsorted, so we sort a, b and c by position:

    a.sort(order='position')
    b.sort(order='position')
    c.sort(order='position')

Now we can use searchsorted to find where in each array each value of uinqPos is located:

    new_array = np.empty((len(uinqPos), 4))
    new_array[:, 0] = uinqPos
    index = a['position'].searchsorted(uinqPos)
    new_array[:, 1] = a['score'][index]
    index = b['position'].searchsorted(uinqPos)
    new_array[:, 2] = b['score'][index]
    index = c['position'].searchsorted(uinqPos)
    new_array[:, 3] = c['score'][index]
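
Since the question asks about "two or more" arrays, the same count-then-search idea might be wrapped in a function that accepts any number of inputs. This is a minimal sketch under stated assumptions: the name `intersect_on_position` is hypothetical, `np.unique(..., return_counts=True)` stands in for the hand-written counting function, and positions are assumed to be unique within each input array.

```python
import numpy as np


def intersect_on_position(*arrays):
    """Hypothetical helper: rows whose position occurs in every input array.

    Assumes positions are unique within each input array.
    Returns a plain 2D float array: position, then one score column per input.
    """
    positions = np.concatenate([arr['position'] for arr in arrays])
    uniq, count = np.unique(positions, return_counts=True)
    common = uniq[count == len(arrays)]

    result = np.empty((len(common), len(arrays) + 1))
    result[:, 0] = common
    for col, arr in enumerate(arrays, start=1):
        # Sorted copy, so the caller's array is left untouched.
        arr_sorted = np.sort(arr, order='position')
        index = arr_sorted['position'].searchsorted(common)
        result[:, col] = arr_sorted['score'][index]
    return result


dt = [('position', '<i4'), ('score', '<f4')]
a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)], dtype=dt)
b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)], dtype=dt)
c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)], dtype=dt)

all_three = intersect_on_position(a, b, c)
a_and_c = intersect_on_position(a, c)
```

The lookup with `searchsorted` is only safe because `common` contains nothing but positions present in every array; for a value missing from an array it would return the insertion point, which may point at an unrelated row or past the end.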

There might be a more elegant solution using dictionaries, but I thought of this one first, so I will leave that to someone else.
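
For comparison, one way the dictionary-based idea might look is sketched below. It is not taken from the answer: the variable names are illustrative, and it assumes each position occurs at most once per array (a repeated position would silently keep only its last score). It is likely slower than the sorted-search version on large inputs because it loops in Python.

```python
import numpy as np

dt = [('position', '<i4'), ('score', '<f4')]
a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)], dtype=dt)
b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)], dtype=dt)
c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)], dtype=dt)

# One dict per input array, mapping position -> score.
lookups = [
    dict(zip(arr['position'].tolist(), arr['score'].tolist()))
    for arr in (a, b, c)
]

# Positions that are keys in every dict.
common = sorted(set.intersection(*(set(d) for d in lookups)))

# One output row per common position: (position, score1, score2, score3).
rows = [(pos,) + tuple(d[pos] for d in lookups) for pos in common]
out_dtype = [('position', '<i4'), ('score1', '<f4'),
             ('score2', '<f4'), ('score3', '<f4')]
output = np.array(rows, dtype=out_dtype)
```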
