A quick (vectorized) way to find points in one DF belonging to rectangles of the same size (given by two points) from the second DF

I have a data frame "A" that looks like this:

type latw lngs late lngn 0 1000 45.457966 9.174864 45.458030 9.174907 1 1000 45.457966 9.174864 45.458030 9.174907 2 1000 45.458030 9.174864 45.458094 9.174907 3 1000 45.458094 9.174864 45.458157 9.174907 4 1000 45.458157 9.174864 45.458221 9.174907 5 1000 45.458221 9.174864 45.458285 9.174907 6 1000 45.458285 9.174864 45.458349 9.174907 7 1000 45.458349 9.174864 45.458413 9.174907 8 1000 45.458413 9.174864 45.458477 9.174907 9 1000 45.458477 9.174864 45.458540 9.174907 10 1000 45.458540 9.174864 45.458604 9.174907 11 1000 45.458604 9.174864 45.458668 9.174907 12 1000 45.458668 9.174864 45.458732 9.174907 13 1000 45.458732 9.174864 45.458796 9.174907 14 1000 45.458796 9.174864 45.458860 9.174907 15 1000 45.458860 9.174864 45.458923 9.174907 16 1000 45.458923 9.174864 45.458987 9.174907 17 1000 45.458987 9.174864 45.459051 9.174907 18 1000 45.459051 9.174864 45.459115 9.174907 19 1000 45.459115 9.174864 45.459179 9.174907 20 1000 45.459179 9.174864 45.459243 9.174907 21 1000 45.459243 9.174864 45.459306 9.174907 22 1000 45.459306 9.174864 45.459370 9.174907 23 1000 45.459370 9.174864 45.459434 9.174907 24 1000 45.459434 9.174864 45.459498 9.174907 25 1000 45.459498 9.174864 45.459562 9.174907 26 1000 45.459562 9.174864 45.459626 9.174907 27 1000 45.459626 9.174864 45.459689 9.174907 28 1000 45.459689 9.174864 45.459753 9.174907 29 1000 45.459753 9.174864 45.459817 9.174907 ... ... ... ... ... ... 970 1000 45.460583 9.175545 45.460647 9.175587 971 1000 45.460647 9.175545 45.460711 9.175587 972 1000 45.460711 9.175545 45.460775 9.175587 973 1000 45.460775 9.175545 45.460838 9.175587 974 1000 45.460838 9.175545 45.460902 9.175587 975 1000 45.460902 9.175545 45.460966 9.175587 976 1000 45.460966 9.175545 45.461030 9.175587 977 1000 45.461030 9.175545 45.461094 9.175587 978 1000 45.461094 9.175545 45.461157 9.175587 979 1000 45.461157 9.175545 45.461221 9.175587 980 1000 45.461221 9.175545 45.461285 9.175587 981 1000 45.461285 9.175545 45.461349 9.175587 982 1000 45.461349 9.175545 45.461413 9.175587 983 1000 45.461413 9.175545 45.461477 9.175587 984 1000 45.461477 9.175545 45.461540 9.175587 985 1000 45.461540 9.175545 45.461604 9.175587 986 1000 45.461604 9.175545 45.461668 9.175587 987 1000 45.457966 9.175587 45.458030 9.175630 988 1000 45.458030 9.175587 45.458094 9.175630 989 1000 45.458094 9.175587 45.458157 9.175630 990 1000 45.458157 9.175587 45.458221 9.175630 991 1000 45.458221 9.175587 45.458285 9.175630 992 1000 45.458285 9.175587 45.458349 9.175630 993 1000 45.458349 9.175587 45.458413 9.175630 994 1000 45.458413 9.175587 45.458477 9.175630 995 1000 45.458477 9.175587 45.458540 9.175630 996 1000 45.458540 9.175587 45.458604 9.175630 997 1000 45.458604 9.175587 45.458668 9.175630 998 1000 45.458668 9.175587 45.458732 9.175630 999 1000 45.458732 9.175587 45.458796 9.175630 

It contains 22,000,000 rows Γ— 5 columns and there is a data frame β€œB” that looks like this:

  type Lat Lng 0 0 45.465739 9.180830 1 2 45.463950 9.187113 2 1 45.468015 9.180648 3 1 45.462209 9.187447 4 0 45.459578 9.184007 5 1 45.459822 9.187034 6 2 45.454988 9.180310 7 2 45.459818 9.189377 8 0 45.462200 9.187440 9 0 45.467160 9.180100 10 2 45.459407 9.183300 11 2 45.457699 9.187434 12 1 45.455319 9.186697 13 0 45.461138 9.191943 14 2 45.456397 9.189028 15 0 45.457062 9.185878 16 1 45.461980 9.187024 17 1 45.464319 9.183142 18 2 45.464227 9.187065 19 1 45.460886 9.185216 

It has 2,000,000 rows Γ— 3 columns. I want to replace the value of the data frame type "A" with "B" where:

 A[latw]<B[lat]<A[late] and A[lngs]<B[lng]<B[lngn] 

I want to check that the location from B belongs to one of the rectangles in A.

PS I'm looking for the fastest way in python, for example, using parallel processing.

+5
source share
3 answers

High Level Explanation

This answer is very much in demand. Below are the high level points

  • Use np.searchsorted with west latitudes in A relative to sorted latitudes in B Repeat this for eastern latitudes, but with the side='right' parameter. This tells me where every west-east latitude falls in the spectrum defined in B np.searchsorted - index values. Therefore, if the results in the east are greater than the results in the west, this means that there was a latitude from B between the west and the east.
    • Track where the eastern search exceeds the western search.
    • Find the difference. Not only does it B.Lat me if the eastern search exceeds the western, the difference tells me how much B.Lat is between them. I am expanding arrays to track all possible matches.
  • Repeat this process for the southern and northern longitudes, but only over the subset in which we found the corresponding latitudes. No need to spend effort looking where we know that we have no match.
    • Perform complex slicing to unwind positions.
  • Use numpy structured arrays to find intersections of two-dimensional arrays
  • This last intersection has ordinal positions from A and matching ordinal positions from B
    • Remember, I said that we can match more than one line from B , I take the first.

Demonstration

use A and B provided by @MaxU

 p = process(A, B) print(p) [[3 0] [5 1] [9 2]] 

To show results and compare. I'll put them side by side

 pd.concat([ A.iloc[p[:, 0]].reset_index(drop=True), B.iloc[p[:, 1]].reset_index(drop=True) ], axis=1, keys=['A', 'B']) 

enter image description here

However, to replace type values, I put a flag in the process function to reassign. Just run process(A, B, reassign_type=True)

time
Functions
I defined functions to return the same thing and make a comparison of apples with apples

 def pir(A, B): return A.values[process(A, B)[:, 0], 0] def maxu(A, B): return B.apply( lambda x: A.loc[ (A.latw < x.Lat) & (x.Lat < A.late) & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type' ].head(1), axis=1 ).values.diagonal() 

enter image description here

more rigorous testing

 from timeit import timeit from itertools import product rng = np.arange(20, 51, 5) idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A']) functions = pd.Index(['pir', 'maxu'], name='method') results = pd.DataFrame(index=idx, columns=functions) for bi, ai in idx: n = ai ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]]) A = pd.DataFrame( np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]), columns=['type', 'latw', 'lngs', 'late', 'lngn'] ) m = int(bi ** .5) prng = np.arange(m) / m * 10 + .5 B = pd.DataFrame( np.array(list(product(prng, prng))) + np.array([[40, 30]]), columns=['Lat', "Lng"]) B.insert(0, 'type', np.random.randint(3, size=len(B))) for j in functions: results.set_value( (bi, ai), j, timeit( '{}(A, B)'.format(j), 'from __main__ import {}, A, B'.format(j), number=10 ) ) results.plot(rot=45) 

enter image description here


Code

 from collections import namedtuple TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found']) def two_sided_search(left, right, a): # sort `a` and save order for later argsort = a.argsort() asorted = a[argsort] # where does `left` fall in `asorted` s_left = asorted.searchsorted(left) # where does `right` fall in `asorted` s_right = asorted.searchsorted(right, side='right') rng = np.arange(left.size) # a match happens when where we found `left`, `right` was found later match_idx = rng[s_left < s_right] # ignore positions with no match left_pos = s_left[match_idx] right_pos = s_right[match_idx] # distance from where left was found to where right was found # this represents how many items in a are wrapped by left and right diff_pos = right_pos - left_pos # build an index on `match_idx` paired with all positions in # `asorted`. d2 = left_pos - np.append(0, diff_pos[:-1].cumsum()) p1 = np.arange(len(left_pos)).repeat(diff_pos) p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos) # returns # `match_idx` which is an integer slice of either `left` or `right` # for each element in `match_idx` there are one or more elements in # `a` that were surrounded by `left` and `right` # `p1` are the positions in `match_idx` that are paired with `argsort[p2]` # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]` # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]` # and `left[match_idx[1]] < a[5] < right[match_idx[1]]` # and `left[match_idx[2]] < a[12] < right[match_idx[2]]` # and `left[match_idx[2]] < a[5] < right[match_idx[2]]` return TwoSided(match_idx, p1, argsort[p2]) def intersectNd(a, b): # taken from /questions/1263020/get-intersecting-rows-across-two-2d-numpy-arrays/4018373#4018373 # returns the interesection of 2 2-D arrays # the strategy is to convert to a structured array then # use np.intersect1d then convert back to unstructured ncols = a.shape[1] dtype = dict( names=['f{}'.format(i) for i in range(ncols)], formats=ncols * [a.dtype] ) return np.intersect1d( a.view(dtype), b.view(dtype) ).view(a.dtype).reshape(-1, ncols) def where_in(b1, b2): # return a mask of lenght len(b2) where True # when element of b2 is in b1 bsort = b1.argsort() bsrtd = b1[bsort] bsrch = bsrtd.searchsorted(b2) % bsrtd.size return bsrtd[bsrch] == b2 def process(A, B, reassign_type=False): lats = two_sided_search( A.latw.values, A.late.values, B.Lat.values, ) # subset A & B # no point looking for longitude matches # in rows without a latitude match lngs = A.lngs.values[lats.match] lngn = A.lngn.values[lats.match] u, i = np.unique(lats.found, return_index=1) lng = B.Lng.values[u] lons = two_sided_search( lngs, lngn, lng ) # `lats.match` is where we found successful latitudes # `lons.match` is where we found successful longitudes # since `lons.match` was based on `lats.match` we can slice # to get the original positions of `A` that satisfy both # the latw < blat < late & lngs < blng < lngn match = lats.match[lons.match] # however, for every row in A that might be a match for B, # there may be multilple B's # # We start addressing this by finding # where in lats.idx do we have matches # this gives me a boolean array of lats.match # index values repeated for each row in B # that satisfies our conditions wi = where_in(lons.match, lats.idx) # a (? x 2) array representing all combinations of # rows in A that matched the lon_pos = np.hstack([ lats.match[lons.match[lons.idx], None], u[lons.found, None] ]) lat_pos = np.hstack([ lats.match[lats.idx[wi], None], lats.found[wi, None] ]) x = intersectNd(lat_pos, lon_pos) p, pi = np.unique(x[:, 0], return_index=1) if reassign_type: A.iloc[x[pi, 0], A.columns.get_loc('type')] = \ B.iloc[x[pi, 1], B.columns.get_loc('type')].values else: return x[pi] 
+3
source

You can do something like this:

Assuming we have the following DFs:

 In [150]: A Out[150]: type latw late lngs lngn 0 1000 45.457966 45.458030 9.174864 9.174907 1 1001 45.457966 45.458030 9.174864 9.174907 2 1002 45.458030 45.458094 9.174864 9.174907 3 1003 45.458094 45.458157 9.174864 9.174907 16 1004 45.458923 45.458987 9.174864 9.174907 17 1005 45.458987 45.459051 9.174864 9.174907 970 1006 45.460583 45.460647 9.175545 9.175587 971 1007 45.460647 45.460711 9.175545 9.175587 972 1008 45.460711 45.460775 9.175545 9.175587 996 1009 45.458540 45.458604 9.175587 9.175630 997 1010 45.458604 45.458668 9.175587 9.175630 998 1011 45.458668 45.458732 9.175587 9.175630 999 1012 45.458732 45.458796 9.175587 9.175630 In [151]: B Out[151]: type Lat Lng 0 0 45.4581 9.1749 1 1 45.4590 9.1749 2 2 45.4586 9.1756 

Decision:

 B['new'] = B.apply(lambda x: A.loc[ (A.latw < x.Lat) & (x.Lat < A.late) & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type'].head(1), axis=1) \ .values.diagonal() In [153]: B Out[153]: type Lat Lng new 0 0 45.4581 9.1749 1003.0 1 1 45.4590 9.1749 1005.0 2 2 45.4586 9.1756 1009.0 

PS I'm not sure if this is the fastest way to achieve this ...

+4
source

I launched the solution proposed by MaxU, based on 1000 data, and timeit shows 1.2 s, which is the best of 3. I achieved some improvement in time by creating a list β€œc” that contains the logical results of the comparison and using this list to assign as follows way:

 c=list((A.latw < B.Lat) & (A.late > B.Lat ) & (A.lngs < B.Lng) & (A.lngn>B.Lng)) B.apply(A.loc[ c , 'type'].head(1),axis=1).values.diagonal() 

The first part is executed in microsecond order, and the second part takes milliseconds.

Can you try it on a large dataset and see if it really improves performance.

+2
source

Source: https://habr.com/ru/post/1263015/


All Articles