Vectorization for calculating many distances

I am new to numpy/pandas and vectorized computation. I am working on a data task with two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distances to all places in dataset 2, but I only want the number of places in dataset 2 whose distance is less than the value of variable A. Both datasets are very large, so I need vectorized operations to speed up the calculation.

For example, my dataset1 might look like this:

id lon    lat   varA
1  20.11 19.88  100
2  20.87 18.65  90
3  18.99 20.75  120

and my dataset2 might look like this:

placeid lon lat 
a       18.75 20.77
b       19.77 22.56
c       20.86 23.76
d       17.55 20.74 

For example, for the place with id == 1 in dataset1, I would calculate its distances to all four places (a, b, c, d) in dataset2 and count how many are smaller than varA. If those distances were, say, 90, 70, 120, and 110, with varA equal to 100, the answer for place 1 would be 2.

I know how to compute the distance between a single pair of points (haversine(x, y)), but I am not sure how to vectorize this across both datasets. My current attempt:

dataset2['count'] = dataset1.apply(lambda x:
    haversine(x['lon'], x['lat'], dataset2['lon'], dataset2['lat']).shape[0],
    axis=1)

However, this counts all places in dataset2 rather than only those within varA, and the row-wise apply is slow.

Does anyone know how to do this efficiently?


If you first project your coordinates into a Cartesian system (e.g. the appropriate UTM zone) with pyproj instead of working in lon/lat, you can compute plain Euclidean distances, and the problem becomes MUCH easier with scipy.spatial. Also note that neither df['something'] = df.apply(...) nor np.vectorize() is truly vectorized; under the hood, both are Python-level loops.
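To illustrate the difference, here is a small sketch (the sample coordinates mirror the frames below, and all variable names are mine): a Python-level loop over rows and a single scipy.spatial.distance.cdist call produce the same distance matrix, but cdist computes every pair in one vectorized shot.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

ds1 = pd.DataFrame({'lon': [20.11, 20.87, 18.99], 'lat': [19.88, 18.65, 20.75]})
ds2 = pd.DataFrame({'lon': [18.75, 19.77, 20.86, 17.55], 'lat': [20.77, 22.56, 23.76, 20.74]})

# looped version: one Python-level pass per row of ds1
looped = np.array([
    np.hypot(ds2['lon'] - row['lon'], ds2['lat'] - row['lat'])
    for _, row in ds1.iterrows()
])

# vectorized version: all pairwise euclidean distances in one call
vectorized = distance.cdist(ds1[['lon', 'lat']], ds2[['lon', 'lat']])

assert np.allclose(looped, vectorized)
```

On a few rows the difference is invisible; on hundreds of thousands of rows the looped version pays Python overhead per row while cdist stays in compiled code.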

ds1
    id  lon lat varA
0   1   20.11   19.88   100
1   2   20.87   18.65   90
2   3   18.99   20.75   120

ds2
    placeid lon lat
0   a   18.75   20.77
1   b   19.77   22.56
2   c   20.86   23.76
3   d   17.55   20.74


from scipy.spatial import distance

# get coordinates of each set of points as a numpy array
coords_a = ds1.values[:,(1,2)]
coords_b = ds2.values[:, (1,2)]
coords_a
#out: array([[ 20.11,  19.88],
#       [ 20.87,  18.65],
#       [ 18.99,  20.75]])

distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074,  2.70148108,  3.95182236,  2.70059253],
#       [ 2.99813275,  4.06178532,  5.11000978,  3.92307278],
#       [ 0.24083189,  1.97091349,  3.54358575,  1.44003472]])

distances now holds all pairwise distances. Since coords_a.shape is (3, 2) and coords_b.shape is (4, 2), the result has shape (3, 4). cdist uses the euclidean metric by default, which is only meaningful for projected coordinates. Next, set up the thresholds vara:

vara = np.array([2,4.5,2])

(instead of the original 100, 90, 120, which would not filter anything at these scales). We want to compare the first row of distances against 2, the second against 4.5, and so on, so subtract vara row-wise (reshaping vara into a column vector first so that broadcasting works):

vara.resize(3, 1)
res = distances - vara
#out: array([[-0.37466926,  0.70148108,  1.95182236,  0.70059253],
#       [-1.50186725, -0.43821468,  0.61000978, -0.57692722],
#       [-1.75916811, -0.02908651,  1.54358575, -0.55996528]])

Zero out the positive entries (points farther away than vara) and take absolute values of the rest:

res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926,  0.        ,  0.        ,  0.        ],
#            [ 1.50186725,  0.43821468,  0.        ,  0.57692722],
#            [ 1.75916811,  0.02908651,  0.        ,  0.55996528]])

From here you can get, for example, the sum of the within-threshold distances per place:

sum_ = res.sum(axis=1)
#out:  array([ 0.37466926,  2.51700915,  2.34821989])

or the count you are after:

count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
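If you only need the count, the subtract/zero/absolute detour is unnecessary; a boolean comparison with broadcasting gives the same answer directly (a sketch using the same arrays as above):

```python
import numpy as np
from scipy.spatial import distance

coords_a = np.array([[20.11, 19.88], [20.87, 18.65], [18.99, 20.75]])
coords_b = np.array([[18.75, 20.77], [19.77, 22.56], [20.86, 23.76], [17.55, 20.74]])
vara = np.array([2, 4.5, 2])

distances = distance.cdist(coords_a, coords_b)

# (3, 4) compared against (3, 1) broadcasts row-wise; summing booleans counts the hits
count = (distances < vara[:, None]).sum(axis=1)
# out: array([1, 3, 3])
```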

All of this can be made faster still. If you only need to know which points fall within a given radius, look at scipy.spatial.cKDTree. An example from the scipy docs:

from scipy import spatial

x, y = np.mgrid[0:4, 0:4]
points = list(zip(x.ravel(), y.ravel()))
tree = spatial.cKDTree(points)
tree.query_ball_point([2, 0], 1)
# out: [4, 8, 9, 12]

query_ball_point() returns the indices of all points that lie within distance r of the query point(s) x.
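Applied to this problem (a sketch with my own variable names, again assuming projected coordinates), build the tree on dataset2 once and query it per dataset1 place with that place's own radius:

```python
import numpy as np
from scipy import spatial

coords_a = np.array([[20.11, 19.88], [20.87, 18.65], [18.99, 20.75]])  # dataset1
coords_b = np.array([[18.75, 20.77], [19.77, 22.56], [20.86, 23.76], [17.55, 20.74]])  # dataset2
vara = np.array([2, 4.5, 2])  # per-place radii

tree = spatial.cKDTree(coords_b)

# one query per dataset1 place, each with its own radius
count = np.array([len(tree.query_ball_point(p, r))
                  for p, r in zip(coords_a, vara)])
# out: array([1, 3, 3])
```

Unlike the full cdist matrix, the tree never materializes all n*m distances, which matters when both datasets are large.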

A caveat: all of the above assumes projected (Cartesian) coordinates rather than raw lon/lat, so reproject your data first.

UPDATE:

Here is how to convert from WGS84 (lon/lat) to UTM with pyproj. Pick the EPSG code of the UTM zone that covers your data; you can look it up on epsg.io.

import pyproj

lon = -122.67598
lat = 45.52168
WGS84 = "+init=EPSG:4326"
EPSG3740 = "+init=EPSG:3740"
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)

Proj_to_EPSG3740(lon, lat)
# out: (525304.9265963673, 5040956.147893889)

You can then use df.apply() with Proj_to_EPSG3740 to project every row of your df.


IIUC:

DFs:

In [160]: d1
Out[160]:
   id    lon    lat  varA
0   1  20.11  19.88   100
1   2  20.87  18.65    90
2   3  18.99  20.75   120

In [161]: d2
Out[161]:
  placeid    lon    lat
0       a  18.75  20.77
1       b  19.77  22.56
2       c  20.86  23.76
3       d  17.55  20.74

haversine:

import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
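A quick sanity check of that function (restated here self-contained, with a plain numpy import in place of the deprecated pd.np): one degree of longitude at 45°N is about 78.6 km.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

km = haversine(45, 0, 45, 1)  # ~78.6 km
```

Because it is built entirely from numpy ufuncs, the same function also accepts whole Series or arrays, which is what the merge-based solution below relies on.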

Cross-join d2 with the id == 1 row of d1 and compute the distances:

x = d2.assign(x=1) \
      .merge(d1.loc[d1['id']==1, ['lat','lon']].assign(x=1),
             on='x', suffixes=['','2']) \
      .drop(columns=['x'])

x['dist']  = haversine(x.lat, x.lon, x.lat2, x.lon2)

Result:

In [163]: x
Out[163]:
  placeid    lon    lat   lat2   lon2        dist
0       a  18.75  20.77  19.88  20.11  172.924852
1       b  19.77  22.56  19.88  20.11  300.078600
2       c  20.86  23.76  19.88  20.11  438.324033
3       d  17.55  20.74  19.88  20.11  283.565975

Now filter by varA (for id == 1, nothing is within 100 km):

In [164]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[164]:
Empty DataFrame
Columns: [placeid, lon, lat, lat2, lon2, dist]
Index: []

If we increase varA in d1, we get matches:

In [171]: d1.loc[0, 'varA'] = 350

In [172]: d1
Out[172]:
   id    lon    lat  varA
0   1  20.11  19.88   350   # changed: 100 --> 350 
1   2  20.87  18.65    90
2   3  18.99  20.75   120

In [173]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[173]:
  placeid    lon    lat   lat2   lon2        dist
0       a  18.75  20.77  19.88  20.11  172.924852
1       b  19.77  22.56  19.88  20.11  300.078600
3       d  17.55  20.74  19.88  20.11  283.565975
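The steps above handle a single id; they generalize to every row of d1 at once with a full cross join (a sketch: the 'cross' merge requires pandas >= 1.2, and the 'cross' and 'count' names are mine):

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

d1 = pd.DataFrame({'id': [1, 2, 3],
                   'lon': [20.11, 20.87, 18.99],
                   'lat': [19.88, 18.65, 20.75],
                   'varA': [100, 90, 120]})
d2 = pd.DataFrame({'placeid': list('abcd'),
                   'lon': [18.75, 19.77, 20.86, 17.55],
                   'lat': [20.77, 22.56, 23.76, 20.74]})

# every (d1, d2) pair in one frame; overlapping d2 columns get the suffix '2'
cross = d1.merge(d2, how='cross', suffixes=['', '2'])
cross['dist'] = haversine(cross['lat'], cross['lon'], cross['lat2'], cross['lon2'])

# per d1 place, count the d2 places closer than that place's varA
d1['count'] = (cross['dist'] < cross['varA']).groupby(cross['id']).sum().values
```

With varA = 100 for id 1, all four distances (roughly 173-438 km) exceed the threshold, so its count is 0, matching the empty filter result above.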

Use scipy.spatial.distance.cdist with your custom distance function as metric:

h = lambda u, v: haversine(u[1], u[0], v[1], v[0])  # u, v arrive as plain (lon, lat) arrays
dist_mtx = scipy.spatial.distance.cdist(dataset1[['lon', 'lat']],
                                        dataset2[['lon', 'lat']], metric=h)

Then, to count the places within each radius, compare the matrix against varA:

dataset1['count'] = np.sum(dataset1['varA'].values[:, None] > dist_mtx, axis=1)

Source: https://habr.com/ru/post/1684163/

