Computing distance between columns (longitude / latitude) is very slow

I have two separate data sets, df and df2; each has Longitude and Latitude columns. For every point in df2, I want to find the closest point in df, calculate the distance between them in km, and store that value in a new column in df2.

I came up with a solution, but keep in mind that df has over 700,000 rows and df2 has about 60,000 rows, so my solution takes far too long to compute. The only approach I could come up with is a double for loop:

from math import sin, cos, sqrt, atan2, radians

def compute_shortest_dist(df, df2):
    # array to store all closest distances
    shortest_dist = []

    # radius of earth in km (used for calculation)
    R = 6373.0
    for i in df2.index:
        # keeps track of current minimum distance (-1 means "not set yet")
        min_dist = -1

        # latitude and longitude from df2, converted to radians
        # (assumes the columns hold degrees)
        lat1 = radians(df2.loc[i, 'Latitude'])
        lon1 = radians(df2.loc[i, 'Longitude'])

        for j in df.index:

            # haversine formula: distance between the two points in km
            lat2 = radians(df.loc[j, 'Latitude'])
            lon2 = radians(df.loc[j, 'Longitude'])
            dlon = lon2 - lon1
            dlat = lat2 - lat1
            a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
            c = 2 * atan2(sqrt(a), sqrt(1 - a))
            distance = R * c

            # store new shortest distance
            if min_dist == -1 or distance < min_dist:
                min_dist = distance
        # append shortest distance to array
        shortest_dist.append(min_dist)
    return shortest_dist
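
For reference, the per-pair calculation inside the inner loop is the haversine formula. Below is a minimal standalone sketch of that step, assuming the coordinates are stored in degrees; the helper name haversine_km is made up for illustration and is not part of the code above.

```python
from math import sin, cos, sqrt, atan2, radians

EARTH_RADIUS_KM = 6373.0  # same radius as in the loop above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    # the trigonometric functions expect radians, so convert first
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * atan2(sqrt(a), sqrt(1 - a))
```

Calling this once per pair means roughly 700,000 x 60,000 Python-level calls, which is why the double loop is so slow.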

Is there a faster way to do this, for example with pandas?


You can vectorize the inner loop with numpy:

import numpy as np

def compute_shortest_dist(df, df2):
    # array to store all closest distances
    shortest_dist = []

    # radius of earth in km (used for calculation)
    R = 6373.0
    # np.sin / np.cos expect radians (assumes the columns hold degrees)
    lat1 = np.radians(df['Latitude'])
    lon1 = np.radians(df['Longitude'])
    for i in df2.index:
        # haversine distance in km from this df2 point to every point in df
        lat2 = np.radians(df2.loc[i, 'Latitude'])
        dlat = lat1 - lat2
        dlon = lon1 - np.radians(df2.loc[i, 'Longitude'])
        a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
        distance = 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

        # append shortest distance to array
        shortest_dist.append(distance.min())
    return shortest_dist
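
The function above still loops over df2 in Python. One possible next step, sketched here and not taken from the original answer, is to process df2 in small chunks and let numpy broadcasting compute a whole block of distances at once. The function name nearest_dist_km and the chunk parameter are hypothetical, and the sketch assumes the coordinates are in degrees.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same radius as in the code above


def nearest_dist_km(lat_ref, lon_ref, lat_query, lon_query, chunk=32):
    """For each query point, return the haversine distance in km to the
    nearest reference point. All inputs are 1-D sequences in degrees."""
    lat_ref = np.radians(np.asarray(lat_ref, dtype=float))
    lon_ref = np.radians(np.asarray(lon_ref, dtype=float))
    lat_query = np.radians(np.asarray(lat_query, dtype=float))
    lon_query = np.radians(np.asarray(lon_query, dtype=float))

    cos_ref = np.cos(lat_ref)  # computed once, reused for every chunk
    out = np.empty(lat_query.shape[0])
    for start in range(0, lat_query.shape[0], chunk):
        stop = start + chunk
        # column vectors, so broadcasting yields a (chunk, n_ref) matrix
        lat_q = lat_query[start:stop, None]
        lon_q = lon_query[start:stop, None]
        dlat = lat_ref[None, :] - lat_q
        dlon = lon_ref[None, :] - lon_q
        a = np.sin(dlat / 2) ** 2 + np.cos(lat_q) * cos_ref[None, :] * np.sin(dlon / 2) ** 2
        a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
        dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
        out[start:stop] = dist.min(axis=1)  # nearest reference point per query row
    return out
```

Usage would look like df2['dist_km'] = nearest_dist_km(df['Latitude'], df['Longitude'], df2['Latitude'], df2['Longitude']), where dist_km is an illustrative column name. The chunk size trades memory for fewer Python iterations: each intermediate array holds chunk x len(df) floats, so with 700,000 reference rows it should stay small.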

Source: https://habr.com/ru/post/1693398/

