How to find the closest match based on 2 keys from one data block to another?

I have 2 data frames I'm working with. One has many places and coordinates (longitude, latitude). The other is a meteorological dataset with data from meteorological stations around the world and their respective coordinates. I am trying to associate the nearest weather station with every place in my dataset. The names of the weather stations and the names of my places do not match.

I give up attempts to link them by the closest match in coordinates and do not know where to start.

I was thinking about some use

np.abs((location['latitude']-weather['latitude'])+(location['longitude']-weather['longitude'])

Examples of each

a place...

 Location Latitude Longitude Component \ A 39.463744 -76.119411 Active B 39.029252 -76.964251 Active C 33.626946 -85.969576 Active D 49.286337 10.567013 Active E 37.071777 -76.360785 Active 

weather...

  Station Code Station Name Latitude Longitude US1FLSL0019 PORT ST. LUCIE 4.0 NE 27.3237 -80.3111 US1TXTV0133 LAKEWAY 2.8 W 30.3597 -98.0252 USC00178998 WALTHAM 44.6917 -68.3475 USC00178998 WALTHAM 44.6917 -68.3475 USC00178998 WALTHAM 44.6917 -68.3475 

The result will be a new column in the location data frame with the name of the station, which is the closest match

However, I am not sure how to execute the loop to execute this. Any help would be greatly appreciated.

Thanks Scott

+5
source share
2 answers

Let's say you have a distance function dist that you want to minimize:

 def dist(lat1, long1, lat2, long2): return np.abs((lat1-lat2)+(long1-long2)) 

For a given location, you can find the nearest station as follows:

 lat = 39.463744 long = -76.119411 weather.apply( lambda row: dist(lat, long, row['Latitude'], row['Longitude']), axis=1) 

This will allow you to calculate the distance to all weather stations. Using idxmin , you can find the name of the nearest station:

 distances = weather.apply( lambda row: dist(lat, long, row['Latitude'], row['Longitude']), axis=1) weather.loc[distances.idxmin(), 'StationName'] 

Put all this in a function:

 def find_station(lat, long): distances = weather.apply( lambda row: dist(lat, long, row['Latitude'], row['Longitude']), axis=1) return weather.loc[distances.idxmin(), 'StationName'] 

Now you can get all the nearest stations by applying it to the locations dataframe:

 locations.apply( lambda row: find_station(row['Latitude'], row['Longitude']), axis=1) 

Output:

 0 WALTHAM 1 WALTHAM 2 PORTST.LUCIE 3 WALTHAM 4 PORTST.LUCIE 
+4
source

Therefore, I appreciate that this is a bit dirty, but I used something similar to matching genetic data between tables. It depends on the longitude and latitude of the location file, which is within 5 of the data on the weather file, but if necessary, they can be changed.

 rows=range(location.shape[0]) weath_rows = range(weather.shape[0]) for r in rows: lat = location.iloc[r,1] max_lat = lat +5 min_lat = lat -5 lon = location.iloc[r,2] max_lon = lon +5 min_lon = lon -5 for w in weath_rows: if (min_lat <= weather.iloc[w,2] <= max_lat) and (min_lon <= weather.iloc[w,3] <= max_lon): location['Station_Name'] = weather.iloc[w,1] 
0
source

Source: https://habr.com/ru/post/1247878/


All Articles