Pandas is slow: want the first occurrence in a DataFrame

I have a DataFrame people. One of the columns in this DataFrame is place_id. I also have a DataFrame of places where one of the columns is place_id and the other is weather. For every person I am trying to find the corresponding weather. It is important to note that many people share the same place_id.

Currently my setup is this:

def place_id_to_weather(pid):
    return place_df[place_df['place_id'] == pid]['weather'].item() 

person_df['weather'] = person_df['place_id'].map(place_id_to_weather)

But it is unbearably slow, and I would like to speed it up. I suspect the speedup could come from the following:

Instead of returning place_df[...].item(), which searches the entire column for place_id == pid, returns a Series, and then grabs the first element of that Series, I really just want to stop the search in place_df as soon as the first match of place_df['place_id'] == pid is found. After that, there is no need to keep searching. How can I limit the search to only the first entry?

Are there other methods that could be used to speed this up? Some kind of join?
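For reference, the only literal short-circuit I can think of is a Python-level scan that stops with next(); it answers the "first entry" part, but it runs at interpreted-loop speed, so it only pays off when matches tend to appear early in place_df. A minimal sketch, assuming place_df has the place_id and weather columns described above:

def place_id_to_weather_first(pid):
    # itertuples() yields one namedtuple per row; next() stops at the first match
    # instead of scanning the whole column and building an intermediate DataFrame
    return next(row.weather for row in place_df.itertuples() if row.place_id == pid)

person_df['weather'] = person_df['place_id'].map(place_id_to_weather_first)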

+4
3 answers

It seems to me that you need drop_duplicates with merge. If place_id is the only column the two DataFrames have in common, you can omit the on parameter (this depends on the data; on='place_id' may be necessary):

df1 = place_df.drop_duplicates(['place_id'])
print (df1)

print (pd.merge(person_df, df1))
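If the two DataFrames happened to share more columns than just place_id, merge would use all of the common columns as keys, so spelling out the key is safer. A minimal sketch with the same frames, assuming place_id is the intended key:

df1 = place_df.drop_duplicates(['place_id'])
# join only on the place_id key, even if other column names overlap
person_df = pd.merge(person_df, df1, on='place_id')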

Sample data:

person_df = pd.DataFrame({'place_id':['s','d','f','s','d','f'],
                          'A':[4,5,6,7,8,9]})
print (person_df)
   A place_id
0  4        s
1  5        d
2  6        f
3  7        s
4  8        d
5  9        f

place_df = pd.DataFrame({'place_id':['s','d','f', 's','d','f'],
                         'weather':['y','e','r', 'h','u','i']})
print (place_df)
  place_id weather
0        s       y
1        d       e
2        f       r
3        s       h
4        d       u
5        f       i
def place_id_to_weather(pid):
    # for the first occurrence, take iloc[0]
    return place_df[place_df['place_id'] == pid]['weather'].iloc[0]

person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
print (person_df)
   A place_id weather
0  4        s       y
1  5        d       e
2  6        f       r
3  7        s       y
4  8        d       e
5  9        f       r

#keep='first' is the default, so it can be omitted
print (place_df.drop_duplicates(['place_id']))
  place_id weather
0        s       y
1        d       e
2        f       r

print (pd.merge(person_df, place_df.drop_duplicates(['place_id'])))
   A place_id weather
0  4        s       y
1  7        s       y
2  5        d       e
3  8        d       e
4  6        f       r
5  9        f       r
+2

map with a dict is fast, because a dict lookup is a hash lookup rather than a scan over the whole column. Build the dictionary once from the dataframe place_df:

person_df['weather'] = person_df['place_id'].map(dict(zip(place_df.place_id, place_df.weather)))
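Note that dict(zip(...)) keeps the last weather seen for each duplicated place_id. If you want the first occurrence, as in the question, one option (a sketch reusing place_df from above) is to drop the duplicates before building the mapping; map also accepts a Series and looks values up by its index:

# keep only the first weather per place_id, then map via an index lookup
first_weather = place_df.drop_duplicates('place_id').set_index('place_id')['weather']
person_df['weather'] = person_df['place_id'].map(first_weather)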
+1

You can use merge to perform the operation:

people = pd.DataFrame([['bob', 1], ['alice', 2], ['john', 3], ['paul', 2]], columns=['name', 'place'])

#    name  place
#0    bob      1
#1  alice      2
#2   john      3
#3   paul      2

weather = pd.DataFrame([[1, 'sun'], [2, 'rain'], [3, 'snow'], [1, 'rain']], columns=['place', 'weather'])

#   place weather
#0      1     sun
#1      2    rain
#2      3    snow
#3      1    rain

pd.merge(people, weather, on='place')

#    name  place weather
#0    bob      1     sun
#1    bob      1    rain
#2  alice      2    rain
#3   paul      2    rain
#4   john      3    snow

If you have several weather entries for the same place, you can use drop_duplicates; you then get the following result:

pd.merge(people, weather, on='place').drop_duplicates(subset=['name', 'place'])

#    name  place weather
#0    bob      1     sun
#2  alice      2    rain
#3   paul      2    rain
#4   john      3    snow
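You can also deduplicate the weather frame before merging rather than after, which avoids building the larger intermediate result when a place has many weather rows. A sketch with the same frames:

# drop duplicate places first, then merge - one weather row per person
pd.merge(people, weather.drop_duplicates(subset=['place']), on='place')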
0

Source: https://habr.com/ru/post/1657379/

