I have a large DataFrame (over 100 columns and several hundred thousand rows) in which a number of rows contain duplicate data. I am trying to remove the duplicate rows, keeping the row with the largest value in another column.
Essentially, I am sorting the data into separate bins based on a time period, so a lot of duplication across periods is to be expected, since most entities exist in all time periods. However, the same entity must not appear more than once within a given time period.
I tried the approach from "python pandas: remove duplicates by columns A, keeping the row with the highest value in column B" on a subset of the data, with the plan of recombining the result with the original DataFrame, df.
An example of a subset of data:
       unique_id     period_id  liq
index
19     CAN00CE0      199001     0.017610
1903   **USA07WG0**  199001     1.726374
12404  **USA07WG0**  199001     0.090525
13330  USA08DE0      199001     1.397143
14090  USA04U80      199001     2.000716
12404  USA07WG0      199002     0.090525
13330  USA08DE0      199002     1.397143
14090  USA04U80      199002     2.000716
In the above example, I would like to keep the first instance (its liq is higher, at 1.72) and drop the second instance (its liq is lower, at 0.09). Please note that there may be more than two duplicates for a given unique_id and period_id.
I tried the following groupby/apply approach, but it was very slow for me (I stopped it after more than 5 minutes):
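To make the duplication concrete, the sample above can be rebuilt and the offending rows listed with `DataFrame.duplicated`. This is only a minimal sketch: the name `sample` is hypothetical, the `**` markers are left out of the ids, and the column names are assumed to match the sample.

```python
import pandas as pd

# Hypothetical reconstruction of the sample above (without the ** markers).
sample = pd.DataFrame(
    {
        "unique_id": ["CAN00CE0", "USA07WG0", "USA07WG0", "USA08DE0", "USA04U80",
                      "USA07WG0", "USA08DE0", "USA04U80"],
        "period_id": [199001] * 5 + [199002] * 3,
        "liq": [0.017610, 1.726374, 0.090525, 1.397143, 2.000716,
                0.090525, 1.397143, 2.000716],
    },
    index=pd.Index([19, 1903, 12404, 13330, 14090, 12404, 13330, 14090],
                   name="index"),
)

# Flag every row whose (unique_id, period_id) pair occurs more than once.
dup_mask = sample.duplicated(subset=["unique_id", "period_id"], keep=False)

# Select by position, since the index labels themselves repeat across periods.
duplicates = sample.loc[dup_mask.to_numpy()]
print(duplicates)
```

Only the two USA07WG0 rows in period 199001 should be flagged; the same id in period 199002 is not a duplicate.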
def h(x):
    # Drop rows with missing values, then return the row with the largest liq
    x = x.dropna()
    return x.loc[x.liq.idxmax()]

df.groupby(['holt_unique_id', 'period_id'], group_keys=False).apply(h)
In the end, I did the following, which is more verbose and ugly and simply throws away all but one duplicate, but it is also very slow! Given the speed of other operations of similar complexity, I thought I would ask here for a better solution.
My attempt, which relies on reset_index/set_index:
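def do_remove_duplicates(df):
    # Work on the key columns only
    sub_df = df[['period_id', 'unique_id']]
    grp = sub_df.groupby(['period_id', 'unique_id'], as_index=False)
    # Keep one row per (period_id, unique_id) group
    cln = grp.apply(lambda x: x.drop_duplicates(subset='unique_id'))
    # Flatten the resulting MultiIndex and restore the original row index
    cln = cln.reset_index()
    del cln['level_0']
    cln.set_index('level_1', inplace=True)
    # Pull the remaining columns back in from the original frame
    df_cln = cln.join(df, how='left', rsuffix='_right')
    return df_cln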
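
One alternative that avoids `groupby(...).apply` altogether is to sort by liq and let `drop_duplicates` keep the first row per key pair. The following is a minimal sketch rather than a tested solution for the full data: `keep_max_liq` is a hypothetical helper name, the column names are assumed to match the sample above, and its speed on the full frame has not been measured here.

```python
import numpy as np
import pandas as pd


def keep_max_liq(df, keys=("unique_id", "period_id"), value="liq"):
    """Keep, for each key combination, the row with the largest `value`."""
    keys = list(keys)
    # Work on the key and value columns only, with a positional index,
    # so the wide frame is not sorted and repeated index labels do not matter.
    tmp = df[keys + [value]].reset_index(drop=True)
    # Largest value first (NaN sorts last), then keep the first row per key pair.
    winners = (
        tmp.sort_values(value, ascending=False, kind="stable")
        .drop_duplicates(subset=keys, keep="first")
        .index
    )
    # Select the surviving rows by position, in their original order.
    return df.iloc[np.sort(winners.to_numpy())]
```

Called as `keep_max_liq(df)`, this should return the original rows with all of their columns, minus the lower-liq duplicates. Rows with a missing liq are only kept when every row for that id and period is missing.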