Pandas - remove duplicate rows except one with the highest value from another column

I have a large DataFrame (over 100 columns and several hundred thousand rows) containing rows with duplicate data. I am trying to remove the duplicate rows, keeping the row with the largest value in another column.

Essentially, the data is bucketed by time period, so a lot of duplication across periods is expected, since most entities exist in all time periods. However, the same entity must not appear more than once within a given period.

I tried the approach from python pandas: remove duplicates by columns A, keeping the row with the highest value in column B, on a subset of the data, with a plan to recombine the result with the original DataFrame, df.

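For reference, the approach in that question boils down to sorting so the largest liq comes first within each group and then dropping duplicates. A minimal sketch applied to the columns here (assuming a pandas version with sort_values and the subset/keep keywords):

deduped = (df.sort_values('liq', ascending=False)          # largest liq first within each group
             .drop_duplicates(subset=['unique_id', 'period_id'], keep='first'))
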
An example of a subset of data:

              unique_id   period_id   liq
index                                   
19            CAN00CE0     199001  0.017610
1903          **USA07WG0** 199001  1.726374
12404         **USA07WG0** 199001  0.090525
13330         USA08DE0     199001  1.397143
14090         USA04U80     199001  2.000716
12404         USA07WG0     199002  0.090525
13330         USA08DE0     199002  1.397143
14090         USA04U80     199002  2.000716

In the above example, I would like to keep the first instance (since liq is higher, at 1.73) and drop the second instance (liq is lower, at 0.09). Note that there may be more than two duplicates for a given unique_id / period_id combination.

I tried this, but it was very slow (I killed it after more than 5 minutes):

def h(x):
    x = x.dropna()  # idxmax fails on NaNs, and I'm happy to drop rows where liq is NaN
    return x.loc[x.liq.idxmax()]

df.groupby(['unique_id', 'period_id'], group_keys=False).apply(h)

In the end, I did the following, which is more verbose and ugly and simply throws away all but one duplicate (not necessarily the one with the largest liq), but it is also very slow! Given the speed of other operations of similar complexity, I thought I would ask here for the best solution.

Note the reset_index/set_index juggling needed to get back to the original index:

def do_remove_duplicates(df):
    sub_df = df[['period_id', 'unique_id']]
    grp = sub_df.groupby(['period_id', 'unique_id'], as_index=False)
    cln = grp.apply(lambda x: x.drop_duplicates(subset='unique_id'))  # apply drop_duplicates. This line is the slow bit!
    cln = cln.reset_index()                 # remove the extra index levels that groupby/apply added
    del cln['level_0']                      # remove the extra index levels that groupby/apply added
    cln.set_index('level_1', inplace=True)  # set the index back to the original (same as df)
    df_cln = cln.join(df, how='left', rsuffix='_right')  # join the cleaned frame back to the original, discarding the duplicate rows via the left join
    return df_cln

Two ways to go here:

  • Update liq to the group-wise maximum, then de-duplicate.
  • Keep only the rows where liq equals the group-wise maximum.

The first, using transform and update:

In [11]: g = df.groupby(["unique_id", "period_id"], as_index=False)

In [12]: g.transform("max")
Out[12]:
            liq
index
19     0.017610
1903   1.726374
12404  1.726374
13330  1.397143
14090  2.000716
12404  0.090525
13330  1.397143
14090  2.000716

In [13]: df.update(g.transform("max"))

In [14]: g.nth(0)
Out[14]:
          unique_id  period_id       liq
index
19         CAN00CE0     199001  0.017610
1903   **USA07WG0**     199001  1.726374
13330      USA08DE0     199001  1.397143
14090      USA04U80     199001  2.000716
12404      USA07WG0     199002  0.090525
13330      USA08DE0     199002  1.397143
14090      USA04U80     199002  2.000716

Note: first and last take the first/last non-null value in each column, so they can mix values from different rows; nth(0) returns the actual first row of each group.

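A quick illustration of that difference (a sketch; the exact output shape varies with the pandas version):

import pandas as pd
import numpy as np

d = pd.DataFrame({"g": ["a", "a"], "x": [np.nan, 1.0], "y": [2.0, np.nan]})

d.groupby("g").first()  # x = 1.0, y = 2.0: non-null values pulled from different rows
d.groupby("g").nth(0)   # x = NaN, y = 2.0: the actual first row of the group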

Alternatively, if you only want the rows where liq is the group max:

(df[df["liq"] == g["liq"].transform("max")]  # keep only the max-liq rows
 .groupby(["unique_id", "period_id"])
 .nth(0))
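
Another possibility, assuming the index is unique (reset_index first if it is not) and liq contains no NaNs, is an idxmax-based variant; a sketch:

idx = df.groupby(["unique_id", "period_id"])["liq"].idxmax()  # index label of the max-liq row per group
result = df.loc[idx]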

Source: https://habr.com/ru/post/1621315/

