Pandas: combining adjacent duplicate rows

I have a dataframe

     ID  url           date        active_seconds
    111  vk.com        12.01.2016   5
    111  facebook.com  12.01.2016   4
    111  facebook.com  12.01.2016   3
    111  twitter.com   12.01.2016  12
    222  vk.com        12.01.2016   8
    222  twitter.com   12.01.2016  34
    111  facebook.com  12.01.2016   5

and I need to get

     ID  url           date        active_seconds
    111  vk.com        12.01.2016   5
    111  facebook.com  12.01.2016   7
    111  twitter.com   12.01.2016  12
    222  vk.com        12.01.2016   8
    222  twitter.com   12.01.2016  34
    111  facebook.com  12.01.2016   5

If I try

 df.groupby(['ID', 'url'])['active_seconds'].sum() 

it merges all rows with the same ID and url, not just adjacent ones. How do I get the desired result?
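To see the problem concretely, here is a minimal reproduction of the question's frame (a sketch; the column names are taken from the question). A plain groupby folds the last facebook.com row into the earlier facebook.com pair, even though another site was visited in between:

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'ID': [111, 111, 111, 111, 222, 222, 111],
    'url': ['vk.com', 'facebook.com', 'facebook.com', 'twitter.com',
            'vk.com', 'twitter.com', 'facebook.com'],
    'date': ['12.01.2016'] * 7,
    'active_seconds': [5, 4, 3, 12, 8, 34, 5],
})

# All three facebook.com rows are merged: 4 + 3 + 5 = 12,
# instead of the desired 7 and a separate 5.
print(df.groupby(['ID', 'url'])['active_seconds'].sum())
```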

  • (s != s.shift()).cumsum() is a typical way of identifying groups of adjacent identical values
  • pd.DataFrame.assign is a convenient way to add a new column to a copy of the data and chain additional methods
  • pivot_table lets us reshape the table and aggregate
  • passing the pivot_table arguments through *args is a style preference of mine to keep the code clean
  • two reset_index calls to clean up and get the final result
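The run-identifier idiom from the first bullet is easiest to see on a toy Series (my illustration, not part of the original answer):

```python
import pandas as pd

# Each comparison with the shifted Series is True exactly where a new
# run of equal values starts; cumsum turns those starts into run ids.
s = pd.Series(['a', 'a', 'b', 'a', 'a'])
g = (s != s.shift()).cumsum()
print(g.tolist())  # [1, 1, 2, 3, 3] -- the two runs of 'a' get different ids
```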

    args = ('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum')

    df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table(*args) \
      .reset_index([1, 2, 3]).reset_index(drop=True)

        ID           url        date  active_seconds
    0  111  facebook.com  12.01.2016               7
    1  111   twitter.com  12.01.2016              12
    2  111        vk.com  12.01.2016               5
    3  222   twitter.com  12.01.2016              34
    4  222        vk.com  12.01.2016               8
    5  111  facebook.com  12.01.2016               5

Solution 1 - cumsum only on the url column:

You need to group by a custom Series created by cumsum of a boolean mask; the url column then has to be aggregated with first. Then remove the url level with reset_index, and finally restore the original column order with reindex:

    g = (df.url != df.url.shift()).cumsum()
    #another solution with ne
    #g = df.url.ne(df.url.shift()).cumsum()
    print (g)
    0    1
    1    2
    2    2
    3    3
    4    4
    5    5
    6    6
    Name: url, dtype: int32

    print (df.groupby([df.ID, df.date, g], sort=False)
             .agg({'active_seconds':'sum', 'url':'first'})
             .reset_index(level='url', drop=True)
             .reset_index()
             .reindex(columns=df.columns))

        ID           url        date  active_seconds
    0  111        vk.com  12.01.2016               5
    1  111  facebook.com  12.01.2016               7
    2  111   twitter.com  12.01.2016              12
    3  222        vk.com  12.01.2016               8
    4  222   twitter.com  12.01.2016              34
    5  111  facebook.com  12.01.2016               5

    g = (df.url != df.url.shift()).cumsum().rename('tmp')
    print (g)
    0    1
    1    2
    2    2
    3    3
    4    4
    5    5
    6    6
    Name: tmp, dtype: int32

    print (df.groupby([df.ID, df.url, df.date, g], sort=False)['active_seconds']
             .sum()
             .reset_index(level='tmp', drop=True)
             .reset_index())

        ID           url        date  active_seconds
    0  111        vk.com  12.01.2016               5
    1  111  facebook.com  12.01.2016               7
    2  111   twitter.com  12.01.2016              12
    3  222        vk.com  12.01.2016               8
    4  222   twitter.com  12.01.2016              34
    5  111  facebook.com  12.01.2016               5

Solution 2 - cumsum over the ID and url columns:

    g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
    print (g)
       ID  url
    0   1    1
    1   1    2
    2   1    2
    3   1    3
    4   2    4
    5   2    5
    6   3    6

    print (df.groupby([g.ID, df.date, g.url], sort=False)
             .agg({'active_seconds':'sum', 'url':'first'})
             .reset_index(level='url', drop=True)
             .reset_index()
             .reindex(columns=df.columns))

        ID           url        date  active_seconds
    0    1        vk.com  12.01.2016               5
    1    1  facebook.com  12.01.2016               7
    2    1   twitter.com  12.01.2016              12
    3    2        vk.com  12.01.2016               8
    4    2   twitter.com  12.01.2016              34
    5    3  facebook.com  12.01.2016               5

Notice that the ID column now holds the helper group numbers instead of the original IDs. The fix is to also group by the original df.ID and df.url columns, which requires renaming the columns of the helper DataFrame:

    g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
    g.columns = g.columns + '1'
    print (g)
       ID1  url1
    0    1     1
    1    1     2
    2    1     2
    3    1     3
    4    2     4
    5    2     5
    6    3     6

    print (df.groupby([df.ID, df.url, df.date, g.ID1, g.url1], sort=False)['active_seconds']
             .sum()
             .reset_index(level=['ID1','url1'], drop=True)
             .reset_index())

        ID           url        date  active_seconds
    0  111        vk.com  12.01.2016               5
    1  111  facebook.com  12.01.2016               7
    2  111   twitter.com  12.01.2016              12
    3  222        vk.com  12.01.2016               8
    4  222   twitter.com  12.01.2016              34
    5  111  facebook.com  12.01.2016               5
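A more compact variant of Solution 2 (my sketch, not from the original answers) collapses the two helper columns into a single run id with any(axis=1), so only one grouping key is needed and the original values are kept with first:

```python
import pandas as pd

# The question's example frame
df = pd.DataFrame({
    'ID': [111, 111, 111, 111, 222, 222, 111],
    'url': ['vk.com', 'facebook.com', 'facebook.com', 'twitter.com',
            'vk.com', 'twitter.com', 'facebook.com'],
    'date': ['12.01.2016'] * 7,
    'active_seconds': [5, 4, 3, 12, 8, 34, 5],
})

# One run id per block of adjacent rows sharing both ID and url:
# a new run starts whenever either key column changes.
run = (df[['ID', 'url']] != df[['ID', 'url']].shift()).any(axis=1).cumsum()

out = (df.groupby(run, sort=False)
         .agg({'ID': 'first', 'url': 'first', 'date': 'first',
               'active_seconds': 'sum'})
         .reset_index(drop=True))
print(out)
```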

Timings

The solutions give similar results, but pivot_table is slower than groupby:

    In [180]: %timeit (df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum').reset_index([1, 2, 3]).reset_index(drop=True))
    100 loops, best of 3: 5.02 ms per loop

    In [181]: %timeit (df.groupby([df.ID, df.url, df.date, (df.url != df.url.shift()).cumsum().rename('tmp')], sort=False)['active_seconds'].sum().reset_index(level='tmp', drop=True).reset_index())
    100 loops, best of 3: 3.62 ms per loop

It looks like you want cumsum():

    In [195]: df.groupby(['ID', 'url'])['active_seconds'].cumsum()
    Out[195]:
    0     5
    1     4
    2     7
    3    12
    4     8
    5    34
    6    12
    Name: active_seconds, dtype: int64
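Note that cumsum() keeps every row and produces running totals per (ID, url), so it answers a slightly different question than collapsing adjacent duplicates. A quick check on a two-row slice (my sketch) makes the difference visible:

```python
import pandas as pd

# Two adjacent facebook.com rows from the question
df = pd.DataFrame({'ID': [111, 111],
                   'url': ['facebook.com', 'facebook.com'],
                   'active_seconds': [4, 3]})

# cumsum yields a running total per group but does not merge the rows
running = df.groupby(['ID', 'url'])['active_seconds'].cumsum()
print(running.tolist())  # [4, 7] -- still two rows, not one row with 7
```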

Source: https://habr.com/ru/post/1262737/

