Pandas - return the first element in a dataframe, grouped by user

I have a lot of user / item / time data. I want to know which items were consumed first, second, etc., by all users.

My questions are: if I have a DataFrame that is already sorted by time (descending), will it remain sorted by default through the groupby? And how can I pull out the first two items consumed by each user, even if a user has consumed fewer than two items?

    import pandas as pd

    df = pd.DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
                       'user_id': [1, 2, 1, 1, 3, 1],
                       'time': range(6)})
    print(df)
    pd.get_dummies(df['item_id'])
    gp = df.groupby('user_id').head()
    print(gp)
    # Return item_id of first one installed in each case ??

This gives:

      item_id  time  user_id
    0       b     0        1
    1       b     1        2
    2       a     2        1
    3       c     3        1
    4       a     4        3
    5       b     5        1
              item_id  time  user_id
    user_id
    1       0       b     0        1
            2       a     2        1
            3       c     3        1
            5       b     5        1
    2       1       b     1        2
    3       4       a     4        3

Now I need to pull out the top two values of item_id for each user, something like this (keeping the user_id column is optional):

    user_id  order  item_id
    1        0      b
    1        1      a
    2        0      b
    3        0      a
2 answers

Here is the hack:

    In [74]: from numpy import arange; from pandas import Series

    In [75]: def nth_order(x, n):
       ....:     xn = x[:n]
       ....:     return xn.join(Series(arange(len(xn)), name='order', index=xn.index))
       ....:

    In [76]: df.groupby('user_id').apply(lambda x: nth_order(x, 2))
    Out[76]:
              item_id  time  user_id  order
    user_id
    1       0       b     0        1      0
            2       a     2        1      1
    2       1       b     1        2      0
    3       4       a     4        3      0

Note that you cannot just use arange(n) for the order column, because you may have a group where len(group) < 2, so len(x[:n]) != n in every case (as your question anticipates).

This is a feature of this particular kind of slicing in pandas: if you slice past the end, you simply get every row that exists (and there may be fewer than n rows). With iloc indexing this was not true in the pandas version used here: an exception was raised if you tried to slice past the end of the array. Newer pandas releases handle out-of-bounds iloc slices the same way as plain slicing.
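A minimal sketch of the slicing behaviour described above, using the question's data. The names `group`, `first_two` and `order` are only illustrative, and the code assumes a reasonably recent pandas rather than the version used in the session above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'user_id': [1, 2, 1, 1, 3, 1],
                   'time': range(6)})

n = 2
# User 2 has a single row, so this group is shorter than n.
group = df[df['user_id'] == 2]

# Plain slicing past the end does not raise; it returns the rows that exist.
first_two = group[:n]
print(len(first_two))  # 1

# Size the 'order' labels from the slice itself, not from n.
order = pd.Series(np.arange(len(first_two)), name='order',
                  index=first_two.index)
print(first_two.join(order))
```

Building the labels with `np.arange(n)` instead would fail for this group, since a length-2 array cannot be paired with a one-row index.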


You can do this directly with head (which returns the first n rows of each group):

    In [11]: g = df.groupby('user_id')

    In [12]: g.head(2)
    Out[12]:
              item_id  time  user_id
    user_id
    1       0       b     0        1
            2       a     2        1
    2       1       b     1        2
    3       4       a     4        3

Starting with 0.13 (IIRC), head is much faster than any apply-based solution (before that, calling head was essentially a fallback to .apply(lambda x: x.head())).

The implementation uses cumcount, so it is similar to PhilipCloud's solution.
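For reference, here is a rough sketch of how cumcount can be used directly to produce the table asked for in the question. It assumes a recent pandas (sort_values is not available in 0.13), and the names `top2` and `result` are made up for the example. The explicit sort is there because groupby keeps the existing row order within each group, so "first" follows whatever order the frame is in when it is grouped.

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'user_id': [1, 2, 1, 1, 3, 1],
                   'time': range(6)})

# Sort explicitly so that "first" means earliest time.
df = df.sort_values('time')

# Number each user's rows 0, 1, 2, ... in order of appearance.
df['order'] = df.groupby('user_id').cumcount()

# Keep the first two rows per user; a user with one row contributes one row.
top2 = df[df['order'] < 2]

# Arrange the columns as in the desired output.
result = top2.sort_values(['user_id', 'order'])
result = result[['user_id', 'order', 'item_id']].reset_index(drop=True)
print(result)
```

The rows kept here are the same ones that `df.groupby('user_id').head(2)` selects; the extra `order` column is what cumcount adds on top.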


Source: https://habr.com/ru/post/1498507/

