Sampling one record per unique value (pandas, Python)

I work with pandas DataFrames in Python and have a large DataFrame containing users and their data. Each user can have several rows. I want to sample one row per user. My current solution seems inefficient:

import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                 'B': ['B', 'B1', 'B2', 'B3','B4','B5'],
                 'C': ['C', 'C1', 'C2', 'C3','C4','C5'],
                 'D': ['D', 'D1', 'D2', 'D3','D4','D5'],
                 'E': ['E', 'E1', 'E2', 'E3','E4','E5']},
                 index=[0, 1, 2, 3,4,5])

df1
>>  B   C   D   E   User
0   B   C   D   E   user1
1   B1  C1  D1  E1  user1
2   B2  C2  D2  E2  user2
3   B3  C3  D3  E3  user3
4   B4  C4  D4  E4  user2
5   B5  C5  D5  E5  user3

userList = list(df1.User.unique())
userList
> ['user1', 'user2', 'user3']

I loop over the unique list of users, sample one row for each user, and collect the rows in a separate DataFrame:

usersSample = pd.DataFrame()  # empty DataFrame to collect the samples
for i in userList:
    # DataFrame.append was removed in pandas 2.0; pd.concat grows the frame the same way
    usersSample = pd.concat([usersSample, df1[df1.User == i].sample(1)])

> usersSample   
B   C   D   E   User
0   B   C   D   E   user1
4   B4  C4  D4  E4  user2
3   B3  C3  D3  E3  user3

Is there a more efficient way to achieve this? I would really like to: 1) avoid appending to the usersSample DataFrame, since a gradually growing object seriously hurts performance, and 2) avoid looping through the users one at a time. Is there a better way to sample one row per user?

+4

Group by User and sample one row from each group:

df1.groupby('User').apply(lambda df: df.sample(1))


To keep the original index instead of adding User as an extra index level, pass group_keys=False:

df1.groupby('User', group_keys=False).apply(lambda df: df.sample(1))
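
As a hedged alternative to the apply-based calls above: pandas 1.1 and later also provide a sample method directly on the groupby object, which avoids the lambda. The sketch below assumes such a pandas version; the name one_per_user and the random_state value are illustrative only and are not part of the original answer.

```python
import pandas as pd

# Same example data as in the question.
df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])

# Draw one random row from each User group (requires pandas >= 1.1).
# The rows keep their original index labels; random_state is set only
# so that repeated runs of this sketch give the same sample.
one_per_user = df1.groupby('User').sample(n=1, random_state=0)

print(one_per_user)
```

The result contains one row per user, with no extra index level to strip afterwards.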


+9

Shuffle the rows, then drop duplicate users so that only the first row per user remains:

df1.sample(frac=1).drop_duplicates(['User'])
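
The shuffle-then-deduplicate idea can be spelled out step by step. This is a minimal sketch on the question's example data; the names shuffled and one_per_user, the random_state value, and the final sort_index call are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Same example data as in the question.
df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])

# Step 1: frac=1 returns every row, in random order.
# random_state is set only to make this sketch reproducible.
shuffled = df1.sample(frac=1, random_state=0)

# Step 2: keep the first occurrence of each user in the shuffled order.
one_per_user = shuffled.drop_duplicates(subset='User')

# Optional: put the sampled rows back into their original row order.
one_per_user = one_per_user.sort_index()

print(one_per_user)
```

Because the shuffle is a uniform random permutation, each of a user's rows is equally likely to come first, so the row kept for that user is a uniform draw from that user's rows. The cost is shuffling the whole DataFrame once, with no Python-level loop.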
+3

df1_user_sample_one = df1.groupby('User').apply(lambda x: x.sample(1))

This uses DataFrame.groupby(...).apply with a lambda function to sample one row per user.

0

Source: https://habr.com/ru/post/1647985/

