Pandas - Threshold and Duplicate Group

I have the following data:

userid itemid
  1       1
  1       1
  1       3
  1       4
  2       1
  2       2
  2       3

I want to remove every user who viewed the same item two or more times. For example, userid = 1 looked at itemid = 1 twice, so I want to delete all rows for userid = 1. Since userid = 2 never viewed the same item twice, the rows for userid = 2 should stay as they are.

So I want my data to look like this:

userid itemid
  2       1
  2       2
  2       3

Can someone help me?

import pandas as pd    
df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
                   'itemid':[1,1,3,4, 1,2,3] })
+4
5 answers

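You can use duplicated to flag the row-level duplicates, then perform a groupby on 'userid' to determine which 'userid' values contain a duplicate, and drop those rows accordingly.

To drop users without a threshold: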

df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]

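With a threshold, pass keep=False to duplicated so that every occurrence of a repeated pair is flagged, then sum the Boolean column per user and compare the sum against the threshold. For example, with a threshold of 3: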

df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]

The resulting output without a threshold:

   userid  itemid
4       2       1
5       2       2
6       2       3
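
Note that summing the keep=False flags counts all of a user's repeated rows together, so a user with two different items viewed twice each would also reach a sum of 3 or more. If the threshold is meant per item, a possible variant is to count each (userid, itemid) pair and take the largest count per user. This is only a sketch; THRESHOLD, pair_counts and max_per_user are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

# Assumed meaning: drop a user once any single item is viewed this many times
THRESHOLD = 2

# How often each row's (userid, itemid) pair occurs, aligned to df's index
pair_counts = df.groupby(['userid', 'itemid'])['itemid'].transform('size')

# Largest pair count for each user, again aligned to df's index
max_per_user = pair_counts.groupby(df['userid']).transform('max')

# Keep only the users whose most-viewed item stays below the threshold
result = df[max_per_user < THRESHOLD]
print(result)
```

With the sample data this keeps only the rows of userid 2.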
+6

filter

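groupby(...).filter takes a function that is called once per group and returns a Boolean; only the groups for which it returns True are kept.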

filter and value_counts

df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique
This is the special case where the threshold is n < 2, i.e. no item may repeat:

df.groupby('userid').filter(lambda x: x.itemid.is_unique)

   userid  itemid
4       2       1
5       2       2
6       2       3
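
If the lambda turns out to be slow on many groups, the is_unique case can probably also be expressed with transform, by comparing the number of distinct items per user with the number of rows per user. A minimal sketch, where grouped and no_repeats are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

grouped = df.groupby('userid')['itemid']

# A user has no repeated item exactly when the number of distinct items
# equals the number of rows for that user
no_repeats = grouped.transform('nunique') == grouped.transform('size')

result = df[no_repeats]
print(result)
```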
+4

Group by userid and itemid and count the views of each pair:

views = df.groupby(['userid','itemid'])['itemid'].count()
#userid  itemid
#1       1         2 <=== The offending row
#        3         1
#        4         1
#2       1         1
#        2         1
#        3         1
#Name: itemid, dtype: int64

Now, find the users who never reached the threshold for any item:

THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1    False
#2     True
#dtype: bool

Combine the "viewed" flags with the original dataframe and keep the rows where the flag is True:

combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
#   userid  itemid
#4       2       1
#5       2       2
#6       2       3
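
The merge step could arguably be skipped by mapping each row's userid onto the per-user flag. The sketch below assumes the same THRESHOLD as above; keep_user is a name made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

THRESHOLD = 2

# Count how many times each (userid, itemid) pair occurs
views = df.groupby(['userid', 'itemid']).size()

# One Boolean per user: True if no item reached the threshold
keep_user = views.groupby(level='userid').max() < THRESHOLD

# Look up the flag for each row's userid and use it as a row mask
result = df[df['userid'].map(keep_user)]
print(result)
```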
+3
# Group by userid and itemid and count the rows in each pair
df2 = df.groupby(by=['userid','itemid']).size().reset_index(name='count')
# Keep only users whose maximum userid-itemid count is less than 2.
# (.ix was removed from pandas, so the count column is selected by name.)
df2 = df2[~df2.userid.isin(df2[df2['count']>1]['userid'])][df.columns]
print(df2)
   userid  itemid
3       2       1
4       2       2
5       2       3

To use a different cutoff, change the condition to:

df2['count']>threshold
+2

I do not know whether pandas has a built-in function for this task, but here is a workaround that solves your problem.

Here is the complete code.

import pandas as pd
dictionary = {'userid':[1,1,1,1,2,2,2],
              'itemid':[1,1,3,4,1,2,3]}

df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])

selected_user = []

for user in df['userid'].drop_duplicates().tolist():

    items = df.loc[df['userid'] == user, 'itemid'].tolist()
    # Keep the user only if none of their items is repeated
    if len(items) == len(set(items)):
        selected_user.append(user)

result = df.loc[(df['userid'].isin(selected_user))]

This code will produce the following result.

   userid  itemid
4       2       1
5       2       2
6       2       3

Hope this helps.
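
For comparison, the same selection as the loop above can likely be written without iterating over users, using duplicated and isin. This is a sketch rather than a drop-in replacement; bad_users is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

# Users that have at least one repeated (userid, itemid) pair
bad_users = df.loc[df.duplicated(['userid', 'itemid']), 'userid'].unique()

# Keep every row whose user is not in that set
result = df[~df['userid'].isin(bad_users)]
print(result)
```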

0

Source: https://habr.com/ru/post/1676042/

