Python: how to find item triplets visited by user triplets

I have a CSV file containing pairs of elements visited by users, for example:

user_id item_id
370 293
471 380
280 100
280 118
219 118
...

The list is long - 30M lines.

I need to find the triplets of items that three users visited (i.e. all three users visited all three elements). Such triplets are rare. An example of the result I'm trying to find:

user_id item_id
1  15
1  26
1  31
77 15
77 26
77 31
45 15
45 26
45 31

What is a good way to do this? I can use Pandas or any other library.

1 answer

You can use groupby with transform('size') to count the rows per user, and then filter by boolean indexing:

print (df)
    user_id  item_id
0         1       15
1         1       26
2         1       31
3        77       15
4        77       26
5        77       31
6        45       15
7        45       26
8        45       31
9       370      293
10      471      380
11      280      100
12      280      118
13      219      118
print (df.groupby('user_id')['item_id'].transform('size'))
0     3
1     3
2     3
3     3
4     3
5     3
6     3
7     3
8     3
9     1
10    1
11    2
12    2
13    1
Name: item_id, dtype: int64

print (df[df.groupby('user_id')['item_id'].transform('size') == 3])
   user_id  item_id
0        1       15
1        1       26
2        1       31
3       77       15
4       77       26
5       77       31
6       45       15
7       45       26
8       45       31
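
For reference, the steps above can be put together as a self-contained sketch. The DataFrame is built from the sample rows shown in the answer. The variable names visits_per_user and three_item_users are illustrative and not part of the original answer. Note that 'size' counts rows, so this assumes each (user_id, item_id) pair appears only once; if the file may contain repeated visits, calling df.drop_duplicates() first would be one way to handle that.

```python
import pandas as pd

# Sample rows copied from the example above.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 77, 77, 77, 45, 45, 45, 370, 471, 280, 280, 219],
    "item_id": [15, 26, 31, 15, 26, 31, 15, 26, 31, 293, 380, 100, 118, 118],
})

# Number of rows per user, broadcast back onto every row of that user
# (illustrative variable name, not from the original answer).
visits_per_user = df.groupby("user_id")["item_id"].transform("size")

# Boolean indexing: keep only the rows of users that have exactly three rows.
three_item_users = df[visits_per_user == 3]
print(three_item_users)
```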

The solution with filter works slower:

df = df.groupby('user_id').filter(lambda x: len(x.item_id) == 3)
print (df)
   user_id  item_id
0        1       15
1        1       26
2        1       31
3       77       15
4       77       26
5       77       31
6       45       15
7       45       26
8       45       31
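
Both filters only check that a user has exactly three rows. They do not check that those users visited the same three items, which the question also asks for. One possible follow-up step, sketched below, groups the remaining users by their item set and keeps the sets shared by at least three users. This is an illustration, not part of the original answer: the names items_by_user, users_by_triplet and shared_triplets are hypothetical, and the sketch keeps the answer's assumption that a qualifying user visited exactly three items (users with more items are ignored). A plain Python loop like this may be slow on 30M rows.

```python
from collections import defaultdict

import pandas as pd

# Sample rows copied from the example above.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 77, 77, 77, 45, 45, 45, 370, 471, 280, 280, 219],
    "item_id": [15, 26, 31, 15, 26, 31, 15, 26, 31, 293, 380, 100, 118, 118],
})

# Collect the distinct items of every user (repeated visits collapse in the set).
items_by_user = defaultdict(set)
for user, item in zip(df["user_id"], df["item_id"]):
    items_by_user[user].add(item)

# Group users by item set, considering only users with exactly three items.
users_by_triplet = defaultdict(list)
for user, items in items_by_user.items():
    if len(items) == 3:
        users_by_triplet[frozenset(items)].append(user)

# Keep only the triplets that at least three users have in common.
shared_triplets = {
    triplet: users
    for triplet, users in users_by_triplet.items()
    if len(users) >= 3
}

for triplet, users in shared_triplets.items():
    print(sorted(triplet), users)
```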

Source: https://habr.com/ru/post/1671787/

