Pandas: filtering a DataFrame using groupby and function

Using Python 3.3 and Pandas 0.10

I have a DataFrame that is built by concatenating multiple CSV files. First, I filter out all the values in the Name column that contain a specific string. The result looks something like this (shortened for brevity; there are actually more columns):

    Name  ID
    'A'    1
    'B'    2
    'C'    3
    'C'    3
    'E'    4
    'F'    4
    ...  ...

Now my problem is a special case of removing duplicates. I want to remove all rows (the whole row, not just the ID) whose ID is duplicated but whose corresponding Name values are not identical. In the example above, I would like to keep the rows with IDs 1, 2, and 3. For ID = 4 the Name values differ, so I want to delete those rows.
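For reference, a minimal sketch reproducing the sample above (column types are my assumption; the real frame has more columns):

    import pandas as pd

    # Hypothetical reconstruction of the example data
    df = pd.DataFrame({'Name': ['A', 'B', 'C', 'C', 'E', 'F'],
                       'ID':   [1, 2, 3, 3, 4, 4]})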

I tried using the following line of code (based on the suggestion here: Python Pandas: delete entries based on the number of occurrences).

The code:

 df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1] 

However, this gives me an error: ValueError: Item wrong length 51906 instead of 109565!

Edit:

Instead of using apply() I also tried transform(), but that gives me a different error: AttributeError: 'int' object has no attribute 'ndim'. An explanation of why the error differs between the two functions would be greatly appreciated!

In addition, I want to keep all the rows where ID = 3 in the example above (i.e. both duplicate rows).

Thanks in advance, Matthijs

2 answers

Instead of the length, I think you want to consider the number of unique Name values in each group. Use nunique() and check out this neat recipe for filtering groups.

 df[df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')] 
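To see the recipe end to end, here is a minimal sketch continuing from the hypothetical df constructed in the question (run on a modern pandas; the .astype('bool') is the part that works around the 0.10 transform quirk):

    # transform broadcasts the per-group result back onto every row of the
    # group, so the resulting mask is aligned with df's index
    mask = df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')
    print(df[mask])  # keeps IDs 1 and 2, plus both ID = 3 rows

This alignment is also why the two attempts in the question fail differently: apply returns one value per group, so the result has one entry per unique ID (51906 of them) and cannot mask the 109565 rows, hence the ValueError; transform has to broadcast the returned scalar over the group, and the 0.10-era code apparently expects something with an ndim attribute, hence the AttributeError on a plain int.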

If you upgrade to pandas 0.12, you can use the new filter method for groups, which makes this more concise and straightforward.

 df.groupby('ID').filter(lambda x: x.Name.nunique() == 1) 
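On the sample data, both variants keep IDs 1 and 2 together with both ID = 3 rows, which matches the edit in the question: filter drops or keeps whole groups based on the predicate, so there is no mask to build or align.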

General note: sometimes, of course, you really do want the length of the group, but I think size is a safer choice than len, which has tripped me up in some cases.


You can remove duplicates first:

    In [11]: df = df.drop_duplicates()

    In [12]: df
    Out[12]:
      Name  ID
    0    A   1
    1    B   2
    2    C   3
    4    E   4
    5    F   4

Then group by ID and keep only those groups with a single element:

    In [13]: g = df.groupby('ID')

    In [14]: size = (g.size() == 1)

    In [15]: size
    Out[15]:
    ID
    1     True
    2     True
    3     True
    4    False
    dtype: bool

    In [16]: size[size].index
    Out[16]: Int64Index([1, 2, 3], dtype=int64)

    In [17]: df['ID'].isin(size[size].index)
    Out[17]:
    0     True
    1     True
    2     True
    4    False
    5    False
    Name: ID, dtype: bool

And use this as a boolean index:

    In [18]: df[df['ID'].isin(size[size].index)]
    Out[18]:
      Name  ID
    0    A   1
    1    B   2
    2    C   3
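Note that drop_duplicates() keeps only one of the two ID = 3 rows, while the edit in the question asks to keep both. A sketch of one way around that (my own suggestion, not part of the original answer): compute the good IDs on the de-duplicated frame, but index the original one:

    import pandas as pd

    # Hypothetical orig_df: the frame as it was before de-duplication
    orig_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'C', 'E', 'F'],
                            'ID':   [1, 2, 3, 3, 4, 4]})

    deduped = orig_df.drop_duplicates()
    size = deduped.groupby('ID').size() == 1      # True for IDs with one distinct Name
    print(orig_df[orig_df['ID'].isin(size[size].index)])  # keeps both ID = 3 rows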
