Using Python 3.3 and Pandas 0.10
I have a DataFrame that is built by concatenating multiple CSV files. First, I filter out all rows whose Name value contains a specific string. The result looks something like this (shortened for brevity; there are actually more columns):
Name  ID
'A'   1
'B'   2
'C'   3
'C'   3
'E'   4
'F'   4
...   ...
Now my problem is a special case of removing duplicate values. I want to remove all rows with a duplicated ID where the Name values associated with that ID are not identical. In the example above, I would like to keep the rows with IDs 1, 2, and 3. For ID = 4 the Name values are not equal, so I want to delete those rows.
I tried the following line of code (based on the answer here: Python Pandas: delete entries based on the number of occurrences).

The code:
df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1]
However, this gives me an error: ValueError: Item wrong length 51906 instead of 109565!
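The two lengths in the error message point at the cause: groupby(...).apply() here collapses each group to a single value, so the result has one entry per unique ID (51906 groups), while boolean indexing needs a mask as long as the whole frame (109565 rows). A minimal sketch of the mismatch, using a toy frame that mirrors the example above (an assumption, not the real data):

```python
import pandas as pd

# Toy frame mirroring the example above (an assumption, not the real data)
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'C', 'E', 'F'],
                   'ID':   [1, 2, 3, 3, 4, 4]})

# apply() reduces each group to one scalar, so the result is indexed
# by ID and has one entry per group (4 here), not one per row (6)
per_group = df.groupby('ID').apply(lambda g: len(set(g['Name'])))
print(len(per_group), len(df))  # 4 vs 6: lengths differ, so masking df fails
```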
Edit:
Edit: instead of apply() I also tried transform(), but that gives me a different error: AttributeError: 'int' object has no attribute 'ndim'. An explanation of why the error differs between the two functions would be greatly appreciated!
In addition, I want to keep all rows where ID = 3 in the example above, not just one of the duplicates.
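For what it's worth, one way to get a mask of the right length is to call transform() on the Name column alone: transform() broadcasts the per-group result back to every row, so the comparison yields a full-length boolean mask that keeps both duplicate rows. A minimal sketch, again assuming a toy frame like the example above:

```python
import pandas as pd

# Toy frame mirroring the example above (an assumption, not the real data)
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'C', 'E', 'F'],
                   'ID':   [1, 2, 3, 3, 4, 4]})

# transform() returns a result aligned with df's index, so the
# comparison produces a boolean mask as long as df itself
mask = df.groupby('ID')['Name'].transform(lambda s: s.nunique()) == 1
result = df[mask]
print(result['ID'].tolist())  # [1, 2, 3, 3]: IDs 1-3 kept (both ID-3 rows), ID 4 dropped
```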
Thanks in advance, Matthijs